Developing and managing Hotmail

This interview
has already been picked up and commented
upon (and /.’ed),
but if you have not yet taken a look, I recommend reading this ACM piece on
Hotmail, and what it means to manage one of the largest services on the web.
Hotmail runs on 10,000 servers, involves several petabytes of storage (i.e. millions of gigabytes) and serves, according to this Wikipedia article, 221M users who generate billions of e-mail transactions daily. It is
operated by 100 sysadmins, which is not that large a team.
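(Back-of-the-envelope: 10,000 servers split across 100 sysadmins comes to roughly 100 machines per administrator, and with 221M accounts that is on the order of two million mailboxes each.)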
Phil Smoot, the PM in charge of Hotmail product development out of the
Microsoft Silicon Valley campus, shares a number of insights — from which I
noted the following points regarding automation, versioning, capacity planning, impact analysis and QA:
QA is a challenge in the sense that mimicking Internet loads on our QA
lab machines is a hard engineering problem. The production site
consists of hundreds of services deployed over multiple years, and the
QA lab is relatively small, so re-creating a part of the environment or
a particular issue in the QA lab in a timely fashion is a hard problem.
Manageability is a challenge in that you want to keep your
administrative headcount flat as you scale out the number of machines. […] if you can manage five servers you should be able to manage tens of
thousands of servers and hundreds of thousands of servers just by
having everything fully automated — and that all the automation hooks
need to be built in the service from the get-go. Deployment of bits is
an example of code that needs to be automated. You don’t want your
administrators touching individual boxes making manual changes. But on
the other side, we have roll-out plans for deployment that smaller
services probably would not have to consider. For example, when we roll
out a new version of a service to the site, we don’t flip the whole
site at once.
We do some staging, where we’ll validate the new version on a server
and then roll it out to 10 servers and then to 100 servers and then to
1,000 servers — until we get it across the site. This leads to another
interesting problem, which is versioning: the notion that you have to
have multiple versions of software running across the sites at the same
time. That is, version N and N+1 clients need to be able to talk to
version N and N+1 servers and N and N+1 data formats. That problem
arises as you roll out new versions or as you try different
configurations or tunings across the site.
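To make the N and N+1 point concrete, here is a minimal sketch of version-tolerant message handling during such a roll-out. This is my illustration rather than Hotmail’s code; the JSON format, field names and version numbers are all invented:

    import json

    # Hypothetical wire formats that coexist during a roll-out:
    # version N   -> {"v": 1, "from": ..., "to": ..., "body": ...}
    # version N+1 -> {"v": 2, "from": ..., "to": ..., "body": ..., "priority": ...}

    def parse_message(raw: bytes) -> dict:
        """Accept both the old (v1) and new (v2) formats.

        During a staged roll-out, v1 clients keep talking to v2 servers
        (and vice versa), so the server never assumes a single format.
        """
        msg = json.loads(raw)
        version = msg.get("v", 1)        # missing field -> oldest known format
        if version == 1:
            msg["priority"] = "normal"   # fill in the field v1 never had
        elif version != 2:
            raise ValueError(f"unknown message version: {version}")
        return msg

    def serialize_for(peer_version: int, msg: dict) -> bytes:
        """Downgrade to v1 when the peer has not been upgraded yet."""
        if peer_version == 1:
            msg = {k: v for k, v in msg.items() if k != "priority"}
            msg["v"] = 1
        return json.dumps(msg).encode()
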
The big thing you think about is cost. How much is this new feature going to cost? A penny per user over hundreds of millions of users gets
expensive fast. Migration is something you spend more time thinking
about over lots of servers versus a few servers. For example, migrating
terabytes worth of data takes a long time and involves complex capacity
planning and data-center floor and power consumption issues. You also
do more up-front planning around how to go backwards if the new version
fails.
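(For scale: at the 221 million users cited above, even a one-cent-per-user feature works out to roughly $2.2 million.)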
We strive to build tools that can replay live-site transactions and real-type live-site loads against single nodes. The notion is that the
application itself is logging this data on the live site so that it can
be easily consumed in our QA labs. Then as applications bring in new
functionalities, we want to add these new transactions to the existing
test beds.
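That record-and-replay pattern might look roughly like the sketch below. Again, this is only an illustration of the idea, not Hotmail tooling, and the log format is invented:

    import json
    import time

    LOG_PATH = "live_site_transactions.log"   # hypothetical log produced on the live site

    def log_transaction(op: str, params: dict, log_path: str = LOG_PATH) -> None:
        """Called by the production application: append one transaction per line."""
        record = {"ts": time.time(), "op": op, "params": params}
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def replay(log_path: str, handler) -> None:
        """Called in the QA lab: feed the captured transactions to a single test node."""
        with open(log_path) as f:
            for line in f:
                record = json.loads(line)
                handler(record["op"], record["params"])

    # Example: capture one transaction, then replay the log against a local handler.
    if __name__ == "__main__":
        log_transaction("send_mail", {"user": "alice", "bytes": 2048})
        replay(LOG_PATH, lambda op, params: print(op, params))
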
The notion of tape backups is probably no longer feasible. Building systems where we’re just backing up changes — and backing them up to
cheap disks — is probably much more where we’re headed. How you can do
this in a disconnected fashion is an interesting problem. That is, how
are you going to protect the system from viruses and software and
administrative scripting bugs? What you’ll start to see is the emergence of the use of data
replicas and applying changes to those replicas, and ultimately the
requirement that these replicas be disconnected and reattached over
time.
As you go to, let’s say, a commodity model, you have to assume that
everything is going to fail underneath you, that you have to deal with
these failures, that all the data has to be replicated, and that the
system essentially self-heals. For example, if you are writing out
files, you put a checksum in place that you can verify when the file is
read. If it wasn’t correct, then go get the file somewhere else and repair the old file.
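A stripped-down version of that checksum-and-repair read path could look like the following sketch, with the hashing choice and replica locations invented for illustration:

    import hashlib
    import os
    from typing import List

    def write_with_checksum(path: str, data: bytes) -> None:
        """Store the payload together with a checksum of its contents."""
        digest = hashlib.sha256(data).hexdigest().encode()
        with open(path, "wb") as f:
            f.write(digest + b"\n" + data)

    def read_with_repair(path: str, replica_paths: List[str]) -> bytes:
        """Verify on read; if this copy is corrupt, fetch a good replica and repair it."""
        data = _read_if_valid(path)
        if data is not None:
            return data
        for replica in replica_paths:            # go get the file somewhere else
            data = _read_if_valid(replica)
            if data is not None:
                write_with_checksum(path, data)  # repair the old file
                return data
        raise IOError(f"no valid replica found for {path}")

    def _read_if_valid(path: str):
        """Return the payload only if its stored checksum still matches."""
        if not os.path.exists(path):
            return None
        with open(path, "rb") as f:
            digest, _, data = f.read().partition(b"\n")
        return data if hashlib.sha256(data).hexdigest().encode() == digest else None
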
Last word: If you rely on scale up, you’ll probably get killed. You should always be relying on scale out.