MessiandNeymar


Tuesday, March 20, 2012

Internet scale, administrations, and operations

Posted on 3:13 PM by Unknown

I happened across two different "Internet scale" articles over the weekend that are related and worth considering together.

The first one is a few years old, but it had been a while since I'd read it, and I happened to re-read it: On Designing and Deploying Internet-Scale Services. James Hamilton wrote this paper when he was part of the MSN and Windows Live teams at Microsoft, and in it he discusses a series of "lessons learned" about building and operating "Internet scale" systems.

The paper is rich with examples and details, rules of thumb, techniques, and approaches. Near the front of the paper, Hamilton distills three rules that he says he learned from Bill Hoffman:

  1. Expect failures. A component may crash or be stopped at any time. Dependent components might fail or be stopped at any time. There will be network failures. Disks will run out of space. Handle all failures gracefully.
  2. Keep things simple. Complexity breeds problems. Simple things are easier to get right. Avoid unnecessary dependencies. Installation should be simple. Failures on one server should have no impact on the rest of the data center.
  3. Automate everything. People make mistakes. People need sleep. People forget things. Automated processes are testable, fixable, and therefore ultimately much more reliable. Automate wherever possible.
You can read more about Hoffman's approach to reliable Internet scale systems in this ACM Queue article.
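The first rule, "expect failures," is the one that shapes code the most directly. A minimal sketch of what "handle all failures gracefully" can look like in practice is a retry loop with exponential backoff and jitter around an unreliable dependency. This is my own illustration, not code from Hamilton's paper; the `flaky` dependency is a stand-in for any network call that can fail transiently.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1):
    """Invoke an unreliable operation, retrying with exponential
    backoff plus jitter instead of assuming it will succeed."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # Back off exponentially, with jitter to avoid thundering herds.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

# A stand-in for a flaky network dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"

print(call_with_retries(flaky))  # succeeds on the third attempt
```

The point of the sketch is the posture, not the numbers: the caller assumes the dependency can fail at any time, bounds how long it will keep trying, and makes the eventual failure visible rather than swallowing it.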

These are three superb rules, even if they are hard to get right. Of course, the first step toward getting them right is to put them out there, in front of everyone, to think about.

More recently, you don't want to miss this great article by Jay Kreps, an engineer on the LinkedIn team: Getting Real About Distributed System Reliability. Again, it's just a treasure trove of lessons learned and techniques for designing and building reliability into your systems from the beginning.

I have come around to the view that the real core difficulty of these systems is operations, not architecture or design. Both are important but good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations. This is quite different from the view of unbreakable, self-healing, self-operating systems that I see being pitched by the more enthusiastic NoSQL hypesters.

Both articles remind me of the Chaos Monkey that the Netflix team uses in its development process.

One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
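The idea is simple enough to sketch in a few lines: pick a random slice of the running fleet and kill it, on purpose, during normal operation. The sketch below is purely illustrative and is not Netflix's implementation; `instances` and `terminate` are placeholders for whatever inventory listing and kill mechanism (e.g. a cloud provider API call) a real deployment would use.

```python
import random

def chaos_monkey(instances, kill_fraction=0.1, terminate=None):
    """Randomly select a fraction of running instances and terminate
    them, so failure handling is exercised constantly rather than
    only during real outages."""
    count = max(1, int(len(instances) * kill_fraction))
    victims = random.sample(instances, count)
    for instance in victims:
        terminate(instance)
    return victims

# Simulated fleet: "terminating" an instance just records the victim.
fleet = [f"i-{n:04d}" for n in range(20)]
killed = []
chaos_monkey(fleet, kill_fraction=0.1, terminate=killed.append)
print(f"killed {len(killed)} of {len(fleet)} instances")
```

Note that the interesting engineering is not in this loop at all; it is in everything around it, which must survive the loop running at any time.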

As Hamilton's article points out, the emphasis on failure handling and on failure testing builds on the decades-old work of Professors David Patterson and Armando Fox in their Recovery Oriented Computing and Crash-only Software efforts.

Crash-only programs crash safely and recover quickly. There is only one way to stop such software – by crashing it – and only one way to bring it up – by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users.
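A toy sketch can make the crash-only shape concrete. In the illustration below (my own invented names, not code from the Crash-only Software paper), the component has no orderly shutdown path: `start()` always runs the recovery path, and a transparent retry wrapper restarts a crashed component so the crash never reaches the caller.

```python
class CrashOnlyComponent:
    """A component with exactly one way down (crash) and one way up
    (recovery): there is no separate 'clean startup' path to get wrong."""
    def __init__(self):
        self.recoveries = 0
        self.alive = False

    def start(self):
        # Recovery IS startup, so the recovery path is exercised on
        # every launch, not just after rare failures.
        self.recoveries += 1
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise RuntimeError("component is down")
        return f"handled {request}"

def call(component, request, retries=3):
    """Transparent component-level retry: restart a crashed component
    and retry the request, hiding the crash from the end user."""
    for _ in range(retries):
        try:
            return component.handle(request)
        except RuntimeError:
            component.start()  # recover by (re)starting
    raise RuntimeError("component failed to recover")

c = CrashOnlyComponent()
print(call(c, "req-1"))   # first call triggers recovery-as-startup
c.alive = False           # simulate a crash
print(call(c, "req-2"))   # the retry hides the crash from the caller
```

Because recovery runs on every startup, it gets tested constantly, which is exactly the property the Chaos Monkey approach is designed to enforce from the outside.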

When these ideas were first proposed, they faced a certain amount of skepticism, but now, years later, they are accepted wisdom, and the stability and reliability of systems like Amazon, Netflix, and LinkedIn are a testament to the fact that these techniques do in fact work.

