An excellent article from the super-smart Don MacAskill, CEO of SmugMug, on how they survived the Amazon Web Services outage, with a few details about how their system is designed. One little tidbit:
Once your system seems stable enough, start surprising your Ops and Engineering teams by killing stuff in the middle of the day without warning them. They’ll love you.
You know you’re confident in your system when you throw a dart at a board full of “kill” buttons. :-)
Don was one of the early folks to make a big commitment to AWS. He’s been through a lot with them and has learned a ton of useful things. Definitely worth a read!
Chaos Monkey, a name popularized by Netflix, is the term I’ve heard for a tool you write to randomly kill services throughout your application stack. The name says it all!
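To make the idea concrete, here is a rough sketch of what a tiny chaos monkey could look like in Python. This is not Netflix's or SmugMug's implementation. The function name `unleash_monkey`, the service names, and the `kill` callback are all hypothetical, and the "kill" here only prints so it is safe to run.

```python
import random
from typing import Callable, Optional, Sequence


def unleash_monkey(
    targets: Sequence[str],
    kill: Callable[[str], None],
    probability: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Maybe kill one randomly chosen target; return its name, or None."""
    rng = rng or random.Random()
    # Skip this round if there is nothing to kill or the dice say no.
    if not targets or rng.random() >= probability:
        return None
    victim = rng.choice(list(targets))
    kill(victim)  # hypothetical hook: swap in a real stop/terminate call
    return victim


if __name__ == "__main__":
    # Dry run with made-up service names; "killing" only prints.
    services = ["web-1", "web-2", "cache-1", "db-replica-1"]
    unleash_monkey(services, kill=lambda name: print(f"killing {name}"))
```

Passing the `kill` action in as a callback keeps the random selection separate from whatever actually stops a service, so you can rehearse with a harmless dry run before pointing it at anything real. The `probability` knob is one simple way to make the monkey strike only some of the time when it runs on a schedule.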