The Fire Drill

A fire drill is a method of practicing the evacuation of a building for a fire or other emergency. Generally, the emergency system (usually an alarm) is activated and the building is evacuated as though a real fire had occurred. Usually, the time it takes to evacuate is measured to ensure that it occurs within a reasonable length of time, and problems with the emergency system or evacuation procedures are identified to be remedied.

http://en.wikipedia.org/wiki/Fire_drill

At The Conversation, we’ve been running fire drills recently: not to evacuate a building, but to see how long it takes us to rebuild our entire production environment. We take a new VPS and time how long it takes to get production running on it again.

Our process for doing this has evolved over the past few weeks, to the point where we can now run a single command to provision an entire VPS and restore from our database backups.

The first fire drill

The first test was to figure out if we could actually do it, without the help of our resident systems expert. So we hosed our staging environment, and told a random developer on the team to fix it. Despite all of our documentation and existing build scripts, it took nearly 2 days.

Our documentation claimed a record of 30 minutes to accomplish this feat, so we had failed miserably.

Houston… we have a problem

We use dollhouse and babushka to provision our servers. It’s great to be able to build a full stack on a remote machine by running one command. So when we sat down to run the fire drill, we were pretty confident that anyone in our team could get things going again.

So why did it take 2 days?

Along with our build scripts, we have a document that describes all of the manual processes required to truly build a new production environment:

How to spin up a new VM
Updating IP addresses in the dollhouse config file
Removing IP address from your known_hosts file
Updating the IP address for the remote server in your git config
How to find and restore a database backup
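
Two of the steps in that list are purely mechanical, which makes them easy to script. Here is a minimal sketch of how they could be generated, assuming the standard `ssh-keygen -R` and `git remote set-url` commands. The remote name `staging`, the `deploy` user, the repository path and the example IP address are all hypothetical, not taken from our setup.

```ruby
require "shellwords"

# Build the commands for the two mechanical steps: dropping the stale host key
# and repointing the git remote at the new server.
# The remote name, user and repository path are assumptions for illustration.
def rebuild_commands(ip, remote: "staging", user: "deploy", repo_path: "/srv/app.git")
  [
    # ssh-keygen -R removes every key belonging to the host from known_hosts.
    ["ssh-keygen", "-R", ip],
    # git remote set-url changes the URL of an existing remote.
    ["git", "remote", "set-url", remote, "#{user}@#{ip}:#{repo_path}"]
  ]
end

# Render the commands as shell-safe strings instead of running them,
# so the sketch can be tried without touching any real configuration.
def describe(commands)
  commands.map { |argv| Shellwords.join(argv) }
end
```

Calling `puts describe(rebuild_commands("203.0.113.10"))` prints the two commands for review; running them for real would be a matter of passing each array to `system(*argv)`.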

Like most documents, it was a little out of date, as were some of our babushka scripts.

What should have taken 30 minutes took 2 days of debugging documents and build scripts.

We re-ran the fire drill, fixing problems as they arose, until we were confident in our ability to rebuild the staging environment.

Then we handed it all over to another developer, to see if we had really nailed it.

We failed again

What happened? We thought we had the build nailed, but we didn’t. New pieces of infrastructure had been installed and, while scripted via babushka, they hadn’t been tested in conjunction with all of the other scripts.

This time it only took a day to get everything up to date. We were getting better at this, but we weren’t yet comfortable with what we had.

We worked hard on the build scripts, wrapping our manual steps up into another script so that we couldn’t forget a step or mistype something. We weren’t happy until we had one command to run, passing in an IP address and a target environment, to build a full server stack, restore from a database backup and start our application.

Our new build document is much cleaner:

Spin up a new VM
$ rake server:provision IP=xxx.xxx.xxx.xxx ENV=staging
Drink beer
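
A single command is only safe if it refuses to run with bad input. Below is a minimal sketch of the kind of check such a task might perform on its `IP` and `ENV` settings before touching a server. The function name and the list of accepted environments are assumptions for illustration; this is not the actual `server:provision` task.

```ruby
require "ipaddr"

# Environments the sketch accepts; this list is an assumption.
KNOWN_ENVIRONMENTS = %w[staging production].freeze

# Read and validate the IP=... and ENV=... settings.
# rake exposes NAME=value arguments through ENV, so this accepts ENV itself
# or any hash with the same string keys.
def provision_settings(env)
  ip = env["IP"].to_s
  environment = env["ENV"].to_s

  raise ArgumentError, "IP is required" if ip.empty?

  begin
    IPAddr.new(ip) # raises a subclass of IPAddr::Error for malformed addresses
  rescue IPAddr::Error
    raise ArgumentError, "IP must be a valid address, got #{ip.inspect}"
  end

  unless KNOWN_ENVIRONMENTS.include?(environment)
    raise ArgumentError, "ENV must be one of: #{KNOWN_ENVIRONMENTS.join(', ')}"
  end

  { ip: ip, environment: environment }
end
```

Inside a Rakefile, a task could call `provision_settings(ENV)` as its first line, so that a placeholder such as `xxx.xxx.xxx.xxx` or a mistyped environment fails immediately rather than halfway through a build.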

Our current setup isn’t without flaws, but we’ve learnt a lot in the past few weeks.

What did we learn?

Documentation gets out of date.
Build scripts get out of date.
Automate everything; manual steps get missed.
Fire drills should be run all the time, to make sure the build still works.

In the future, we’re hoping to set up a build on our CI server that will run nightly, completely rebuilding our staging environment from scratch.

Essentially we want an automated fire drill.
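
An automated drill needs a pass or fail answer, not just a green build. One possible shape for it is a small timer wrapped around the rebuild, sketched below. The names `run_fire_drill` and `DrillResult` are hypothetical, and the 30-minute default simply mirrors the figure our documentation claimed.

```ruby
# Outcome of one drill: how long it took and whether it met the target.
DrillResult = Struct.new(:seconds, :within_target)

# Time a rebuild, supplied as a block, against a target duration in seconds.
# A monotonic clock is used so system clock adjustments cannot skew the result.
def run_fire_drill(target_seconds: 30 * 60)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  DrillResult.new(elapsed, elapsed <= target_seconds)
end

# Example use from a nightly job (the command shown is the one from the
# build document; the IP is left for the caller to fill in):
#
#   result = run_fire_drill do
#     system("rake", "server:provision", "IP=#{ip}", "ENV=staging") or
#       raise "provisioning failed"
#   end
#   abort "Drill took #{result.seconds.round}s" unless result.within_target
```

Failing the job when the target is missed turns a slow rebuild into something the team hears about the next morning, rather than during a real emergency.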
