A fire drill is a method of practicing the evacuation of a building for a fire or other emergency. Generally, the emergency system (usually an alarm) is activated and the building is evacuated as though a real fire had occurred. Usually, the time it takes to evacuate is measured to ensure that it occurs within a reasonable length of time, and problems with the emergency system or evacuation procedures are identified to be remedied.
At The Conversation, we’ve been running fire drills recently: not to evacuate a building, but to see how long it takes us to rebuild our entire production environment. We take a new VPS and time how long it takes to get production running on it again.
Our process for doing this has evolved over the past few weeks, to the point where we can now run a single command to provision an entire VPS and restore from our database backups.
The first fire drill
The first test was to figure out if we could actually do it, without the help of our resident systems expert. So we hosed our staging environment, and told a random developer on the team to fix it. Despite all of our documentation and existing build scripts, it took nearly 2 days.
Our documentation claimed a record of 30 minutes to accomplish this feat, so we had failed miserably.
Houston… we have a problem
We use dollhouse and babushka to provision our servers. It’s great to be able to build a full stack on a remote machine by running one command. So when we sat down to run the fire drill, we were pretty confident that anyone in our team could get things going again.
So why did it take 2 days?
Along with our build scripts, we have a document that describes all of the manual processes required to truly build a new production environment:
- How to spin up a new VM
- Updating IP addresses in the dollhouse config file
- Removing the old IP address from your known_hosts file
- Updating the IP address for the remote server in your git config
- How to find and restore a database backup
Like most documents, it was a little out of date, as were some of our babushka scripts.
What should have taken 30 minutes took 2 days of debugging the document and the build scripts.
We re-ran the fire drill, fixing problems as they arose, until we were confident in our ability to rebuild the staging environment.
Then we handed it all over to another developer, to see if we had really nailed it.
We failed again
What happened? We thought we had the build nailed. I guess we didn’t. New pieces of infrastructure had been installed, and while they were scripted via babushka, they hadn’t been tested in conjunction with all of the other scripts.
This time it only took a day to get everything up to date. We were getting better at this, but we weren’t yet comfortable with what we had.
We worked hard on the build scripts, wrapping our manual steps up into another script so that we couldn’t forget a step or mistype something. We weren’t happy until we had one command to run, passing in an IP address and an environment name, that would build a full server stack, restore from a database backup and start our application.
Our new build document is much cleaner:
- Spin up a new VM
$ rake server:provision IP=xxx.xxx.xxx.xxx ENV=staging
- Drink beer
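As a rough illustration (not the actual build script), a task like server:provision could be sketched in Ruby along these lines. It folds the manual steps from the old document into one ordered plan. The script/provision and script/restore_backup helpers, the deploy@ git remote URL, the list of environments and the DRY_RUN switch are all hypothetical names invented for this sketch; only ssh-keygen -R and git remote set-url are standard commands.

```ruby
require "ipaddr"
require "shellwords"

# Assumption: the environments we know how to build.
ENVIRONMENTS = %w[staging production].freeze

# Validate inputs up front so a typo fails fast instead of half-building a server.
def validate_args(ip, environment)
  IPAddr.new(ip.to_s) # raises an ArgumentError subclass on a malformed address
  unless ENVIRONMENTS.include?(environment)
    raise ArgumentError, "unknown environment: #{environment.inspect}"
  end
  [ip.to_s, environment]
end

# Build the ordered list of shell commands for one rebuild.
def provision_plan(ip, environment)
  ip, environment = validate_args(ip, environment)
  host = Shellwords.escape(ip)
  [
    "ssh-keygen -R #{host}",                                    # drop the stale known_hosts entry
    "git remote set-url #{environment} deploy@#{host}:app.git", # hypothetical remote URL
    "script/provision #{host} #{environment}",                  # hypothetical wrapper around the build scripts
    "script/restore_backup #{host} #{environment}",             # hypothetical database restore helper
  ]
end

# Run each step in order and stop at the first failure.
def run_plan(commands, dry_run: false)
  commands.each do |command|
    puts "==> #{command}"
    next if dry_run
    system(command) or raise "step failed: #{command}"
  end
end

# Only define the task when this file is loaded by rake (for example as a Rakefile).
if defined?(Rake::DSL)
  namespace :server do
    desc "Provision a server: rake server:provision IP=... ENV=staging"
    task :provision do
      plan = provision_plan(ENV["IP"], ENV["ENV"])
      run_plan(plan, dry_run: ENV["DRY_RUN"] == "1")
    end
  end
end
```

Keeping the plan as plain data means the step list can be printed and reviewed with a dry run before anything touches a real server.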
Our current setup isn’t without flaws, but we’ve learnt a lot in the past few weeks.
What did we learn?
- Documentation gets out of date.
- Build scripts get out of date.
- Automate everything; manual steps get missed.
- Fire drills should be run regularly to make sure the build still works.
In the future, we’re hoping to set up a build on our CI server that will run nightly, completely rebuilding our staging environment from scratch.
Essentially we want an automated fire drill.
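A minimal sketch of how a nightly CI job might time such a drill, assuming the rebuild is the rake command shown earlier. The timed_drill helper is a hypothetical name, and the 30 minute budget is simply borrowed from the record our documentation claimed; neither is part of any existing setup.

```ruby
# Time a block with a monotonic clock and report whether it met the budget.
def timed_drill(budget_seconds)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { elapsed: elapsed, within_budget: elapsed <= budget_seconds }
end

# Example use in a CI script (left commented out so loading this file rebuilds nothing):
#
#   result = timed_drill(30 * 60) do
#     system("rake", "server:provision", "IP=#{ENV.fetch('IP')}", "ENV=staging") or
#       raise "rebuild failed"
#   end
#   puts format("rebuild took %.1f minutes", result[:elapsed] / 60)
#   exit 1 unless result[:within_budget]
```

Failing the build when the rebuild runs over budget turns a slow drift in build time into something the team notices the next morning.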