Building The Conversation

Fun with hourly database backups and git

For our six-month milestone at The Conversation we wanted to do something a little bit special. The nostalgia of these occasions usually calls for some kind of imagery to jog the memory (and jerk the tears). We didn’t want to disappoint…

The editors at The Conversation publish an amazing amount of content every day. The site is constantly growing and changing. How about a screenshot of the home page (twice a day) for the past six months? Sounds like it’s time for some crazy Ruby hax!

The basic algorithm is:

  • download an hourly database backup
  • restore the database
  • reset the codebase to the git refspec at the time of the backup
  • install the dependent gems with bundler
  • start the app
  • save a screenshot
  • stop the app
  • rinse and repeat
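
The steps above can be sketched as a small driver. This is only an illustration, not the script we actually ran: `process_backup` and the `steps` hash are hypothetical names, and each entry is a callable standing in for one of the methods described in the rest of this post. The one design choice worth showing is the `ensure`, which stops the app even when the screenshot step blows up.

```ruby
# Hypothetical driver for a single backup. `steps` maps step names to
# callables (lambdas or method objects) standing in for the real methods.
def process_backup(backup, steps)
  steps.fetch(:download).call(backup)
  steps.fetch(:restore).call(backup)
  steps.fetch(:reset_code).call(backup)
  steps.fetch(:bundle).call
  pid = steps.fetch(:start_app).call
  begin
    steps.fetch(:screenshot).call(backup)
  ensure
    # Always stop the app so the port is free for the next backup.
    steps.fetch(:stop_app).call(pid)
  end
end
```

Rinse and repeat is then just a loop over the list of backups, calling `process_backup` for each one.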

The backups

Thanks to some forward-thinking by Ben Hoskings, our database is backed up hourly to Cloud Files with a special filename format. For example:

tc_production-2011-09-11-12:18:02-3568a4a.psql.gz

This format is really the key to being able to pull off a stunt like this. Importantly it captures not only the time when the database was backed up, but also the git refspec of the app in production. Having these two pieces of information allows us to marry up the database and the codebase at every hour since the site was launched.
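
To make that concrete, here is a minimal sketch of pulling the two pieces of information out of a filename. The names `BackupInfo`, `BACKUP_NAME_PATTERN` and `parse_backup_name` are made up for this example, and the pattern is inferred from the single filename shown above, so treat it as an assumption rather than a spec. The timestamp is parsed as local time because the filename does not say which timezone it is in.

```ruby
require "time"

# Hypothetical value object: database name, backup time and git SHA.
BackupInfo = Struct.new(:database, :taken_at, :sha)

# Assumed layout: <database>-<YYYY-MM-DD-HH:MM:SS>-<short sha>.psql.gz
BACKUP_NAME_PATTERN = /\A(?<database>.+)-(?<stamp>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2})-(?<sha>[0-9a-f]+)\.psql\.gz\z/

def parse_backup_name(name)
  match = BACKUP_NAME_PATTERN.match(name)
  return nil unless match # not a backup file we recognise

  taken_at = Time.strptime(match[:stamp], "%Y-%m-%d-%H:%M:%S")
  BackupInfo.new(match[:database], taken_at, match[:sha])
end
```

With the timestamps parsed, picking two backups per day is a matter of grouping the results by date and keeping the entries closest to the hours you care about.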

The download_backup method streams the backup (if we haven’t already downloaded it) given a Cloud Files object:

def download_backup(object)
  Dir.mkdir BACKUP_DIR unless Dir.exist?(BACKUP_DIR)
  path = File.join(BACKUP_DIR, object.name)

  puts "downloading backup to #{path}..."

  unless File.exist?(path) && Digest::MD5.file(path).hexdigest == object.etag
    open(path, "wb") do |file|
      object.data_stream do |chunk|
        putc '.'
        file << chunk
      end
    end
  end
end

[…] >> #{LOG_PATH} 2>&1}
end

The codebase

From the backup filename we can extract the git refspec we need to be running. Then it’s just a matter of a git reset --hard:

def set_refspec(refspec)
  puts "setting refspec to #{refspec}..."
  %x{cd #{APP_DIR} && git reset --hard #{refspec} >> #{LOG_PATH} 2>&1}
end

Bundling the app

Bundling the app proved trickier than expected. The problem was that bundling from within a Ruby process which is itself running bundler is like Inception: shit gets complicated. It turns out that you’ve got to shell out within a Bundler.with_clean_env block, which tells bundler to clean up its environment.

RVM was the other beast which needed taming. Running everything inside a bash login subshell seemed to make it happy.

Here’s the bundle_app method:

def bundle_app
  puts "bundling..."
  Bundler.with_clean_env do
    %x{bash -lc "cd #{APP_DIR} && source ~/.rvm/scripts/rvm && rvm use 1.9.3@workspace --create && gem install bundler && pwd && bundle install >> #{LOG_PATH} 2>&1"}
  end
end
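
A note for anyone trying this on a newer Bundler: from Bundler 2.1 onwards `with_clean_env` is deprecated in favour of `with_unbundled_env`. The sketch below is one way to cope with both. `without_bundler_env` is a hypothetical helper name, not something Bundler provides, and it simply picks whichever of the two methods the installed Bundler responds to.

```ruby
require "bundler"

# Hypothetical helper: run the block with Bundler's environment variables
# stripped out, using the newer API when it is available.
def without_bundler_env(&block)
  if Bundler.respond_to?(:with_unbundled_env)
    Bundler.with_unbundled_env(&block) # Bundler 2.1 and later
  else
    Bundler.with_clean_env(&block)     # older Bundler versions
  end
end
```

Inside the block you would shell out exactly as in bundle_app above, for example `without_bundler_env { %x{bash -lc "bundle install"} }`.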

Firing up the app

The next challenge was starting the Rails server. Obviously we can’t just run it within the same process, as the server will block; we need to fork it. But how will we know when Rails has started and is ready to accept requests? The answer lies in the lsof utility, one of those unix power tools with a man page so long it makes your eyes glaze over.

Running lsof -Fp -i :3000 will output the PID of the process running on port 3000, which will be the Rails server when it’s booted up and ready to receive incoming connections. If lsof returns nothing then we sleep for a second and try again.

def start_rails
  rails_pid = nil

  puts "starting rails..."

  fork do
    Bundler.with_clean_env do
      %x{bash -lc "cd #{APP_DIR} && source ~/.rvm/scripts/rvm && rvm use 1.9.3@workspace && bundle exec rails s -p #{RAILS_PORT} >> #{LOG_PATH} 2>&1"}
    end
  end

  loop do
    putc "."
    output = %x{lsof -Fp -i :#{RAILS_PORT}}
    # lsof prints the PID as "p<pid>"; empty output means nothing is on the port yet.
    rails_pid = output[/^p(\d+)/, 1]
    break if rails_pid
    sleep 1
  end

  puts "running on pid #{rails_pid}"
  rails_pid.to_i
end
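
One weakness of the loop above is that it spins forever if Rails never boots, which is quite possible when replaying six months of old commits. As an alternative sketch (not what we ran), the readiness check can be done in plain Ruby by trying to open a TCP connection until a deadline passes. `wait_for_port` and its parameters are hypothetical names, and the keyword arguments need Ruby 2.0 or later. Note that this only tells you the port is accepting connections; you would still need lsof to find the PID.

```ruby
require "socket"

# Hypothetical helper: returns true once something accepts connections on
# the port, or false if nothing does before the timeout (in seconds).
def wait_for_port(port, host: "127.0.0.1", timeout: 60, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      TCPSocket.new(host, port).close
      return true
    rescue Errno::ECONNREFUSED, Errno::EHOSTUNREACH, Errno::ETIMEDOUT
      # Nothing is listening yet: give up at the deadline, otherwise retry.
      return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep interval
    end
  end
end
```

A caller could skip the backup and move on when this returns false, instead of hanging the whole run.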

Screenshot

Once Rails is running we need to browse to the homepage and save a screenshot. Selenium is the perfect tool for the job: after we hit the homepage we fire off some JavaScript to resize the browser window and then save a screenshot.

def save_screenshot(path)
  Dir.mkdir SCREENSHOT_DIR unless Dir.exist?(SCREENSHOT_DIR)
  puts "saving screenshot to #{path}..."
  Selenium::WebDriver.for(:firefox).tap do |driver|
    driver.navigate.to "http://localhost:#{RAILS_PORT}/"
    driver.execute_script %Q{window.resizeTo(#{SCREENSHOT_WIDTH}, #{SCREENSHOT_HEIGHT});}
    driver.save_screenshot(path)
    driver.quit
  end
end

Post-processing

After we’ve replayed the history of the app and have a swag full of screenshots, we’re done, right? Not exactly. Because Selenium doesn’t crop the screenshots for you, the height of each screenshot varies with the length of the content on the homepage. We need to crop every image to 1024x768. Enter Mogrify, a utility that comes with ImageMagick and is used to batch-process image files. To crop every image to its top 1024x768 pixels, use the following:

mogrify -gravity North -crop 1024x768+0+0 +repage *.png

The results

Read the code.