tombola careers


Continuous Disaster Recovery



Posted by Ryan Coates
Recently I’ve been fortunate enough to be working on a pretty widespread project. It has forced me to touch many different technical aspects of the business, and as a result it could change the way we deliver in the future.


It’s also allowed me to coin the term CDR (Continuous Disaster Recovery), which is what I would like to talk about a bit.

I’m not going to bore you with all the technical details of the project but I think it’s worth giving a bit of a summary for the sake of context.

The project is (can projects ever really be in the past?) to deliver a usable Tombola CMS (content management system) for all our international countries. Luckily for us, other teams had been working tirelessly on a robust and feature-complete CMS, so all we would need to do is drop it on a server and “poof”, content managed? No, not at all.

We made some decisions early on that would need to be implemented to deliver the project.

  1. We would script our infrastructure using Terraform
  2. We would ship our environment with our code using AMIs
  3. We would centralise application configuration

While these three design goals complicated the project no end, and probably should have been split into separate projects, they did unlock some pretty cool options for deployments.

After loads of work we had scripted the infrastructure, were baking images and had all our configuration defined in the appropriate environment. All we needed now was a way to deploy and synchronize our CMS state between environments.

Our CMS wouldn’t really run like a traditional code pipeline. Normally a developer or author makes changes, which are then passed through various staging environments to be tested and approved before going live. The CMS would need to run differently: authors would make changes on the live site, have someone preview their changes and publish them. The changes would be instantly live, which means Live becomes our single source of truth that the other environments synchronize with.

Our CMS stores its state in two places: S3 buckets hold images and media, while a relational database holds the object representations of the pages.

So all we would need to do is restore the database from the source and copy all the S3 content across too. Easy, right? Well yes, actually it is.

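#!/usr/bin/env bash
# Takes an existing snapshot for the CMS instance, deletes the existing db and
# restores it from the snapshot, then adds the dependencies the snapshot lacks.
# Finally the S3 buckets are synchronized.
# Expects db_instance, snapshot_name, vpc_security_group, source_bucket and
# target_bucket to be set before this point.
set -euo pipefail  # stop on the first failed command or unset variable

echo "checking snapshot exists"
# Fails (and stops the script) before anything is deleted if the snapshot is missing.
aws rds describe-db-snapshots --db-snapshot-identifier "$snapshot_name" > /dev/null

echo "deleting database"
aws rds delete-db-instance --db-instance-identifier "$db_instance" --skip-final-snapshot
# Deletion is asynchronous; the identifier cannot be reused until it has finished.
aws rds wait db-instance-deleted --db-instance-identifier "$db_instance"

echo "restoring from snapshot"
aws rds restore-db-instance-from-db-snapshot --db-instance-identifier "$db_instance" --db-snapshot-identifier "$snapshot_name" --db-subnet-group-name mysql_subnet_group

echo "waiting for db instance to become available"
aws rds wait db-instance-available --db-instance-identifier "$db_instance"

echo "adding security groups"
# Left unquoted on purpose so a space-separated list of group IDs splits into separate arguments.
aws rds modify-db-instance --db-instance-identifier "$db_instance" --vpc-security-group-ids $vpc_security_group

# The S3 bucket will need cross-account sharing for this to work.
echo "sync s3 buckets"
aws s3 sync "$source_bucket" "$target_bucket"
echo "All done!"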

A simple bash script makes this all happen, but there are a few caveats.

This assumes that a snapshot exists and has been shared with the environment you want to restore into. The shared snapshot also needs to be copied, which you can do like this:

aws rds copy-db-snapshot --source-db-snapshot-identifier shared-snapshot --target-db-snapshot-identifier local-snapshot
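
When the snapshot has been shared from another account, the source identifier has to be the snapshot's full ARN rather than its short name. The following is a minimal sketch of that copy step, not the script we actually run: the region, the account ID and the DRY_RUN switch are illustrative assumptions, and by default it only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch: copy a snapshot shared from another account into this one.
# Assumptions (not from the original post): region, account ID and the
# DRY_RUN switch are placeholders chosen for illustration.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the ARN of an RDS snapshot:
#   arn:aws:rds:<region>:<account-id>:snapshot:<name>
shared_snapshot_arn() {
  local region="$1" account_id="$2" name="$3"
  echo "arn:aws:rds:${region}:${account_id}:snapshot:${name}"
}

# Copy the shared snapshot to a local one, then wait until it is usable.
copy_shared_snapshot() {
  local source_arn="$1" target_name="$2"
  run aws rds copy-db-snapshot \
    --source-db-snapshot-identifier "$source_arn" \
    --target-db-snapshot-identifier "$target_name" || return 1
  run aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$target_name"
}

# Example (hypothetical values):
#   arn="$(shared_snapshot_arn eu-west-1 111122223333 shared-snapshot)"
#   DRY_RUN=0 copy_shared_snapshot "$arn" local-snapshot
```

Waiting for the copy to become available matters because a restore started against a snapshot that is still being created will fail.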

So you should have routine tasks for taking snapshots, sharing them and copying them.
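
The source-account half of that routine could look something like the sketch below. It is an assumption about how such a task might be written, not our actual job: the date-based naming scheme, the function names and the DRY_RUN switch are all hypothetical, and the account ID is a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: take a manual snapshot of the live instance and share it with
# another account. Naming scheme and DRY_RUN switch are assumptions.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Name snapshots after the instance and the day, e.g. cms-live-2024-01-31.
# The day defaults to today but can be passed in explicitly.
snapshot_name_for() {
  local db_instance="$1" day="${2:-$(date +%F)}"
  echo "${db_instance}-${day}"
}

# Create the snapshot, wait for it, then allow the target account to restore it.
snapshot_and_share() {
  local db_instance="$1" target_account="$2"
  local name
  name="$(snapshot_name_for "$db_instance")"
  run aws rds create-db-snapshot \
    --db-instance-identifier "$db_instance" \
    --db-snapshot-identifier "$name" || return 1
  run aws rds wait db-snapshot-available \
    --db-snapshot-identifier "$name" || return 1
  run aws rds modify-db-snapshot-attribute \
    --db-snapshot-identifier "$name" \
    --attribute-name restore \
    --values-to-add "$target_account"
}

# Example (hypothetical values), e.g. from a nightly scheduled job:
#   DRY_RUN=0 snapshot_and_share cms-live 111122223333
```

Run on a schedule, this keeps a recent shared snapshot around for the other environments to copy and restore from.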

You also need to make sure that whatever runs this script has delegated privileges on the source S3 bucket, like so:

{
  "Sid": "umbracoRestoreS3SharedAccount",
  "Effect": "Allow",
  "Principal": {"AWS": "accno"},
  "Action": [ ... ],
  "Resource": "arn:aws:s3:::source-bucket/*"
}

So now I can bring up a CMS stack with all of its dependencies and minimal security privileges, and synchronize it with the live instance using a single script.
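
For the S3 side, a complete bucket policy for the delegated access might be generated along these lines. The action list is not shown in the fragment above, so the two actions here are my assumption of the minimum a read-only `aws s3 sync` needs (listing the bucket and reading its objects); the account ID and bucket name are placeholders, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: render a bucket policy that lets another account read the source
# bucket. The actions below are an assumption, not taken from the original
# policy: s3:ListBucket applies to the bucket, s3:GetObject to its objects.

# render_bucket_policy ACCOUNT_ID BUCKET
render_bucket_policy() {
  local account_id="$1" bucket="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "umbracoRestoreS3SharedAccount",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::${account_id}:root"},
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::${bucket}",
        "arn:aws:s3:::${bucket}/*"
      ]
    }
  ]
}
EOF
}

# Example (hypothetical values), run in the account that owns the bucket:
#   aws s3api put-bucket-policy --bucket source-bucket \
#     --policy "$(render_bucket_policy 111122223333 source-bucket)"
```

Keeping the grant read-only fits the idea that Live is the single source of truth: other environments can pull from it but never write back.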

So why does this matter?

  • Well, we can bring a fully synchronized environment up in minutes, which means developers and authors can play around and experiment with the confidence that they can never break dev.
  • We can disable unneeded environments
  • We could even use this approach to facilitate testing; it could even support test automation.
  • Finally, it forces us to constantly test and improve our DR strategy.

It’s still early days for us working with this, but the current trends in shipping environments with code, scripting infrastructure and cloud computing are giving us some fantastic opportunities, which we will definitely be taking advantage of.
