Now that we have a Rake task to copy MongoDB databases, we face the next problem. We store images on Amazon S3, and each environment has its own S3 bucket. So copying data from production to staging also means synchronizing the production and staging S3 buckets, ideally quickly, even for a very large number of files.
We’ll take inspiration from this post and use right_aws to connect to S3 from Ruby. Our S3 keys are stored in the heroku.yml file; your mileage may vary.
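A minimal sketch of the connection step. The heroku.yml key names below are assumptions — adjust them to match your own config file:

```ruby
require 'yaml'

# Read the AWS credentials from heroku.yml.
# The 'aws_access_key_id' / 'aws_secret_access_key' names are assumptions.
def aws_credentials(path = 'config/heroku.yml')
  config = YAML.load_file(path)
  [config['aws_access_key_id'], config['aws_secret_access_key']]
end

# Build a right_aws S3Interface from those credentials.
def s3_interface(path = 'config/heroku.yml')
  require 'right_aws' # lazy require so the file loads without the gem
  access_key, secret_key = aws_credentials(path)
  RightAws::S3Interface.new(access_key, secret_key)
end
```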
Once connected, we need to fetch all the keys from the source bucket. You might have heard that Amazon S3 limits a single listing query to 1000 items, but the right_aws S3Interface has a nice incremental feature that pages through the full listing. Since we’ll need to compare the source and target collections, let’s put the items in a hash.
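Something along these lines, assuming right_aws’s incrementally_list_bucket yields pages whose :contents array holds item hashes with :key and :e_tag fields (check this against your version of the gem):

```ruby
# Build a Hash of every key in a bucket. incrementally_list_bucket pages
# through the listing (S3 returns at most 1000 keys per request) and yields
# each page to the block.
def fetch_keys(s3i, bucket)
  keys = {}
  s3i.incrementally_list_bucket(bucket) do |response|
    response[:contents].each do |item|
      keys[item[:key]] = item # item carries :key, :e_tag, :size, etc.
    end
  end
  keys
end
```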
My first implementation used the S3 bucket object, whose enumeration turned out to be very slow, roughly 30 seconds per 1000 items; S3Interface is much faster. The rest is easy: we walk the source hash, copy any new or changed items, then walk the target hash and delete any stale items.
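The walk can be sketched as a pure diff over the two hashes plus the copy and delete calls; here I compare ETags (an MD5 of the object’s contents) to detect changes, which is an assumption about what “changed” should mean for your data:

```ruby
# Compare the two listings: copy keys that are new or whose ETag differs,
# delete target keys that no longer exist at the source.
def diff_buckets(source_keys, target_keys)
  to_copy = source_keys.keys.select do |key|
    target = target_keys[key]
    target.nil? || target[:e_tag] != source_keys[key][:e_tag]
  end
  to_delete = target_keys.keys - source_keys.keys
  [to_copy, to_delete]
end

# Apply the diff with right_aws's copy and delete calls.
def sync_buckets(s3i, source_bucket, target_bucket, source_keys, target_keys)
  to_copy, to_delete = diff_buckets(source_keys, target_keys)
  to_copy.each { |key| s3i.copy(source_bucket, key, target_bucket, key) }
  to_delete.each { |key| s3i.delete(target_bucket, key) }
end
```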
Here’s the full Rake task. Edit the bucket names to match yours and run `rake s3:sync:production:to_staging`.
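A sketch of how such a task might be laid out. The bucket names are placeholders, and s3_interface, fetch_keys, and sync_buckets are hypothetical helpers standing in for the connect, list, and walk steps described in this post:

```ruby
require 'rake'
include Rake::DSL # only needed outside a real Rakefile

namespace :s3 do
  namespace :sync do
    namespace :production do
      desc 'Copy every S3 object from the production bucket to staging'
      task :to_staging do
        s3i = s3_interface
        # List both buckets, then copy/delete to make staging match production.
        source_keys = fetch_keys(s3i, 'myapp-production')
        target_keys = fetch_keys(s3i, 'myapp-staging')
        sync_buckets(s3i, 'myapp-production', 'myapp-staging',
                     source_keys, target_keys)
      end
    end
  end
end
```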
The last issue is object permissions: S3Interface does not copy an object’s ACL along with the object. In our case we want the newly created files to be public. I started by writing a task to copy permissions from the bucket itself.
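That first attempt looked roughly like this, assuming right_aws’s get_acl (with no key, for the bucket’s own ACL) and put_acl return and accept the raw ACL XML under :object — verify against your version of the gem:

```ruby
# Read the bucket's own ACL and stamp it onto a key.
def copy_bucket_acl(s3i, bucket, key)
  bucket_acl = s3i.get_acl(bucket)        # ACL of the bucket itself
  s3i.put_acl(bucket, key, bucket_acl[:object])
end
```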
Unfortunately, this forces me to have a public bucket, meaning anyone can enumerate the list of files. That’s not what I want. Digging deeper, the S3 copy operation accepts an x-amz-acl header that lets us specify a canned ACL for the target object during the copy itself.
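In right_aws, that means passing the header through S3Interface#copy, whose trailing argument is (as I read the gem) a hash of extra request headers:

```ruby
# Copy a key and make the new object public in the same request by sending
# the canned 'public-read' ACL in the x-amz-acl header.
def copy_public(s3i, source_bucket, target_bucket, key)
  s3i.copy(source_bucket, key, target_bucket, key, :copy,
           'x-amz-acl' => 'public-read')
end
```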