

Now that we have a Rake task to copy MongoDB databases, we're facing the next problem. We store images on Amazon S3 and each environment has its own S3 bucket, so copying data from production to staging also means synchronizing the production and staging S3 buckets, ideally quickly, even for a very large number of files.

We'll take inspiration from this post and use right_aws to connect to S3 from Ruby. Our S3 keys are stored in config/heroku.yml; your mileage may vary.

def s3i
  @@s3 ||= s3i_open
end

def s3i_open
  s3_config = YAML.load_file(Rails.root.join("config/heroku.yml")).symbolize_keys
  s3_key_id = s3_config[:production]['config']['S3_ACCESS_KEY_ID']
  s3_access_key = s3_config[:production]['config']['S3_SECRET_ACCESS_KEY']
  RightAws::S3Interface.new(s3_key_id, s3_access_key, { logger: Rails.logger })
end
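
For reference, here's roughly the structure s3i_open expects after YAML.load_file(...).symbolize_keys. The values below are placeholders, not real credentials; adjust the nesting to match your own heroku.yml.

# Roughly the parsed shape s3i_open reads from; placeholder values only.
s3_config = {
  production: {
    'config' => {
      'S3_ACCESS_KEY_ID'     => 'your-access-key-id',
      'S3_SECRET_ACCESS_KEY' => 'your-secret-access-key'
    }
  }
}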

Once connected, we need to fetch all the keys from the source bucket. You might have heard that Amazon S3 limits a single listing to 1000 items, but the right_aws S3Interface can list a bucket incrementally. Since we'll need to compare the source and target collections, let's put the items in a hash keyed by object key.

logger.info("[#{Time.now}] fetching keys from #{args[:from]}")
source_objects_hash = Hash.new
s3i.incrementally_list_bucket(args[:from]) do |response|
  response[:contents].each do |source_object|
    source_objects_hash[source_object[:key]] = source_object
  end
end

My first implementation used the S3 bucket object, which turned out to be very slow. Enumerating with S3Interface takes roughly 30 seconds per 1000 items, which works for us. The rest is easy: we walk the source hash and copy any new or changed items, then walk the target hash and delete anything that no longer exists in the source.

Here’s the full Rake task. Edit your bucket names and run rake s3:sync:production:to_staging.

require 'logger'

namespace :s3 do
  namespace :sync do

    # Log progress to STDOUT.
    def logger
      @@logger ||= Logger.new(STDOUT)
    end

    # Memoized S3 connection.
    def s3i
      @@s3 ||= s3i_open
    end

    def s3i_open
      s3_config = YAML.load_file(Rails.root.join("config/heroku.yml")).symbolize_keys
      s3_key_id = s3_config[:production]['config']['S3_ACCESS_KEY_ID']
      s3_access_key = s3_config[:production]['config']['S3_SECRET_ACCESS_KEY']
      RightAws::S3Interface.new(s3_key_id, s3_access_key, { logger: Rails.logger })
    end

    namespace :production do
      desc "Sync production bucket to staging."
      task :to_staging => :environment do
        Rake::Task["s3:sync:syncObjects"].execute({ from: "production", to: "staging" })
      end
    end

    desc "Sync two s3 buckets."
    task :syncObjects, [:from, :to] => :environment do |t, args|
      start_time = Time.now
      logger.info("[#{Time.now}] synchronizing from #{args[:from]} to #{args[:to]}")

      logger.info("[#{Time.now}] fetching keys from #{args[:from]}")
      source_objects_hash = Hash.new
      s3i.incrementally_list_bucket(args[:from]) do |response|
        response[:contents].each do |source_object|
          source_objects_hash[source_object[:key]] = source_object
        end
      end

      logger.info("[#{Time.now}] fetching keys from #{args[:to]}")
      target_objects_hash = Hash.new
      s3i.incrementally_list_bucket(args[:to]) do |response|
        response[:contents].each do |target_object|
          target_objects_hash[target_object[:key]] = target_object
        end
      end

      logger.info("[#{Time.now}] synchronizing #{source_objects_hash.size} => #{target_objects_hash.size} object(s)")

      source_objects_hash.each do |key, source_object|
        target_object = target_objects_hash[key]
        if (target_object.nil?)
          logger.info(" #{key}: copy")
          s3i.copy(args[:from], key, args[:to], key)
        elsif (DateTime.parse(target_object[:last_modified]) < DateTime.parse(source_object[:last_modified]))
          logger.info(" #{key}: update")
          s3i.copy(args[:from], key, args[:to], key)
        else
          logger.info(" #{key}: skip")
        end
      end

      target_objects_hash.each_key do |key|
        if (! source_objects_hash.has_key?(key))
          logger.info(" #{key}: delete")
          s3i.delete(args[:to], key)
        end
      end

      logger.info("[#{Time.now}] done (#{Time.now - start_time})")
    end
  end
end
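
If your bucket names don't literally match the environment names, you can also invoke the parameterized task directly, for example rake s3:sync:syncObjects[my-production-bucket,my-staging-bucket]. The bucket names here are placeholders; quote the brackets if your shell expands them.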

The last issue is object permissions. Files copied via S3Interface don't get their ACLs copied along. In our case we want the newly created files to be public. I started by writing a task that copies the bucket's own ACL onto every key in it.

desc "Apply bucket's ACLs on all keys in it."
task :applyAcl, [:bucket] => :environment do |t, args|
  acl = s3i.get_acl(args[:bucket])
  s3i.incrementally_list_bucket(args[:bucket]) do |response|
    response[:contents].each do |source_object|
      s3i.put_acl(args[:bucket], source_object[:key], acl[:object])
    end
  end
end

Unfortunately, this forces me to make the bucket itself public, which means anyone can enumerate its list of files. That's not what I want. Digging deeper, the S3 copy operation accepts an x-amz-acl header, which lets us specify a canned ACL for the target object during the copy.

s3i.copy(args[:from], key, args[:to], key, :copy, { 'x-amz-acl' => 'public-read' } )
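
To wire this into the syncObjects task above, one option is a small helper so that both the copy and update branches apply the same canned ACL. This is just a sketch: copy_public is a name I made up, and public-read matches our requirement that synced files be publicly readable.

# Copy an object between buckets, making the target copy publicly readable.
def copy_public(from_bucket, to_bucket, key)
  s3i.copy(from_bucket, key, to_bucket, key, :copy, { 'x-amz-acl' => 'public-read' })
end

Both s3i.copy calls in syncObjects then become copy_public(args[:from], args[:to], key).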

Please suggest any improvements!