How To Copy Terabytes of Data Between AWS S3 Buckets by@hariohmprasath

Hari Ohm Prasath

Problem statement:

As part of our regular production upgrade, we were trying to back up the data in an S3 bucket with:

Item count: 1,000,344 and size: ~130 GB

We initiated the backup using the regular S3 commands, such as:

aws s3 cp --recursive s3://<source-bucket> s3://<target-bucket>
aws s3 sync s3://<source-bucket> s3://<target-bucket>

During execution, we noticed that the copy took hours, and we found no straightforward way to make these commands faster. The only workaround we found was to run the commands in parallel in multiple terminals so that each one operates on a different S3 partition (key prefix) at the same time. That speeds up the copy, but it is neither elegant nor scalable.
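The multi-terminal workaround can be sketched as a small thread pool that starts one `aws s3 sync` per key prefix. This is a minimal sketch, not what we ran: the class name, the method names, and the choice of prefixes are hypothetical, and it assumes the AWS CLI is installed and configured on the machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelPrefixSync {

    // Builds the argument list for syncing one key prefix between two buckets.
    static List<String> syncCommand(String source, String target, String prefix) {
        return List.of("aws", "s3", "sync",
                "s3://" + source + "/" + prefix,
                "s3://" + target + "/" + prefix);
    }

    // Runs one sync per prefix, at most `threads` at a time, and returns the exit codes.
    static List<Integer> syncAll(String source, String target, List<String> prefixes, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String prefix : prefixes) {
                // Each task launches the AWS CLI as a child process and waits for it.
                Callable<Integer> task = () ->
                        new ProcessBuilder(syncCommand(source, target, prefix))
                                .inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get());
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```

Even in this form, everything runs on a single machine and the prefixes have to be chosen by hand, which is why the approach does not scale.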

Other options:

We tried a couple of other options mentioned on Stack Overflow and the AWS forums, such as:

S3 Batch operations

S3 Batch Operations seemed to solve this problem, but at the time it did not support objects encrypted with a KMS key. When I created a job to copy the contents of a bucket with KMS key encryption enabled, I got the following error:

Unsupported encryption type used: SSE_KMS

When I read more about this in the AWS docs, the “Specifying a Manifest” section stated: “Manifests that use server-side encryption with customer-provided keys (SSE-C) and server-side encryption with AWS KMS managed keys (SSE-KMS) are not supported.”


s3-dist-cp seemed promising, but when I ran it against a bucket that had close to 6 TB of data, the job failed while running the “reduce” task after 40 minutes, without any clear indication of why it failed.

Custom approach:

Unfortunately, none of the approaches mentioned above solved our problem, so we came up with our own. It can be optimized further, so think of it as a first step toward solving this problem.

It is a two-step process that combines a shell script and Spark code. First, we generate a record file (containing the object keys); then we run a Spark job that copies the files in parallel across nodes in multiple tasks.

Generating the record file:

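We need to generate a text file containing the object keys of the items in the source S3 bucket (the ones to be copied). This is done by running the following command on any EC2 instance. The Spark job in the next step reads this file from S3 (s3://test_bucket/output.txt in the spark-submit example), so upload it there after generating it.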

aws s3 ls s3://test_bucket --recursive | awk '{print $4}' > /tmp/output.txt

Output (just the object keys, one per line):

data/solution=33/, etc
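One caveat with `awk '{print $4}'` is that it keeps only the fourth whitespace-separated field, so an object key that contains spaces is cut off at the first space. A minimal sketch of a more tolerant parser is shown here, assuming the usual `<date> <time> <size> <key>` layout of `aws s3 ls --recursive` output; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class S3ListingParser {

    // Extracts the object key from one line of `aws s3 ls --recursive` output,
    // which has the form "<date> <time> <size> <key>".
    // Splitting into at most 4 fields keeps spaces inside the key intact.
    static String keyOf(String line) {
        String[] fields = line.trim().split("\\s+", 4);
        return fields.length == 4 ? fields[3] : null;
    }

    // Returns the keys of all well-formed lines, skipping blank or malformed ones.
    static List<String> keysOf(List<String> lines) {
        List<String> keys = new ArrayList<>();
        for (String line : lines) {
            String key = keyOf(line);
            if (key != null) {
                keys.add(key);
            }
        }
        return keys;
    }
}
```

For a made-up listing line such as `2021-01-05 10:15:30       1024 data/solution=33/part 1.csv`, `keyOf` returns the whole key `data/solution=33/part 1.csv`, whereas the awk command would print only `data/solution=33/part`.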

Spark code:
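 // Excerpt from the job: the record file is read into a Dataset<String> before this chain
 .flatMap((FlatMapFunction<String, String>) s -> Arrays.asList(s.split("\n")).iterator(), Encoders.STRING())
 // Build one "aws s3 cp" command per object key, from the source bucket to the target bucket
 .map((MapFunction<String, String>) s -> String.format("aws s3 cp s3://%s/%s s3://%s/%s", source, s, target, s), Encoders.STRING())
 .foreachPartition((ForeachPartitionFunction<String>) iterator -> {
       while (iterator.hasNext())
           Runtime.getRuntime().exec(iterator.next()).waitFor(); // assumed loop body: the excerpt ends before it
 });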

Spark Submit:

spark-submit --class com.s3.S3Copy s3://test_bucket/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt

args[0] → Source bucket

args[1] → Target bucket

args[2] → S3 record file generated in the previous step

This code reads the “output.txt” file, splits it into multiple partitions, and processes them in parallel across multiple nodes.
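The work each task does for its partition can be sketched without Spark: walk the keys of the partition and launch one `aws s3 cp` per key. This is a minimal sketch under assumptions, not the code from the job: the class and method names are hypothetical, it assumes the AWS CLI is available on every node, and it passes the command as an argument list rather than a single string so that keys containing spaces survive.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;

public class CopyTask {

    // Argument list for copying one key from the source bucket to the target bucket.
    static List<String> copyCommand(String source, String target, String key) {
        return List.of("aws", "s3", "cp",
                "s3://" + source + "/" + key,
                "s3://" + target + "/" + key);
    }

    // Copies every key in one partition sequentially and returns the number of failed copies.
    static int copyPartition(String source, String target, Iterator<String> keys)
            throws IOException, InterruptedException {
        int failures = 0;
        while (keys.hasNext()) {
            Process process = new ProcessBuilder(copyCommand(source, target, keys.next()))
                    .inheritIO()   // forward the CLI output to the task's own stdout/stderr
                    .start();
            if (process.waitFor() != 0) {
                failures++;
            }
        }
        return failures;
    }
}
```

Counting non-zero exit codes is a design choice of this sketch: it gives the caller a way to notice keys that were not copied instead of silently moving on.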

Performance Test

With 15 EMR core nodes of the m4.xlarge instance type, we were able to copy 5.5 TB of data in less than 40 minutes. Since we pay for EMR only for the time we use it, this approach is cost-effective (further cost reduction is possible with Spot Instances or an EC2 fleet configuration) and much more scalable than the previous approach.
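As a rough back-of-the-envelope check, these figures translate into the following throughput. The calculation assumes decimal units (1 TB = 10^6 MB) and an even spread across the 15 nodes. Because the copy finished in less than 40 minutes, both values are lower bounds.

```latex
\[
\frac{5.5\ \text{TB}}{40\ \text{min}}
= \frac{5.5 \times 10^{6}\ \text{MB}}{2400\ \text{s}}
\approx 2292\ \text{MB/s (aggregate)},
\qquad
\frac{2292\ \text{MB/s}}{15\ \text{nodes}} \approx 153\ \text{MB/s per node}
\]
```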

Spark submit:

spark-submit \
  --conf spark.executor.heartbeatInterval=410000s \
  --conf spark.yarn.scheduler.mode=FAIR \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.memoryOverhead=1024 \
  --conf spark.driver.memoryOverhead=1024 \
  --conf spark.executor.instances=74 \
  --conf spark.executor.cores=6 \
  --conf spark.driver.cores=6 \
  --conf spark.driver.memory=10g \
  --conf spark.executor.memory=10g \
  --conf spark.default.parallelism=888 \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.sql.broadcastTimeout=360000 \
  --class com.s3.S3Copy \
  s3://dmp-dms-k8s-dev-fico-pto-tenant/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt
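The sizing values in this command relate to each other in a simple way. The following is an observation about the numbers, not a rule stated anywhere above: the parallelism works out to two partitions per requested executor core.

```latex
\[
\underbrace{74}_{\text{executor instances}} \times \underbrace{6}_{\text{executor cores}}
= 444\ \text{concurrent task slots},
\qquad
\frac{888}{444} = 2\ \text{partitions per slot}
\]
```

An m4.xlarge instance has 4 vCPUs and 16 GiB of memory, so 15 core nodes provide 60 vCPUs and 240 GiB in total. That is less than the 444 cores and 74 × 10g of executor memory requested here, so YARN may grant fewer executors than requested. Treat these values as a starting point to tune for your own cluster.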
