Problem statement:

As part of our regular production upgrade, we were trying to back up data in an S3 bucket with Item Count: 1,000,344 and Size: ~130 GB. We were initiating the backup using the regular aws s3 commands:

```shell
aws s3 cp s3://<source bucket> s3://<target bucket> --recursive
aws s3 sync s3://<source bucket> s3://<target bucket>
```

During execution, we noticed it took hours and hours to perform the copy, and there was no way to make it faster. The only workaround we found was to run these commands in parallel in multiple terminals so they all operate on different S3 partitions at the same time and finish the copy faster, which is neither an elegant solution nor a scalable one.

Other options:

We tried a couple of other options mentioned on Stack Overflow and the AWS forums.

S3 Batch Operations

S3 Batch Operations seems to solve this problem, but at this point it does not support objects encrypted with a KMS key. When I created a job to copy the contents of a bucket with KMS-key encryption enabled, I got the following error:

Unsupported encryption type used: SSE_KMS

Reading further in the AWS docs, the "Specifying a Manifest" section states: "Manifests that use server-side encryption with customer-provided keys (SSE-C) and server-side encryption with AWS KMS managed keys (SSE-KMS) are not supported." (https://docs.aws.amazon.com/AmazonS3/latest/dev/batch-ops-basics.html#specify-batchjob-manifest)

s3-dist-cp

s3-dist-cp seemed promising, but when I ran it against a bucket with close to 6 TB of data, the job failed in the "reduce" task after 40 minutes, without any clear indication of why.

Custom approach:

Unfortunately, none of the approaches mentioned above solved our problem, so we came up with the following one. It can be optimized further, so think of it as a first step toward solving this problem. It is a two-step process, a combination of a shell script and Spark code.
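For completeness, the multi-terminal stopgap described above can be sketched as one script: launch one recursive copy per key prefix in a background subshell. The prefix names are illustrative, and the `echo` makes this a dry run that only prints the commands; remove it to actually invoke the AWS CLI.

```shell
# Stopgap: one recursive copy per S3 key prefix, run in parallel.
# Prefixes are illustrative; `echo` keeps this a dry run.
for prefix in 'data/solution=33' 'data/solution=34'; do
  echo aws s3 cp "s3://test_bucket/$prefix" "s3://back_up_bucket/$prefix" --recursive &
done
wait  # block until every background copy has finished
```

This is exactly the manual parallelism that the Spark approach below automates: the number of concurrent copies is capped by how many prefixes (or terminals) you are willing to manage by hand.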
First, we generate a record file (containing the object keys); then a Spark job copies the files in parallel, across nodes, in multiple tasks.

Generating the record file:

We need to generate a text file containing the object keys of the items inside the source S3 bucket (the objects to be copied), one key per line. This is done by running the following command on any EC2 instance:

```shell
aws s3 ls s3://test_bucket --recursive | awk '{print $4}' > /tmp/output.txt
```

Output (just object keys, one per line):

data/solution=33/test1.mov
data/solution=33/test2.mov
etc.

Spark code:

```java
sql.read()
   .textFile(file)
   // Spread the key list across many partitions so the copies run in parallel.
   .repartition(2000)
   .flatMap((FlatMapFunction<String, String>) s ->
           Arrays.asList(s.split("\n")).iterator(), Encoders.STRING())
   // Turn each key into an `aws s3 cp <source url> s3://<target>/<key>` command.
   .map((MapFunction<String, String>) s ->
           String.format("aws s3 cp %s s3://%s/%s",
                   String.format("s3://%s/%s", source, s), target, s),
           Encoders.STRING())
   // Each task shells out and runs its commands one by one.
   .foreachPartition((ForeachPartitionFunction<String>) iterator -> {
       while (iterator.hasNext())
           Runtime.getRuntime().exec(iterator.next()).waitFor();
   });
```

Spark Submit:

```shell
spark-submit --class com.s3.S3Copy s3://test_bucket/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt
```

args[0] → source bucket
args[1] → target bucket
args[2] → S3 record file generated in the previous step

This code reads the "output.txt" file, splits it into multiple partitions, and runs them in parallel across multiple nodes.

Performance Test

With 15 EMR core nodes of instance type m4.xlarge, we were able to copy 5.5 TB of data in less than 40 minutes. Since we pay for EMR only for the time we use it, this is cost-effective (further cost reduction is possible by going with Spot Instances or an EC2 Fleet configuration) and much more scalable than the previous approach.
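As a quick sanity check of what the job actually executes, the command strings built by the Spark map() step can be reproduced in plain shell from the record file. The bucket names below match the earlier examples; the record file is recreated with two sample keys.

```shell
# Recreate a small record file like the one `aws s3 ls` produced.
printf 'data/solution=33/test1.mov\ndata/solution=33/test2.mov\n' > /tmp/output.txt

# Build the same `aws s3 cp` command per key that the Spark map() step
# formats before foreachPartition executes it.
awk -v src=test_bucket -v dst=back_up_bucket \
    '{ printf "aws s3 cp s3://%s/%s s3://%s/%s\n", src, $0, dst, $0 }' \
    /tmp/output.txt
# → aws s3 cp s3://test_bucket/data/solution=33/test1.mov s3://back_up_bucket/data/solution=33/test1.mov
# → aws s3 cp s3://test_bucket/data/solution=33/test2.mov s3://back_up_bucket/data/solution=33/test2.mov
```

Printing (or logging) these strings before handing them to `Runtime.exec` is a cheap way to verify key handling before launching a multi-TB run.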
Spark submit:

```shell
spark-submit \
  --conf spark.network.timeout=420000s \
  --conf spark.executor.heartbeatInterval=410000s \
  --conf spark.yarn.scheduler.mode=FAIR \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.executor.memoryOverhead=1024 \
  --conf spark.driver.memoryOverhead=1024 \
  --conf spark.executor.instances=74 \
  --conf spark.executor.cores=6 \
  --conf spark.driver.cores=6 \
  --conf spark.driver.memory=10g \
  --conf spark.executor.memory=10g \
  --conf spark.default.parallelism=888 \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.sql.broadcastTimeout=360000 \
  --class com.s3.S3Copy \
  s3://dmp-dms-k8s-dev-fico-pto-tenant/copier.jar test_bucket back_up_bucket s3://test_bucket/output.txt
```

Previously published at https://medium.com/javarevisited/copying-tbs-of-data-between-s3-buckets-8438dde7dadb