Automated Data Replication From AWS S3 To Microsoft Azure Storage Made Easy

Your business may need to move a large amount of data periodically from one public cloud to another. More specifically, you may face mandates requiring a multi-cloud solution. This article covers one approach to automating data replication from an AWS S3 bucket to a Microsoft Azure Blob Storage container using Amazon S3 Inventory, Amazon S3 Batch Operations, Fargate, and AzCopy.

Scenario

Your company produces new CSV files on-premises every day, with a total size of around 100 GB after compression. Each file is 1–2 GB and must be uploaded to Amazon S3 every night in a fixed time window between 3 am and 5 am. Your business has decided to copy those CSV files from S3 to Microsoft Azure Storage once all files have been uploaded to S3. You have to find an easy and fast way to automate the data replication workflow.

To accomplish this task, we can build a data pipeline that periodically copies data from S3 to Azure Storage using AWS Data Wrangler, Amazon S3 Inventory, Amazon S3 Batch Operations, Athena, Fargate, and AzCopy.

The diagram below represents the high-level architecture of the pipeline solution:

What we’ll cover:

- Create a VPC with private and public subnets, an S3 endpoint, and a NAT gateway.
- Create an Azure Storage account and blob container, generate a SAS token, then add a firewall rule to allow traffic from the AWS VPC to Azure Storage.
- Configure daily S3 Inventory reports on the S3 bucket.
- Use Athena to filter only the new objects from the S3 Inventory reports and export those objects’ bucket names and object keys to a CSV manifest file.
- Use the exported CSV manifest file to create an S3 Batch Operations PUT copy job that copies the objects to a destination S3 bucket with a lifecycle expiration rule configured.
- Set up an EventBridge rule that invokes a Lambda function, which runs a Fargate task that copies all objects with the same prefix in the destination bucket to the Azure Storage container.
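The "filter only the new objects" step in the list above boils down to a set difference between two consecutive inventory snapshots. The article does this in Athena; the following is only a minimal Python sketch of the idea, and the bucket name, keys, and helper names (new_objects, to_manifest_csv) are made up for illustration.

```python
import csv
import io


def new_objects(today, yesterday):
    """Return (bucket, key) pairs present in today's snapshot but not in yesterday's."""
    return sorted(set(today) - set(yesterday))


def to_manifest_csv(pairs):
    """Render (bucket, key) pairs as 'bucket,key' CSV lines, one object per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerows(pairs)
    return buf.getvalue()


# Hypothetical snapshots: one file existed yesterday, a second one is new today.
yesterday = [("demo-source", "sales/2021-01-01.csv")]
today = [
    ("demo-source", "sales/2021-01-01.csv"),
    ("demo-source", "sales/2021-01-02.csv"),
]

manifest = to_manifest_csv(new_objects(today, yesterday))
print(manifest, end="")
```

Only the object that appeared since the previous snapshot ends up in the manifest, which keeps each nightly copy job limited to that day's files.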
Prerequisites

- Set up an AWS account
- Set up an Azure account
- Install the latest AWS CLI
- Install the AWS CDK CLI
- Basic understanding of AWS CDK
- Basic understanding of Docker

Let’s begin!

Creating Source and Destination S3 buckets

We use CDK to build our infrastructure on AWS. First, let’s create a source bucket to receive files from external providers or from on-premises, and set up daily inventory reports that provide a flat-file list of its objects and metadata.

Next, create a destination bucket as temporary storage, with a lifecycle expiration rule configured on the prefix /tmp_transition. All files with that prefix (e.g. /tmp_transition/file1.csv) will be copied to Azure and removed by the lifecycle policy after 24 hours.

Use the following code to create the S3 buckets.

    from aws_cdk import (
        aws_s3 as s3,
        core,
    )

    # Destination bucket: temporary storage; objects under tmp_transition expire after one day.
    s3_destination = s3.Bucket(self, "dataBucketInventory",
        lifecycle_rules=[
            {
                'expiration': core.Duration.days(1.0),
                'prefix': 'tmp_transition'
            },
        ])

    # Source bucket: receives the uploads and writes a daily inventory to the destination bucket.
    s3_source = s3.Bucket(self, "demoDataBucket",
        bucket_name=self.s3_source_bucket_name,
        encryption=s3.BucketEncryption.S3_MANAGED,
        inventories=[
            {
                "frequency": s3.InventoryFrequency.DAILY,
                "include_object_versions": s3.InventoryObjectVersion.CURRENT,
                "destination": {
                    "bucket": s3_destination
                }
            }
        ])

Creating AWS VPC

Next, we need to create a VPC with both public and private subnets, a NAT Gateway, and an S3 endpoint, and attach an endpoint policy that lets the Fargate container access the S3 bucket whose data we copy to Azure.

Now define your VPC and related resources using the following code.
    from aws_cdk import (
        aws_ec2 as ec2,
        aws_iam as iam,
        core,
    )

    vpc = ec2.Vpc(self, "demoVPC",
        max_azs=2,
        cidr="10.0.0.0/16",
        nat_gateways=1,
        subnet_configuration=[{
            "cidrMask": 24,
            "name": 'private',
            "subnetType": ec2.SubnetType.PRIVATE
        }, {
            "cidrMask": 24,
            "name": 'public',
            "subnetType": ec2.SubnetType.PUBLIC
        }]
    )

    subnets = vpc.select_subnets(
        subnet_type=ec2.SubnetType.PRIVATE).subnets

    # Gateway endpoint so traffic from the private subnets to S3 stays inside AWS.
    endpoint = vpc.add_gateway_endpoint('s3Endpoint',
        service=ec2.GatewayVpcEndpointAwsService.S3,
        subnets=[{
            "subnet_id": subnets[0].subnet_id
        }, {
            "subnet_id": subnets[1].subnet_id
        }])

    # Allow reading from the bucket we copy to Azure.
    # s3:ListBucket is the action that authorizes object listing.
    endpoint.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            bucket_arn,
            f"{bucket_arn}/*"
        ],
        principals=[iam.ArnPrincipal("*")],
        actions=[
            "s3:GetObject",
            "s3:ListBucket"
        ],
    ))

    # Provides access to the Amazon S3 bucket containing the layers for each Docker image of ECR.
    endpoint.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            f"arn:aws:s3:::prod-{self.region}-starport-layer-bucket/*"
        ],
        principals=[iam.ArnPrincipal("*")],
        actions=[
            "s3:GetObject"
        ],
    ))

When the NAT Gateway is created, AWS also allocates an Elastic IP address for it. We will need that IP address to set up the Azure Storage firewall rule in step 3.

Deploying Azure Storage Account

To simplify managing resources, we can use an Azure Resource Manager template (ARM template) to deploy resources at the Azure subscription level.

I will assume you already have an Azure subscription set up. We will use Cloud Shell to deploy a resource group, an Azure Storage account, a container, and a firewall rule that allows traffic from a specific IP address.

Click the Cloud Shell icon in the Azure portal's header bar to open Cloud Shell.
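Before switching over to Azure, it helps to note down the NAT Gateway's Elastic IP mentioned above, since the storage firewall rule needs it later. The following is a minimal sketch, not part of the original project: the helper name nat_public_ips and the sample values are illustrative, and the sample dictionary is hand-written in the shape that boto3's ec2.describe_nat_gateways() returns.

```python
def nat_public_ips(response):
    """Collect the public (Elastic) IPs from a describe_nat_gateways-style response."""
    return [
        address["PublicIp"]
        for gateway in response.get("NatGateways", [])
        for address in gateway.get("NatGatewayAddresses", [])
        if "PublicIp" in address
    ]


# Hand-written sample; the IDs and addresses are placeholders, not real resources.
sample_response = {
    "NatGateways": [
        {
            "NatGatewayId": "nat-0123456789abcdef0",
            "NatGatewayAddresses": [
                {"PublicIp": "203.0.113.10", "PrivateIp": "10.0.1.25"}
            ],
        }
    ]
}

print(nat_public_ips(sample_response))
```

With AWS credentials configured, the same helper could be fed the live result of boto3.client("ec2").describe_nat_gateways() instead of the sample.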
Run the following commands to deploy:

    az group create --name examplegroup --location australiaeast
    az deployment group create --resource-group examplegroup --template-uri https://raw.githubusercontent.com/yai333/DataPipelineS32Blob/master/Azure-Template-DemoRG/template.json --parameters storageAccounts_mydemostroageaccount_name=mydemostorageaccountaiyi --debug

Once the template has been deployed, we can verify the deployment by exploring the resource group in the Azure portal. All deployed resources are displayed in the Overview section of the resource group.

Let’s create a firewall rule for our storage account:

- First, go to the storage account we just deployed.
- Second, click the settings menu called Firewalls and virtual networks.
- Third, check that access is allowed from Selected networks.
- Then, to grant access to an internet IP range, enter the AWS VPC’s public IP address (from step 2) and click Save.

We will then generate a Shared Access Signature (SAS) to grant limited access to Azure Storage resources. Run the commands below in Cloud Shell:

    RG_NAME='examplegroup'
    ACCOUNT_NAME='mydemostorageaccountaiyi'
    ACCOUNT_KEY=`az storage account keys list --account-name=$ACCOUNT_NAME --query [0].value -o tsv`
    BLOB_CONTAINER=democontainer
    STORAGE_CONN_STRING=`az storage account show-connection-string --name $ACCOUNT_NAME --resource-group $RG_NAME --output tsv`
    SAS=`az storage container generate-sas --connection-string $STORAGE_CONN_STRING -n $BLOB_CONTAINER --expiry '2021-06-30' --permissions aclrw --output tsv`
    echo $SAS

We will get the SAS that grants (a)dd, (c)reate, (l)ist, (r)ead, and (w)rite access to the blob container democontainer, matching the aclrw permissions requested above:

    se=2021-06-30&sp=racwl&sv=2018-11-09&sr=c&sig=xxxxbBfqfEppPpBZPOTRiwvkh69xxxx/xxxxQA0YtKo%3D

Let’s move back to AWS and put the SAS into the AWS SSM Parameter Store. Run the following command in a local terminal.
    aws ssm put-parameter --cli-input-json '{
      "Name": "/s3toblob/azure/storage/sas",
      "Value": "se=2021-06-30&sp=racwl&sv=2018-11-09&sr=c&sig=xxxxbBfqfEppPpBZPOTRiwvkh69xxxx/xxxxQA0YtKo%3D",
      "Type": "SecureString"
    }'

Defining Lambda functions and AWS Data Wrangler layer

Now, let’s move on to the Lambda functions. We will create three Lambda functions and one Lambda layer:

- fn_create_s3batch_manifest and DataWranglerLayer
- fn_create_batch_job
- fn_process_transfer_task

fn_create_s3batch_manifest and AWS Data Wrangler layer

This Lambda function uses AWS Data Wrangler’s Athena module to filter the files added during the previous UTC date and save the file list to a CSV manifest file.

Download the awswrangler-layer zip file from here, then copy the following code to the CDK stack.py.

    from aws_cdk import (
        aws_iam as iam,
        aws_lambda as lambda_,
        aws_s3 as s3,
        aws_s3_notifications as s3n,
        core,
    )

    datawrangler_layer = lambda_.LayerVersion(self, "DataWranglerLayer",
        code=lambda_.Code.from_asset("./layers/awswrangler-layer-1.9.6-py3.6.zip"),
        compatible_runtimes=[lambda_.Runtime.PYTHON_3_6]
    )

    fn_create_s3batch_manifest = lambda_.Function(self, "CreateS3BatchManifest",
        runtime=lambda_.Runtime.PYTHON_3_6,
        handler="lambda_create_s3batch_manifest.handler",
        timeout=core.Duration.minutes(15),
        code=lambda_.Code.from_asset("./src"),
        layers=[datawrangler_layer]
    )

    fn_create_s3batch_manifest.add_environment(
        "DESTINATION_BUCKET_NAME", s3_destination_bucket_name)
    fn_create_s3batch_manifest.add_environment(
        "SOURCE_BUCKET_NAME", self.s3_source_bucket_name)

    # Athena query permissions plus Glue table access.
    fn_create_s3batch_manifest.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=["*"],
        actions=[
            "glue:GetTable",
            "glue:CreateTable",
            "athena:StartQueryExecution",
            "athena:CancelQueryExecution",
            "athena:StopQueryExecution",
            "athena:GetQueryExecution",
            "athena:GetQueryResults"
        ],
    ))

    # Glue catalog, database, and partition permissions.
    fn_create_s3batch_manifest.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            f"arn:aws:glue:{self.region}:{self.account}:catalog",
            f"arn:aws:glue:{self.region}:{self.account}:database/*",
            f"arn:aws:glue:{self.region}:{self.account}:table/*"
        ],
        actions=[
            "glue:GetDatabases",
            "glue:GetDatabase",
            "glue:BatchCreatePartition",
            "glue:GetPartitions",
            "glue:CreateDatabase",
            "glue:GetPartition"
        ],
    ))

    # Trigger the function when a new inventory manifest (.json) is delivered.
    s3_destination.add_event_notification(s3.EventType.OBJECT_CREATED,
        s3n.LambdaDestination(fn_create_s3batch_manifest),
        {
            "prefix": f'{self.s3_source_bucket_name}/demoDataBucketInventory0/',
            "suffix": '.json'
        })

Then create ./src/lambda_create_s3batch_manifest.py with the following code:

    import json
    import logging
    import os
    from datetime import datetime, timedelta
    import awswrangler as wr

    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    DATABASE_NAME = "s3datademo"
    TABLE_NAME = "dailyobjects"


    def handler(event, context):
        logger.info("Received event: " + json.dumps(event, indent=2))
        if DATABASE_NAME not in wr.catalog.databases().values:
            wr.catalog.create_database(DATABASE_NAME)

        # Inventory partitions are named after the report date, e.g. 2021-01-01-00-00.
        event_date = datetime.strptime(
            event["Records"][0]["eventTime"], "%Y-%m-%dT%H:%M:%S.%fZ")
        partition_dt = f'{(event_date - timedelta(days=1)).strftime("%Y-%m-%d")}-00-00'
        previous_partition_dt = f'{(event_date - timedelta(days=2)).strftime("%Y-%m-%d")}-00-00'
        logger.debug(f"partition_dt: {partition_dt}")

        # Create the external table over the inventory files on first run.
        if not wr.catalog.does_table_exist(database=DATABASE_NAME, table=TABLE_NAME):
            table_query_exec_id = wr.athena.start_query_execution(
                s3_output=f"s3://{os.getenv('DESTINATION_BUCKET_NAME')}/athena_output",
                sql=f"CREATE EXTERNAL TABLE {TABLE_NAME}( \
                    `bucket` string, \
                    key string, \
                    version_id string, \
                    is_latest boolean, \
                    is_delete_marker boolean, \
                    size bigint, \
                    last_modified_date timestamp, \
                    e_tag string \
                    ) \
                    PARTITIONED BY(dt string) \
                    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' \
                    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' \
                    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat' \
                    LOCATION 's3://{os.getenv('DESTINATION_BUCKET_NAME')}/{os.getenv('SOURCE_BUCKET_NAME')}/demoDataBucketInventory0/hive/';",
                database=DATABASE_NAME)
            wr.athena.wait_query(query_execution_id=table_query_exec_id)

        # Register the newest daily partition.
        partition_query_exec_id = wr.athena.start_query_execution(
            sql=f"ALTER TABLE {TABLE_NAME} ADD IF NOT EXISTS PARTITION (dt=\'{partition_dt}\');",
            s3_output=f"s3://{os.getenv('DESTINATION_BUCKET_NAME')}/athena_output",
            database=DATABASE_NAME)
        wr.athena.wait_query(query_execution_id=partition_query_exec_id)

        # Objects in the newest partition that were not in the previous one.
        select_query_exec_id = wr.athena.start_query_execution(
            sql='SELECT DISTINCT bucket as "' + os.getenv('SOURCE_BUCKET_NAME') +
                '" , key as "dump.txt" FROM ' + TABLE_NAME +
                " where dt = '" + partition_dt + "' and is_delete_marker = false" +
                " except " +
                'SELECT DISTINCT bucket as "' + os.getenv('SOURCE_BUCKET_NAME') +
                '" , key as "dump.txt" FROM ' + TABLE_NAME +
                " where dt = '" + previous_partition_dt + "' and is_delete_marker = false ;",
            database=DATABASE_NAME,
            s3_output=f"s3://{os.getenv('DESTINATION_BUCKET_NAME')}/csv_manifest/dt={partition_dt}")
        return select_query_exec_id

In the code above, we use Athena queries to create the Glue database and table, and to add a partition to that table every day. The Lambda function then runs an EXCEPT query that returns the difference between the two date partitions.

Note that start_query_execution is asynchronous, so there is no need to wait for the result in Lambda. Once the query has run, the result is saved as a CSV file to s3_output=f"s3://{os.getenv('DESTINATION_BUCKET_NAME')}/csv_manifest/dt={partition_dt}".

fn_create_batch_job and S3 Notification

In this section, we will create a Lambda function, fn_create_batch_job, and enable Amazon S3 to send a notification that triggers fn_create_batch_job when a CSV file is added under the bucket's /csv_manifest prefix.
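One detail worth knowing before wiring up the notification: S3 delivers object keys URL-encoded in the event payload, so a key under a dt=... prefix arrives with %3D in place of the equals sign. The following is a minimal sketch, using a hand-written event with hypothetical bucket and key names, of turning such a record into the manifest location that an S3 Batch Operations job refers to. The helper name manifest_location is an assumption for illustration only.

```python
from urllib.parse import unquote


def manifest_location(event):
    """Build the manifest ObjectArn and ETag from the first record of an S3 event."""
    s3_info = event["Records"][0]["s3"]
    bucket_arn = s3_info["bucket"]["arn"]
    # Keys are URL-encoded in the notification, so decode before building the ARN.
    key = unquote(s3_info["object"]["key"])
    return {
        "ObjectArn": f"{bucket_arn}/{key}",
        "ETag": s3_info["object"]["eTag"],
    }


# Hand-written sample event; names and the ETag are placeholders.
sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "demo-destination", "arn": "arn:aws:s3:::demo-destination"},
            "object": {
                "key": "csv_manifest/dt%3D2021-01-01-00-00/result.csv",
                "eTag": "0123456789abcdef0123456789abcdef",
            },
        }
    }]
}

print(manifest_location(sample_event)["ObjectArn"])
```

Skipping the decode step would point the job at a key that does not exist, because the literal %3D is not part of the stored object name.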
Put the following code in the CDK stack.py:

    fn_create_batch_job = lambda_.Function(self, "CreateS3BatchJobFunction",
        runtime=lambda_.Runtime.PYTHON_3_6,
        handler="lambda_create_batch_job.handler",
        timeout=core.Duration.minutes(5),
        code=lambda_.Code.from_asset("./src"))

    fn_create_batch_job.add_environment("ROLE_ARN", s3_batch_role.role_arn)
    fn_create_batch_job.add_environment(
        "SOURCE_BUCKET_NAME", self.s3_source_bucket_name)

    fn_create_batch_job.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        actions=["s3:CreateJob"],
        resources=["*"]
    ))

    # The function must be allowed to pass the batch role to S3 Batch Operations.
    fn_create_batch_job.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        actions=["iam:PassRole"],
        resources=[s3_batch_role.role_arn]
    ))

    # Trigger the function when Athena writes a CSV manifest.
    s3_destination.add_event_notification(s3.EventType.OBJECT_CREATED,
        s3n.LambdaDestination(fn_create_batch_job),
        {
            "prefix": f'csv_manifest/',
            "suffix": '.csv'
        })

Create ./src/lambda_create_batch_job.py with the following code:

    import json
    import boto3
    import logging
    import os
    from urllib.parse import unquote

    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    s3_control_client = boto3.client('s3control')
    s3_cli = boto3.client('s3')


    def handler(event, context):
        logger.info("Received event: " + json.dumps(event, indent=2))
        account_id = boto3.client('sts').get_caller_identity().get('Account')
        bucket_name = event['Records'][0]['s3']['bucket']['name']
        bucket_arn = event['Records'][0]['s3']['bucket']['arn']
        file_key = event['Records'][0]['s3']['object']['key']
        e_tag = event['Records'][0]['s3']['object']['eTag']

        logger.info('Reading {} from {}'.format(file_key, bucket_name))

        response = s3_control_client.create_job(
            AccountId=account_id,
            ConfirmationRequired=False,
            Operation={
                # Copy every object in the manifest under the tmp_transition prefix.
                'S3PutObjectCopy': {
                    'TargetResource': bucket_arn,
                    'StorageClass': 'STANDARD',
                    'TargetKeyPrefix': 'tmp_transition'
                },
            },
            Report={
                'Bucket': bucket_arn,
                'Format': 'Report_CSV_20180820',
                'Enabled': True,
                'Prefix': f'report/{os.getenv("SOURCE_BUCKET_NAME")}',
                'ReportScope': 'FailedTasksOnly'
            },
            Manifest={
                'Spec': {
                    'Format': 'S3BatchOperations_CSV_20180820',
                    "Fields": ["Bucket", "Key"]
                },
                'Location': {
                    # The key in the event is URL-encoded, so decode it first.
                    'ObjectArn': f'{bucket_arn}/{unquote(file_key)}',
                    'ETag': e_tag
                }
            },
            Priority=10,
            RoleArn=os.getenv("ROLE_ARN"),
            Tags=[
                {'Key': 'engineer', 'Value': 'yiai'},
            ]
        )
        logger.info("S3 batch job response: " + json.dumps(response, indent=2))
        return

The Lambda function fn_create_batch_job creates an S3 Batch Operations job that copies all the files listed in the CSV manifest to the /tmp_transition prefix of the S3 destination bucket.

S3 Batch Operations is an Amazon S3 data management feature that lets you manage billions of objects at scale. To start an S3 Batch Operations job, we also need to set up an IAM role, S3BatchRole, with the corresponding policies:

    s3_batch_role = iam.Role(self, "S3BatchRole",
        assumed_by=iam.ServicePrincipal("batchoperations.s3.amazonaws.com")
    )

    # Write the copied objects to the destination bucket.
    s3_batch_role.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            s3_destination.bucket_arn,
            f"{s3_destination.bucket_arn}/*"
        ],
        actions=[
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:PutObjectTagging",
            "s3:PutObjectLegalHold",
            "s3:PutObjectRetention",
            "s3:GetBucketObjectLockConfiguration"
        ],
    ))

    # Read the objects from the source bucket.
    s3_batch_role.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            s3_source.bucket_arn,
            f"{s3_source.bucket_arn}/*"
        ],
        actions=[
            "s3:GetObject",
            "s3:GetObjectAcl",
            "s3:GetObjectTagging"
        ],
    ))

    # Read the CSV manifest stored in the destination bucket.
    s3_batch_role.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            f"{s3_destination.bucket_arn}/*"
        ],
        actions=[
            "s3:GetObject",
            "s3:GetObjectVersion",
            "s3:GetBucketLocation"
        ],
    ))

    # Write the job completion report.
    s3_batch_role.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            f"{s3_destination.bucket_arn}/report/{self.s3_source_bucket_name}/*"
        ],
        actions=[
            "s3:PutObject",
            "s3:GetBucketLocation"
        ],
    ))

fn_process_transfer_task and EventBridge custom rule

We will create an EventBridge custom rule that tracks S3 Batch Operations jobs through AWS CloudTrail and sends events with Completed status to the target resource, fn_process_transfer_task.

The Lambda function fn_process_transfer_task will then start a Fargate task programmatically to copy the files under the /tmp_transition prefix to the Azure Storage container democontainer.

    from aws_cdk import (
        aws_cloudtrail as trail_,
        aws_events_targets as targets,
    )

    fn_process_transfer_task = lambda_.Function(self, "ProcessS3TransferFunction",
        runtime=lambda_.Runtime.PYTHON_3_6,
        handler="lambda_process_s3transfer_task.handler",
        timeout=core.Duration.minutes(5),
        code=lambda_.Code.from_asset("./src"))

    fn_process_transfer_task.add_environment("CLUSTER_NAME", cluster_name)
    fn_process_transfer_task.add_environment(
        "PRIVATE_SUBNET1", subnets[0].subnet_id)
    fn_process_transfer_task.add_environment(
        "PRIVATE_SUBNET2", subnets[1].subnet_id)
    fn_process_transfer_task.add_environment(
        "TASK_DEFINITION", task_definition.task_definition_arn)
    fn_process_transfer_task.add_environment(
        "S3_BUCKET_NAME", s3_destination_bucket_name)

    fn_process_transfer_task.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[task_definition.task_definition_arn],
        actions=["ecs:RunTask"],
    ))

    # The function passes both the execution role and the task role to ECS.
    fn_process_transfer_task.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        actions=["iam:PassRole"],
        resources=[task_definition.execution_role.role_arn]
    ))
    fn_process_transfer_task.add_to_role_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        actions=["iam:PassRole"],
        resources=[task_definition.task_role.role_arn]
    ))

    trail = trail_.Trail(self, "CloudTrail",
        send_to_cloud_watch_logs=True)

    event_rule = trail.on_event(self, "S3JobEvent",
        target=targets.LambdaFunction(handler=fn_process_transfer_task)
    )

    # Match only S3 Batch Operations jobs that reached the Complete status.
    event_rule.add_event_pattern(
        source=['aws.s3'],
        detail_type=["AWS Service Event via CloudTrail"],
        detail={
            "eventSource": ["s3.amazonaws.com"],
            "eventName": ["JobStatusChanged"],
            "serviceEventDetails": {
                "status": ["Complete"]
            }
        }
    )

Create ./src/lambda_process_s3transfer_task.py with the following code:

    import json
    import boto3
    import logging
    import os

    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)

    ecs = boto3.client('ecs')


    def handler(event, context):
        logger.info("Received event: " + json.dumps(event, indent=2))
        logger.info("ENV SUBNETS: " + json.dumps(os.getenv('SUBNETS'), indent=3))
        response = ecs.run_task(
            cluster=os.getenv("CLUSTER_NAME"),
            taskDefinition=os.getenv("TASK_DEFINITION"),
            launchType='FARGATE',
            count=1,
            platformVersion='LATEST',
            networkConfiguration={
                'awsvpcConfiguration': {
                    'subnets': [
                        os.getenv("PRIVATE_SUBNET1"),
                        os.getenv("PRIVATE_SUBNET2"),
                    ],
                    'assignPublicIp': 'DISABLED'
                }
            },
            overrides={"containerOverrides": [{
                "name": "azcopy",
                'memory': 512,
                'memoryReservation': 512,
                'cpu': 2,
                'environment': [
                    {
                        # S3 URL of the prefix that AzCopy copies from.
                        'name': 'S3_SOURCE',
                        'value': f'https://s3.{os.getenv("AWS_REGION")}.amazonaws.com/{os.getenv("S3_BUCKET_NAME")}/tmp_transition'
                    }
                ],
            }]})
        return str(response)

Now we have set up the serverless part. Let’s move on to the Fargate task that performs the data replication.

Creating an AWS Fargate task

We will create:

- An ECR image with AzCopy installed. AzCopy is a command-line utility that you can use to copy blobs or files to or from a storage account.
- An ECS cluster with a Fargate task.

Let’s get started.

1) Build ECS, ECR, and Fargate stack.
    from aws_cdk import (
        aws_iam as iam,
        aws_ecr as ecr_,
        aws_ecs as ecs,
        aws_ssm as ssm,
        core,
    )

    ecr = ecr_.Repository(self, "azcopy")

    cluster = ecs.Cluster(self, "DemoCluster",
        vpc=vpc, container_insights=True)

    task_definition = ecs.FargateTaskDefinition(self, "azcopyTaskDef")

    task_definition.add_container("azcopy",
        image=ecs.ContainerImage.from_registry(ecr.repository_uri),
        logging=ecs.LogDrivers.aws_logs(stream_prefix="s32blob"),
        environment={
            'AZURE_BLOB_URL': 'https://mydemostroageaccount.blob.core.windows.net/democontainer/'
        },
        # The SAS token is injected from the SSM Parameter Store at start-up.
        secrets={
            'SAS_TOKEN': ecs.Secret.from_ssm_parameter(
                ssm.StringParameter.from_secure_string_parameter_attributes(
                    self, 'sas',
                    parameter_name='/azure/storage/sas',
                    version=2))
        })

    # Let the task read the objects it copies to Azure.
    # s3:ListBucket is the action that authorizes object listing.
    task_definition.task_role.add_to_policy(iam.PolicyStatement(
        effect=iam.Effect.ALLOW,
        resources=[
            bucket_arn,
            f"{bucket_arn}/*"
        ],
        actions=[
            "s3:GetObject",
            "s3:ListBucket"
        ],
    ))

    ecr.grant_pull(task_definition.obtain_execution_role())

2) Build a Docker image and install AzCopy in it.
Dockerfile:

    FROM alpine AS azcopy
    RUN apk add --no-cache wget \
     && wget https://aka.ms/downloadazcopy-v10-linux -O /tmp/azcopy.tgz \
     && export BIN_LOCATION=$(tar -tzf /tmp/azcopy.tgz | grep "/azcopy") \
     && tar -xzf /tmp/azcopy.tgz $BIN_LOCATION --strip-components=1 -C /usr/bin

    FROM alpine:3.9
    RUN apk update && apk add libc6-compat ca-certificates jq curl
    COPY --from=azcopy /usr/bin/azcopy /usr/local/bin/azcopy
    RUN ldd /usr/local/bin/azcopy
    COPY entrypoint.sh /
    RUN chmod 777 /entrypoint.sh
    ENTRYPOINT ["sh", "/entrypoint.sh"]

entrypoint.sh:

    #!/bin/bash
    echo "export AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI" >> /root/.profile
    # Fetch the task role's temporary credentials from the ECS credentials endpoint.
    json=$(curl "http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}")
    export AWS_ACCESS_KEY_ID=$(echo "$json" | jq -r '.AccessKeyId')
    export AWS_SECRET_ACCESS_KEY=$(echo "$json" | jq -r '.SecretAccessKey')
    export AWS_SESSION_TOKEN=$(echo "$json" | jq -r '.Token')
    # Copy everything under the S3 prefix to the Azure blob container.
    azcopy copy "${S3_SOURCE}" \
      "${AZURE_BLOB_URL}?${SAS_TOKEN}" \
      --recursive=true

Note that to use AzCopy to transfer files from AWS, we need to set up AWS credentials in the container. We can retrieve the AWS credentials using:

    curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI

3) Push the Docker image to ECR.

    eval $(aws ecr get-login --region ap-southeast-2 --no-include-email)
    docker build . -t YOUR_ACCOUNT_ID.dkr.ecr.ap-southeast-2.amazonaws.com/YOUR_ECR_NAME
    docker push YOUR_ACCOUNT_ID.dkr.ecr.ap-southeast-2.amazonaws.com/YOUR_ECR_NAME

If you use AWS CLI v2, where get-login is no longer available, log in by piping aws ecr get-login-password into docker login instead.

Great! We have what we need! You can find the full CDK project for this solution in my GitHub repo. Clone the repo and deploy the stack:

    cd CDK-S3toblob
    pip install -r requirements.txt
    cdk deploy

Once the stack has been created successfully, navigate to the AWS CloudFormation console, locate the stack we just created, and go to the Resources tab to find the deployed resources.

Now it’s time to test our workflow. Go to the S3 source bucket demo-databucket-source and upload as many files as you like into different folders (prefixes). Wait 24 hours for the next inventory report to be generated; you will then see the whole pipeline start running, and the files will eventually be copied to the Azure container democontainer.

We should see the logs of the Fargate task, as in the screenshot below. We can also monitor, troubleshoot, and set alarms for ECS resources using CloudWatch Container Insights.

Conclusion

In this article, I introduced an approach to automating data replication from AWS S3 to Microsoft Azure Storage. I walked you through using CDK to deploy the VPC, S3, Lambda, CloudTrail, and Fargate resources, and using an ARM template to deploy the Azure services. I also showed how to use the AWS Data Wrangler library and Athena to create a table and query it.

I hope you have found this article useful. You can find the complete project in my GitHub repo.