All You Need to Know to Repatriate from AWS S3 to MinIO

by MinIO, March 22nd, 2024

Too Long; Didn't Read

Let's dig a little deeper into the costs and savings associated with repatriation to make it easier for you to put together your own analysis.


The response to our previous post, How to Repatriate From AWS S3 to MinIO, was extraordinary - we’ve fielded dozens of calls from enterprises asking us for repatriation advice. We have aggregated those responses into this new post, where we dig a little deeper into the costs and savings associated with repatriation to make it easier for you to put together your own analysis. Data migration is a daunting task for many. In practice, most organizations direct new data to MinIO, migrate old data from the cloud at their own pace, or leave the old data in place where it no longer grows.

Repatriation Overview

To repatriate data from AWS S3, you will follow these general guidelines:


  1. Review Data Requirements: Determine the specific buckets and objects that need to be repatriated from AWS S3. Make sure you understand business needs and compliance requirements on a bucket-by-bucket basis.


  2. Identify Repatriation Destination: You’ve already decided to repatriate to MinIO, now you can choose to run MinIO in an on-premises data center or at another cloud provider or colocation facility. Using the requirements from #1, you will select hardware or instances for forecasted storage, transfer and availability needs.


  3. Data Transfer: Plan and execute the transfer of data from AWS S3 to MinIO. Simply use MinIO's built-in Batch Replication or mirror using the MinIO Client (see How to Repatriate From AWS S3 to MinIO for details). There are several additional methods you can use for data transfer, such as using AWS DataSync, AWS Snowball or TD SYNNEX data migration, or directly using AWS APIs.


  4. Data Access and Permissions: Ensure that appropriate access controls and permissions are set up for the repatriated data on a per-bucket basis. This includes IAM and bucket policies for managing user access, authentication, and authorization to ensure the security of the data.


  5. Object Locks: It is critical to preserve the object lock retention and legal hold policies after the migration. The target object store has to interpret the rules in the same way as Amazon S3. If you are unsure, ask for the Cohasset Associates Compliance Assessment on the target object store implementation.


  6. Data Lifecycle Management: Define and implement a data lifecycle management strategy for the repatriated data. This includes defining retention policies, backup and recovery procedures, and data archiving practices on a per-bucket basis.


  7. Data Validation: Validate the transferred data to ensure its integrity and completeness. Perform the necessary checks and tests to confirm that the data was transferred without corruption or loss. After the transfer, verify that object names, ETags, metadata, checksums and the total object count all match between the source and destination.


  8. Update Applications and Workflows: The good news is that if you follow cloud-native principles to build your applications, then all you will have to do is reconfigure them for the new MinIO endpoint. However, if your applications and workflows were designed to work with the AWS ecosystem, make the necessary updates to accommodate the repatriated data. This may involve updating configurations, reconfiguring integrations or in some cases modifying code.


  9. Monitor and Optimize: Continuously monitor and optimize the repatriated data environment to ensure optimal performance, cost-efficiency, and adherence to data management best practices.
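Step 3 above is usually the heart of the project. Here is a minimal sketch of the `mc mirror` path; the alias names, endpoints, credentials and bucket names are placeholders you would replace with your own, and `DRY_RUN=1` prints each command instead of executing it, so nothing touches real credentials:

```shell
# A minimal sketch of the data transfer step with the MinIO Client (mc).
# Alias names, endpoints, credentials and bucket names are placeholders.
# DRY_RUN=1 prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Register both endpoints once.
run mc alias set s3 https://s3.amazonaws.com ACCESS_KEY SECRET_KEY
run mc alias set minio https://minio.example.net ACCESS_KEY SECRET_KEY

# Mirror each hot/warm bucket; -a preserves metadata and permissions.
for bucket in bucket-a bucket-b; do
  run mc mirror -a "s3/$bucket" "minio/$bucket"
done
```

Set `DRY_RUN=0` only after reviewing the printed commands against your own endpoints.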

Repatriation Steps

There are many factors to consider when budgeting and planning for cloud repatriation. Fortunately, our engineers have done this with many customers and we’ve developed a detailed plan for you. We have customers that have repatriated everything from a handful of workloads to hundreds of petabytes.


The biggest planning task is to think through choices around networking, leased bandwidth, server hardware, archiving costs for the data not selected to be repatriated, and the human cost of managing and maintaining your own cloud infrastructure. Estimate these costs and plan for them. Cloud repatriation costs will include data egress fees for moving the data from the cloud back to the data center. These fees are intentionally high enough to compel cloud lock-in. Take note of these high egress fees - they substantiate the economic argument to leave the public cloud because, as the amount of data you manage grows, the egress fees increase. Therefore, if you’re going to repatriate, it pays to take action sooner rather than later.


We’re going to focus on data and metadata that must be moved – this is eighty percent of the work required to repatriate. Metadata includes bucket properties and policies (access management based on access/secret key, lifecycle management, encryption, anonymous public access, object locking and versioning).


Let’s focus on data (objects) for now. For each namespace you want to migrate, take inventory of the buckets and objects you want to move. It is likely that your DevOps team already knows which buckets hold important current data. You can also use Amazon S3 Inventory. At a high level, this will look something like:


| Namespace | Total Buckets | Total Object Count | Total Object Size (GB) | Daily Total Upload (TB) | Daily Total Download (TB) |
|-----------|---------------|--------------------|------------------------|-------------------------|---------------------------|
| ns-001    | 166           | 47,751,258         | 980,014.48             | 50.04                   | 14.80                     |
| ns-002    | 44            | 24,320,810         | 615,033.35             | 23.84                   | 675.81                    |
| ns-002    | 648           | 88,207,041         | 601,298.91             | 328.25                  | 620.93                    |
| ns-001    | 240           | 68,394,231         | 128,042.16             | 62.48                   | 12.45                     |
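If you use Amazon S3 Inventory to build this table, the daily manifest CSVs can be rolled up with a few lines of shell. A sketch, assuming a simplified inventory.csv with bucket, key and size-in-bytes columns (real inventory reports let you choose the fields, so adjust the positions accordingly):

```shell
# Sketch: roll an S3 Inventory CSV up into per-bucket totals.
# inventory.csv and its column order (bucket, key, size-in-bytes) are
# assumptions; adjust $1/$3 to match your inventory configuration.
summarize() {
  awk -F',' '
    { gsub(/"/, "", $0); count[$1]++; bytes[$1] += $3 }
    END { for (b in count) printf "%s: %d objects, %.2f GB\n", b, count[b], bytes[b] / 1e9 }
  ' "$1"
}

if [ -f inventory.csv ]; then summarize inventory.csv; fi
```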


The next step is to list, by namespace, each bucket and its properties for every bucket you’re going to migrate. Note the application(s) that store and read data in that bucket. Based on usage, classify each bucket as hot, warm or cold tier data.


In an abridged version, this will look something like:


| Bucket Name | Properties               | App(s)                 | Hot/Warm/Cold Tier |
|-------------|--------------------------|------------------------|--------------------|
| A           | Copy and paste JSON here | Spark, Iceberg, Dremio | Hot                |
| B           | Copy and paste JSON here | Elastic                | Warm               |
| C           | Copy and paste JSON here | Elastic (snapshots)    | Cold               |


You have some decisions to make about data lifecycle management at this point. Pay close attention, because this is a great way to save money on AWS fees. Categorize objects in each bucket as hot, warm or cold based on how frequently they are accessed. Migrating cold tier buckets directly to S3 Glacier is an easy win – there’s no reason to incur egress fees to download data just to upload it again.


Depending on the amount of data you’re repatriating, you have a few options to choose how to migrate. We recommend that you load and work with new data on the new MinIO cluster while copying hot and warm data to the new cluster over time. The amount of time and bandwidth needed to copy objects will, of course, depend on the number and size of the objects you’re copying.


Here’s where it will be very helpful to calculate the total data that you’re going to repatriate from AWS S3. Look at your inventory and total the size of all the buckets that are classified as hot and warm.


Total Hot and Warm Tier Data = 1,534,096.7 GB

Available bandwidth = 10 Gbps

Minimum Transfer Time required (total object size / available bandwidth) = 14.2 days
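That minimum transfer time falls straight out of the arithmetic, assuming decimal GB and a fully saturated link (real-world throughput will be lower):

```shell
# Transfer time = total size / available bandwidth.
# Assumes decimal GB and a fully utilized 10 Gbps link.
awk -v gb=1534096.7 -v gbps=10 'BEGIN {
  seconds = gb * 8 / gbps          # GB -> gigabits, then divide by gigabits/second
  printf "%.1f days\n", seconds / 86400
}'
```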


Calculate data egress fees based on the above total. I’m using list price, but your organization may qualify for a discount from AWS. I’m also using 10 Gbps as the connection bandwidth, but you may have more or less at your disposal. Finally, I’m working from the assumption that one-third of S3 data will merely be shifted to S3 Glacier Deep Archive.


Total Data Tiered to S3 Glacier = 767,048.337 GB

S3 to S3 Glacier transfer fees ($0.05/1000 objects) = $3,773.11

S3 Glacier Deep Archive monthly storage fee = $760


Don’t forget to budget for S3 Glacier Deep Archive usage moving forward.


Total Data to be Transferred = 1,534,096.7 GB

First 10 TB at $0.09/GB = $900

Next 40 TB at $0.085/GB = $3,400

Next 100 TB at $0.07/GB = $7,000

Additional over 150 TB at $0.05/GB = $69,205

Total Egress Fees = $80,505
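The tiered total can be sanity-checked with a short script (AWS list prices as of this writing; decimal GB/TB assumed):

```shell
# Sanity-check the tiered egress bill. Rates are list prices per GB.
awk -v total=1534096.7 'BEGIN {
  cap[1] = 10000;  rate[1] = 0.09    # first 10 TB
  cap[2] = 40000;  rate[2] = 0.085   # next 40 TB
  cap[3] = 100000; rate[3] = 0.07    # next 100 TB
  cap[4] = total;  rate[4] = 0.05    # everything over 150 TB
  remaining = total
  for (i = 1; i <= 4; i++) {
    gb = (remaining < cap[i]) ? remaining : cap[i]
    fee += gb * rate[i]
    remaining -= gb
  }
  printf "$%.0f\n", fee
}'
```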


For the sake of simplicity, the above calculation includes neither the fee for per object operations ($0.40/1m) nor the cost of LISTing ($5/1m). For very large repatriation projects, we can also compress objects before sending them across the network, saving you some of the cost of egress fees.


Another option is to use AWS Snowball to transfer objects. Snowball devices hold 80 TB each, so we know up front that we need 20 of them for our repatriation effort. The per-device fee includes 10 days of use, plus 2 days for shipping. Additional days are billed at $30 per device per day.


20 Snowball Devices Service Fee ($300 ea) = $6,000

R/T shipping (3-5 days at $400/device) = $8,000

S3 data out ($0.02/GB) = $30,682

Total Snowball Fees = $44,681.93
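The device count and total can be recomputed from the figures above (decimal GB/TB assumed):

```shell
# Snowball sizing and cost check: 80 TB per device, $300 service fee,
# $400 round-trip shipping per device, $0.02/GB S3 data out.
awk -v gb=1534096.7 'BEGIN {
  devices = int(gb / 80000); if (gb > devices * 80000) devices++
  total = devices * 300 + devices * 400 + gb * 0.02
  printf "%d devices, total $%.2f\n", devices, total
}'
```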


AWS will charge you standard request, storage, and data transfer rates to read from and write to AWS services including Amazon S3 and AWS Key Management Service (KMS). There are further considerations when working with Amazon S3 storage classes. For S3 export jobs, data transferred to your Snow Family device from S3 are billed at standard S3 charges for operations such as LIST, GET, and others. You are also charged standard rates for Amazon CloudWatch Logs, Amazon CloudWatch Metrics, and Amazon CloudWatch Events.


Now we know how long it will take to migrate this massive amount of data and the cost. Make a business decision as to which method meets your needs based on the combination of timing and fees.


At this point, we also know the requirements for the hardware needed to run MinIO on-prem or at a colocation facility. Take the requirement above for 1.5PB of storage, estimate data growth, and consult our Recommended Hardware & Configuration page and Selecting the Best Hardware for Your MinIO Deployment.


The first step is to recreate your S3 buckets in MinIO. You’re going to have to do this regardless of how you choose to migrate objects. While both S3 and MinIO store objects using server-side encryption, you don’t have to worry about migrating encryption keys. You can connect to your KMS of choice using MinIO KES to manage encryption keys. This way, new keys will be automatically generated for you as encrypted tenants and buckets are created in MinIO.


You have multiple options to copy objects: Batch Replication and mc mirror. Our previous blog post, How to Repatriate From AWS S3 to MinIO, includes detailed instructions for both methods. You can copy objects directly from S3 to on-prem MinIO, or use a temporary MinIO cluster running on EC2 to query S3 and then mirror to on-prem MinIO.


Typically, customers use tools we wrote combined with AWS Snowball or TD SYNNEX’s data migration hardware and services to move larger amounts of data (over 1 PB).


MinIO recently partnered with Western Digital and TD SYNNEX to field a Snowball alternative. Customers can schedule windows to take delivery of the Western Digital hardware and pay for what they need during the rental period. More importantly, the service is not tied to a specific cloud - meaning the business can use the service to move data into, out of, and across clouds - all using the ubiquitous S3 protocol. Additional details on the service can be found on the Data Migration Service page on the TD SYNNEX site.


Bucket metadata, including policies and bucket properties, can be read using get-bucket S3 API calls and then set up in MinIO. When you sign up for MinIO SUBNET, our engineers will work with you to migrate these settings from AWS S3: access management based on access key/secret key, lifecycle management policies, encryption, anonymous public access, immutability and versioning. One note about versioning: the AWS version ID isn’t usually preserved when data is migrated, because each version ID is an internal UUID. This is largely not a problem for customers because objects are typically called by name. However, if the AWS version ID is required, then we have an extension that will preserve it in MinIO and we’ll help you enable it.


Pay particular attention to IAM and bucket policies. S3 isn’t going to be the only part of AWS’s infrastructure that you leave behind. You will have a lot of service accounts for applications to use when accessing S3 buckets. This would be a good time to list and audit all of your service accounts. Then you can decide whether or not to recreate them in your identity provider. If you choose to automate, then use Amazon Cognito to share IAM information with external OpenID Connect IDPs and AD/LDAP.


Pay particular attention to Data Lifecycle Management, such as object retention, object locking and archive/tiering. Run a get-bucket-lifecycle-configuration on each bucket to obtain a human-readable JSON list of lifecycle rules. You can easily recreate AWS S3 settings using MinIO Console or MinIO Client (mc). Use commands such as get-object-legal-hold and get-object-lock-configuration to pinpoint objects that require special security and governance treatment.
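As a sketch of that audit, assuming placeholder bucket names (`DRY_RUN=1` prints the aws CLI calls rather than executing them, so no credentials are needed):

```shell
# Hedged sketch: dump each bucket's lifecycle and object-lock settings
# before recreating them in MinIO. Bucket names are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

for bucket in bucket-a bucket-b bucket-c; do
  run aws s3api get-bucket-lifecycle-configuration --bucket "$bucket"
  run aws s3api get-object-lock-configuration --bucket "$bucket"
done
```

Redirect the JSON output per bucket when running for real, so you have a record to recreate against in the MinIO Console or mc.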


While we’re on the subject of lifecycle, let’s talk about backup and disaster recovery for a moment. Do you want an additional MinIO cluster to replicate to, for backup and disaster recovery?


After objects are copied from AWS S3 to MinIO, it’s important to validate data integrity. The easiest way to do this is to use the MinIO Client to run mc diff against old buckets in S3 and new buckets on MinIO. This will compute the difference between the buckets and return a list of only those objects that are missing or different. This command takes the arguments of the source and target buckets. For your convenience, you may want to create aliases for S3 and MinIO so you don’t have to keep typing out full addresses and credentials. For example:


mc diff s3/bucket1 minio/bucket1 


The great news is that all you have to do is point existing apps at the new MinIO endpoint. Configurations can be rewritten app by app over a period of time. Migrating data in object storage is less disruptive than migrating a filesystem: just change the URL to read/write from the new cluster. Note that if you previously relied on AWS services to support your applications, those won’t be present in your data center, so you’ll have to replace them with open-source equivalents and rewrite some code. For example, Athena can be replaced with Spark SQL, Apache Hive or Presto; Kinesis with Apache Kafka; and AWS Glue with Apache Airflow.


If your S3 migration is part of a larger effort to move an entire application on-prem, then chances are you used S3 event notifications to call downstream services when new data arrived. If this is the case, then do not fear - MinIO supports event notification as well. The most straightforward migration here would be to implement a custom webhook to receive the notification. However, if you need a destination that is more durable and resilient, then use messaging services such as Kafka or RabbitMQ. We also support sending events to databases such as PostgreSQL and MySQL.
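Wiring a migrated bucket back up might look like this, assuming a webhook target named `primary` has already been configured on the MinIO server (the alias, bucket name and target ID are placeholders):

```shell
# Sketch: subscribe a pre-configured webhook target to bucket events in MinIO.
# The "minio" alias, bucket name and "primary" target ID are placeholders.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi; }

# Fire the webhook on object creation and deletion.
run mc event add minio/bucket-a arn:minio:sqs::primary:webhook --event put,delete
```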


Now that you’ve completed repatriating, it’s time to turn your attention to storage operation, monitoring and optimization. The good news is that no optimization is needed for MinIO – we’ve built optimization right into the software so you know you’re getting the best performance for your hardware. You’ll want to start monitoring your new MinIO cluster to assess resource utilization and performance on an ongoing basis. MinIO exposes metrics via a Prometheus endpoint that you can consume in your monitoring and alerting platform of choice. For more on monitoring, please see Multi-Cloud Monitoring and Alerting with Prometheus and Grafana and Metrics with MinIO using OpenTelemetry, Flask, and Prometheus.


With SUBNET, we have your back when it comes to Day 2 operations with MinIO. Subscribers gain access to built-in automated troubleshooting tools to keep their clusters running smoothly. They also get unlimited, direct-to-engineer support in real-time via our support portal. We also help you future-proof your object storage investment with an annual architecture review.

Migrate and Save

It’s far from a secret that the days of writing blank checks to cloud providers are gone. Many businesses are currently evaluating their cloud spend to find potential savings. Now you have everything you need to start your migration from AWS S3 to MinIO, including concrete technical steps and a financial framework.


If you get excited about the prospect of repatriation cost savings, then please reach out to us at [email protected].

