MinIO runs on anything – bare metal, Kubernetes, Docker, Linux and more. Organizations choose to run MinIO to host their data on any of these platforms, and increasingly rely on multiple platforms to satisfy multiple requirements. The choice of underlying hardware and OS is based on a number of factors, primarily the amount of data to be stored in MinIO plus requirements for integration with other cloud-native software, performance and security.
Many of our customers run MinIO on bare metal, while the majority run on Kubernetes. Running multiple instances of MinIO in a containerized architecture that is orchestrated by Kubernetes is extremely efficient. MinIO customers roll out new regions and update services without disruption, with separate Kubernetes clusters running in each region, and the operational goal of shared-nothing for greatest resiliency and scalability.
Customers switch to MinIO for a variety of reasons, including:
Given the diverse reasons for adopting MinIO and the variety of environments in which it runs, it's realistic to assume that you already have data stored in a number of other sources that you would want to get into MinIO.
In this post, let's review some of the tools available to get data out of S3, local FileSystem, NFS, Azure, GCP, Hitachi Content Platform, Ceph, and others, and into MinIO clusters where it can be exposed to cloud-native AI/ML and analytics packages.
To get started, we’ll be using the MinIO Client (mc) during the course of this post for a few of these options. Please be sure to install it and set the alias to your running MinIO Server.
mc alias set destminio https://myminio.example.net minioadminuser minioadminpassword
We will be adding some more “source” aliases as we go through the different methods.
The majority of use cases for migrating data into MinIO start with a mounted filesystem or NFS volume. In this simple configuration, you can use mc mirror to sync the data from the source to the destination. Think of mc mirror as a Swiss Army knife for data synchronization. It spares the user from having to determine the best way to interact with the source from which the objects are fetched. It supports a number of sources and, based on the source you are pulling from, uses the right functions to read from it.
For example, let's start with a simple FileSystem that is mounted from a physical hard disk, virtual disk, or even something like a GlusterFS mount. As long as it's a file system readable by the OS, MinIO can read it too:
filesystem           kbytes    used      avail    capacity  mounted on
/dev/root            6474195   2649052   3825143  41%       /
/dev/stand           24097     5757      18340    24%       /stand
/proc                0         0         0        0%        /proc
/dev/fd              0         0         0        0%        /dev/fd
/dev/_tcp            0         0         0        0%        /dev/_tcp
/dev/dsk/c0b0t0d0s4  10241437  4888422   5353015  48%       /home
/dev/dsk/c0b0t1d0sc  17422492  12267268  5155224  71%       /home2
Let’s assume your objects are in /home/mydata. You would then run the following command to mirror the objects (if the mydata bucket does not already exist, you have to create it first):
mc mirror /home/mydata destminio/mydata
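This command copies objects that exist in the source but are missing from the destination. By default it neither overwrites objects that already exist at the destination nor removes objects that have been deleted from the source. To overwrite existing objects that were modified in the source, pass the --overwrite flag. To delete destination objects that are no longer in the source, pass the --remove flag, and to keep copying new objects as they are added to the source, pass the --watch flag.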
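As a minimal sketch of the steps above, the function below creates the destination bucket if it is missing and then mirrors a directory into it. The function name mirror_dir_to_minio and its two arguments are illustrative and not part of mc. Note that --remove deletes destination objects that are absent from the source, so leave it out if the bucket holds anything else.

```shell
# Sketch: create the destination bucket if needed, then mirror a local
# directory into it. mirror_dir_to_minio is a hypothetical helper name.
mirror_dir_to_minio() {
  local src="$1"   # local directory, e.g. /home/mydata
  local dest="$2"  # alias/bucket, e.g. destminio/mydata

  # --ignore-existing makes bucket creation safe to re-run.
  mc mb --ignore-existing "$dest" || return 1

  # --overwrite replaces destination objects that differ from the source;
  # --remove deletes destination objects that are not in the source.
  mc mirror --overwrite --remove "$src" "$dest"
}
```

With the alias from earlier in place, it would be called as mirror_dir_to_minio /home/mydata destminio/mydata.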
Network File System (NFS) is generally used to store objects or data that are not accessed often because, while ubiquitous, the protocol is often very slow across the network. Nonetheless, a lot of ETL and some legacy systems use NFS as a repository for data to be used for operations, analytics, AI/ML, and additional use cases. It would make better sense for this data to live on MinIO because of the scalability, security and high performance of a MinIO cluster, coupled with MinIO’s ability to provide services to cloud-native applications using the S3 API.
Install the required packages to mount the NFS volume
apt install nfs-common
On the NFS server, be sure to add the /home directory to /etc/exports
/home client_ip(rw,sync,no_root_squash,no_subtree_check)
Note: Be sure to restart your NFS server, for example on Ubuntu servers
systemctl restart nfs-kernel-server
Create a directory to serve as the mount point for the NFS volume
mkdir -p /nfs/home
Mount the NFS volume
mount <nfs_host>:/home /nfs/home
Copy the data from NFS to MinIO
mc mirror /nfs/home destminio/nfsdata
There you go, now you can move your large objects from NFS to MinIO.
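One thing worth guarding against is mirroring an empty directory because the NFS mount silently failed. The sketch below checks the mount point first and only then runs mc mirror. The helper name mirror_nfs_mount is hypothetical, and it assumes the mountpoint utility from util-linux is available on the client.

```shell
# Sketch: refuse to mirror unless the NFS export is actually mounted.
# mirror_nfs_mount is a hypothetical helper name, not an mc command.
mirror_nfs_mount() {
  local mount_dir="$1"  # local mount point, e.g. /nfs/home
  local dest="$2"       # alias/bucket, e.g. destminio/nfsdata

  # mountpoint -q exits non-zero when the directory is not a mount point.
  if ! mountpoint -q "$mount_dir"; then
    echo "error: $mount_dir is not a mount point" >&2
    return 1
  fi

  mc mirror "$mount_dir" "$dest"
}
```

For the example in this section, the call would be mirror_nfs_mount /nfs/home destminio/nfsdata.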
As we mentioned earlier, mc mirror is a Swiss Army knife of data synchronization. In addition to filesystems, it also copies objects from S3 or S3 API compatible stores and mirrors them to MinIO. One of the more popular use cases is mirroring an Amazon S3 bucket.
Follow these steps to create an AWS S3 bucket in your account. If you already have an existing account with data we could use that too.
Once a bucket has been created or data has been added to an existing bucket, create a new IAM user with an access key and secret key, and attach a policy allowing access only to our bucket. Save the generated credentials for the next step.
We can work with any S3 compatible storage using the MinIO Client. Next, let’s add an alias for Amazon S3 using the credentials we downloaded
mc alias set s3 https://s3.amazonaws.com BKIKJAA5BMMU2RHO6IBB V7f1CwQqAcwo80UEIJEjc5gVQUSSx5ohQ9GSrr12 --api S3v4
Use mc mirror to copy the data from S3 to MinIO
mc mirror s3/mybucket destminio/mydata
Depending on the amount of data, network speeds and the physical distance from the region where the bucket data is stored, it might take a few minutes or more for you to mirror all the data. You will see a message when mc is done copying all the objects.
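Once the copy finishes, it can be reassuring to confirm that both sides match. The sketch below wraps mc diff for that purpose. It assumes mc diff prints one line per differing object and nothing when the two sides are identical, and verify_mirror is a hypothetical helper name.

```shell
# Sketch: compare source and destination after a mirror.
# Assumes `mc diff` prints nothing when both sides match.
verify_mirror() {
  local src="$1"   # e.g. s3/mybucket
  local dest="$2"  # e.g. destminio/mydata
  local differences

  # Declared separately so the exit status of mc is not masked by `local`.
  differences="$(mc diff "$src" "$dest")" || return 2

  if [ -n "$differences" ]; then
    echo "$differences"
    return 1
  fi
  echo "in sync: $src and $dest"
}
```

Running verify_mirror s3/mybucket destminio/mydata after the mirror either lists what still differs or reports that the two are in sync.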
For the next set of tools, we write dedicated scripts to satisfy some of the non-standard edge case data migration requirements that we need to fulfill. One of these is migrating from HDFS and Hadoop. Many enterprises have so much data stored in Hadoop that it's impossible to ignore it and start fresh with a cloud-native platform. It is more feasible to transfer that data to something more modern (and cloud-native) like MinIO and run your ETL and other processes that way. It's rather simple to set up.
Create a file called core-site.xml with the following contents
<configuration>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>https://minio:9000</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>minio-sample</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>minio-sample123</value>
  </property>
</configuration>
Set the following environment variables
export HDFS_SOURCE_PATH=hdfs://namenode:8080/user/minio/testdir
export S3_DEST_PATH=s3a://mybucket/testdir
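A typo in either variable is easy to miss until the job has already started, so a quick pre-flight check can help. The sketch below only validates the URI schemes, on the assumption that the script expects the same hdfs:// and s3a:// forms as the example values above. check_migration_paths is a hypothetical helper and not part of the hdfs-to-minio repository.

```shell
# Sketch: sanity-check the two environment variables before migrating.
# check_migration_paths is a hypothetical helper name.
check_migration_paths() {
  local src="${HDFS_SOURCE_PATH:-}"
  local dest="${S3_DEST_PATH:-}"

  # The source is expected to be an HDFS URI.
  case "$src" in
    hdfs://*) ;;
    *) echo "HDFS_SOURCE_PATH must start with hdfs:// (got '$src')" >&2
       return 1 ;;
  esac

  # The destination is expected to use the S3A connector scheme.
  case "$dest" in
    s3a://*) ;;
    *) echo "S3_DEST_PATH must start with s3a:// (got '$dest')" >&2
       return 1 ;;
  esac

  echo "source: $src"
  echo "destination: $dest"
}
```

Calling check_migration_paths after the two export lines prints both paths, or returns a non-zero status with a message if either one looks wrong.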
Download the following file, chmod +x and run it
curl -LSs -o hdfs-to-minio.sh https://raw.githubusercontent.com/minio/hdfs-to-minio/master/hdfs-to-minio.sh
chmod +x hdfs-to-minio.sh
./hdfs-to-minio.sh
If you’ve been storing data in Hadoop for several years, then this process might take several hours. If it's on a production cluster, then we recommend migrating data in off hours during maintenance windows to minimize the impact of any performance degradation to your Hadoop cluster while data is being mirrored.
More details about migrating from HDFS to MinIO are available in this GitHub Repo, and we’ve got a blog post as well, Migrating from HDFS to Object Storage.
We previously wrote a blog post on Hitachi Content Platform and how to migrate your data to a MinIO cluster. We recommend reading that post for full details, but the crux is as follows.
Once you have the necessary HCP cluster and input file configured, download the migration tool and run the following command to start the migration process
hcp-to-minio migrate --namespace-url https://finance.europe.hcp.example.com \
  --auth-token "HCP bXl1c2Vy:3f3c6784e97531774380db177774ac8d" \
  --host-header "s3testbucket.sandbox.hcp.example.com" \
  --data-dir /mnt/data \
  --bucket s3testbucket \
  --input-file /tmp/data/to-migrate.txt
Last but not least, we’ve kept the elephant in the room until the end. Although aging, Ceph is a popular store for data and it has an S3 compatible API. It is used by other Kubernetes projects, such as Rook, as the backend for object storage. Ceph, however, is an unwieldy behemoth to set up and run. So it's natural that folks would want to move their data to something simpler, easier to maintain and with greater performance.
There are two ways to copy data from Ceph:
Bucket Replication: Creates the object on the destination, but if the object is deleted from the source it will not be deleted on the destination. https://min.io/docs/minio/linux/administration/bucket-replication.html
mc mirror: Synchronizes objects and versions, and can even delete objects on the destination that no longer exist on the source. https://min.io/docs/minio/linux/reference/minio-mc/mc-mirror.html
Similar to S3, since Ceph has an S3 compatible API, you can add an alias to the MinIO Client
mc alias set ceph http://ceph_host:port cephuser cephpass
You can then use mc mirror to copy the data to your MinIO cluster
mc mirror ceph/mydata destminio/mydata
We suggest that you run the mc mirror command with the --watch flag to continuously monitor for objects and sync them to MinIO.
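A Ceph cluster usually holds more than one bucket, so the same two commands tend to get repeated. The sketch below loops over a list of bucket names, creating each bucket on the destination and mirroring it across. The function name migrate_buckets and the second bucket name in the usage line are illustrative assumptions.

```shell
# Sketch: mirror several buckets from one alias to another.
# migrate_buckets is a hypothetical helper name, not an mc command.
migrate_buckets() {
  local src_alias="$1"   # e.g. ceph
  local dest_alias="$2"  # e.g. destminio
  shift 2
  local bucket

  for bucket in "$@"; do
    # Create the destination bucket; harmless if it already exists.
    mc mb --ignore-existing "$dest_alias/$bucket" || return 1
    # One-shot copy; add --watch to the mc mirror call for a continuous sync.
    mc mirror "$src_alias/$bucket" "$dest_alias/$bucket" || return 1
  done
}
```

For example, migrate_buckets ceph destminio mydata logs would copy the mydata bucket from this section plus a hypothetical logs bucket, stopping at the first failure.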
These are just a few examples to show you how easy it is to migrate your data to MinIO. It doesn’t matter whether you are using older legacy protocols such as NFS or the latest and greatest such as S3; MinIO is here to support you.
In this post we went into detail on how to migrate from filesystems and other data stores such as NFS, GlusterFS, Amazon S3, HDFS, HCP, and, last but not least, Ceph. Regardless of the tech stack running against it, MinIO provides a performant, durable, secure, and scalable yet simple software-defined object storage backend.
If you have any questions feel free to reach out to us on Slack!