MinIO makes a powerful primary TileDB backend because both are built for performance and scale. MinIO is a single Go binary that can be launched in many different types of cloud and on-prem environments. It's very lightweight, but also feature-packed with things like
TileDB is used to store data in a variety of applications, such as genomics, geospatial, biomedical imaging, finance, machine learning, and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract away the data storage and management pains, while efficiently accessing the data with your favorite programming language or data science tool via its numerous APIs and integrations.
Let’s dive in and create some test data using TileDB.
Install the tiledb pip module, which should also install the numpy dependency.
% pip3 install tiledb
Collecting tiledb
Downloading tiledb-0.25.0-cp311-cp311-macosx_11_0_arm64.whl (10.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 2.7 MB/s eta 0:00:00
Collecting packaging
Downloading packaging-23.2-py3-none-any.whl (53 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.0/53.0 kB 643.1 kB/s eta 0:00:00
Collecting numpy>=1.23.2
Downloading numpy-1.26.3-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 2.5 MB/s eta 0:00:00
Installing collected packages: packaging, numpy, tiledb
Successfully installed numpy-1.26.3 packaging-23.2 tiledb-0.25.0
Create a test array by running the Python script below; name it tiledb-demo.py.
import tiledb
import numpy as np
import os
# Local path
array_local = os.path.expanduser("./tiledb_demo")
# Create a simple 1D array
tiledb.from_numpy(array_local, np.array([1.0, 2.0, 3.0]))
# Read the array
with tiledb.open(array_local) as A:
    print(A[:])
Run the script
% python3 tiledb-demo.py
[1. 2. 3.]
This will create a directory called tiledb_demo to store the actual data.
% ls -l tiledb_demo/
total 0
drwxr-xr-x 3 aj staff 96 Jan 31 05:27 __commits
drwxr-xr-x 2 aj staff 64 Jan 31 05:27 __fragment_meta
drwxr-xr-x 3 aj staff 96 Jan 31 05:27 __fragments
drwxr-xr-x 2 aj staff 64 Jan 31 05:27 __labels
drwxr-xr-x 2 aj staff 64 Jan 31 05:27 __meta
drwxr-xr-x 4 aj staff 128 Jan 31 05:27 __schema
You can continue using it as is, but it's no bueno if everything is local: if the local disk or node fails, you lose all of your data. Let's do something fun, like reading this same data from a MinIO bucket instead.
We’ll start by pulling the mc Docker image and then use play.min.io to create the bucket.
Pull the mc Docker image
% docker pull minio/mc
Test with MinIO Play by listing all the buckets
% docker run minio/mc ls play
[LONG TRUNCATED LIST OF BUCKETS]
Create a bucket to move our local TileDB data to; name it tiledb-demo.
% docker run minio/mc mb play/tiledb-demo
Bucket created successfully `play/tiledb-demo`.
Copy the contents of the tiledb_demo data directory to the MinIO tiledb-demo bucket
% docker run -v $(pwd)/tiledb_demo:/tiledb_demo minio/mc cp --recursive /tiledb_demo play/tiledb-demo
`/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt` -> `play/tiledb-demo/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt`
`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb`
`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb`
`/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108` -> `play/tiledb-demo/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108`
Total: 3.24 KiB, Transferred: 3.24 KiB, Speed: 1.10 KiB/s
List the contents of tiledb-demo to make sure the data has been copied.
% docker run minio/mc ls play/tiledb-demo/tiledb_demo
[2024-01-15 14:15:57 UTC] 0B __commits/
[2024-01-15 14:15:57 UTC] 0B __fragments/
[2024-01-15 14:15:57 UTC] 0B __schema/
Note: The MinIO Client (mc), or any S3-compatible client, only copies non-empty folders. In the object storage world, data is organized by bucket prefixes, so empty folders are not needed. Hence, you see only these 3 folders and not the rest that we saw in the local folder. In a future blog we’ll dive deeper into how data is organized with prefixes and folders.
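To make the note above concrete, here is a minimal sketch of how a recursive copy maps a local directory tree to object keys. It uses only the Python standard library; the function names `object_keys` and `top_level_prefixes` and the temporary directory layout are illustrative assumptions, not part of mc or TileDB. Since a key is produced per file, a directory that holds no files never yields a key, so there is nothing for it to appear under in the bucket.

```python
import os
import tempfile


def object_keys(root):
    """Return the object keys a recursive copy of `root` would produce.

    One key per file, named by its path relative to the parent of `root`,
    using "/" as the separator. Directories themselves produce no keys.
    """
    parent = os.path.dirname(os.path.abspath(root))
    keys = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), parent)
            keys.append(rel.replace(os.sep, "/"))
    return sorted(keys)


def top_level_prefixes(keys, base):
    """Return the "folders" directly under `base` that the keys imply."""
    prefixes = set()
    for key in keys:
        if key.startswith(base):
            rest = key[len(base):]
            if "/" in rest:
                prefixes.add(rest.split("/", 1)[0] + "/")
    return sorted(prefixes)


# Hypothetical layout: one directory holding a file, one left empty.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "tiledb_demo")
    os.makedirs(os.path.join(root, "__schema"))
    os.makedirs(os.path.join(root, "__meta"))  # stays empty
    with open(os.path.join(root, "__schema", "example"), "w") as f:
        f.write("placeholder")

    demo_keys = object_keys(root)
    demo_prefixes = top_level_prefixes(demo_keys, "tiledb_demo/")

print(demo_keys)      # keys exist only for files
print(demo_prefixes)  # the empty __meta directory implies no prefix
```

The "folders" in a bucket listing are derived from the keys in this way, which is why the empty local directories have no counterpart in the bucket.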
Now let’s try to read the same data directly from the MinIO bucket using the Python code below; name the file tiledb-minio-demo.py.
import tiledb
import numpy as np
# MinIO keys
minio_key = "minioadmin"
minio_secret = "minioadmin"
# The configuration object with MinIO keys
config = tiledb.Config()
config["vfs.s3.aws_access_key_id"] = minio_key
config["vfs.s3.aws_secret_access_key"] = minio_secret
config["vfs.s3.scheme"] = "https"
config["vfs.s3.region"] = ""
config["vfs.s3.endpoint_override"] = "play.min.io:9000"
config["vfs.s3.use_virtual_addressing"] = "false"
# Create TileDB config context
ctx = tiledb.Ctx(config)
# The MinIO bucket URI path of tiledb demo
array_minio = "s3://tiledb-demo/tiledb_demo/"
with tiledb.open(array_minio, ctx=ctx) as A:
    print(A[:])
The output should look familiar
% python3 tiledb-minio-demo.py
[1. 2. 3.]
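The connection settings in the script split a MinIO URL into a scheme and a host:port endpoint. As a minimal sketch, assuming you would rather start from a single endpoint URL, the hypothetical helper `minio_s3_config` below (not part of TileDB) derives the same `vfs.s3.*` keys used above and rejects a URL that lacks an explicit http or https scheme.

```python
from urllib.parse import urlsplit


def minio_s3_config(endpoint_url, access_key, secret_key):
    """Build the vfs.s3.* settings used above from a single endpoint URL.

    `minio_s3_config` is a hypothetical helper, not a TileDB function.
    """
    parts = urlsplit(endpoint_url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("endpoint URL must start with http:// or https://")
    if not parts.hostname:
        raise ValueError("endpoint URL has no hostname")
    return {
        "vfs.s3.aws_access_key_id": access_key,
        "vfs.s3.aws_secret_access_key": secret_key,
        "vfs.s3.scheme": parts.scheme,
        "vfs.s3.region": "",
        # host:port only, without the scheme
        "vfs.s3.endpoint_override": parts.netloc,
        "vfs.s3.use_virtual_addressing": "false",
    }


settings = minio_s3_config("https://play.min.io:9000", "minioadmin", "minioadmin")
for key, value in settings.items():
    if "secret" not in key:  # avoid echoing the secret key
        print(f"{key} = {value!r}")
```

The resulting dictionary should be usable as `tiledb.Config(settings)` in place of assigning each key one by one.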
We've read from MinIO. Next, let's see how to write data directly to a MinIO bucket instead of copying it there from an existing source.
So far we’ve shown you how to read data that already exists, either in local storage or an existing bucket. But if you wanted to start fresh by writing directly to MinIO from the get-go, how would that work? Let’s take a look.
The code to write data directly to the MinIO bucket is the same as above, except for two changes.
The path to the MinIO bucket where TileDB data is stored must be updated to tiledb_minio_demo (instead of tiledb_demo).
We’ll use the tiledb.from_numpy function, as we did earlier with local storage, to create the array to store in the MinIO bucket.
[TRUNCATED]
# The MinIO bucket URI path of tiledb demo
array_minio = "s3://tiledb-demo/tiledb_minio_demo/"
tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))
[TRUNCATED]
After making these 2 changes, run the script and you should see the output below
% python3 tiledb-minio-demo.py
[1. 2. 3.]
If you run the script again it will fail with the below error because it will try to write again.
tiledb.cc.TileDBError: [TileDB::StorageManager] Error: Cannot create array; Array 's3://tiledb-demo/tiledb_minio_demo/' already exists
Just comment out the following line and you can re-run it multiple times.
# tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))
% python3 tiledb-minio-demo.py
[1. 2. 3.]
% python3 tiledb-minio-demo.py
[1. 2. 3.]
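As an alternative to commenting the line out, the script can check whether the array already exists before writing. The sketch below uses `tiledb.object_type`, which reports `"array"` when an array lives at the given URI; the helper name `write_once` is hypothetical and not part of TileDB.

```python
def write_once(uri, data, ctx=None):
    """Create the array at `uri` from `data` unless one already exists.

    Returns True if the array was created, False if it was already there.
    `write_once` is a hypothetical helper name, not a TileDB function.
    """
    # Imported lazily so that defining this helper does not itself need TileDB.
    import tiledb

    # object_type reports "array" when an array already lives at the URI.
    if tiledb.object_type(uri, ctx=ctx) == "array":
        return False
    tiledb.from_numpy(uri, data, ctx=ctx)
    return True
```

Calling `write_once(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))` in place of the `tiledb.from_numpy` line should let the script run repeatedly without hitting the "already exists" error.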
Check the MinIO Play bucket to make sure the data is in there as expected
% docker run minio/mc ls play/tiledb-demo/tiledb_minio_demo/
[2024-01-15 16:45:04 UTC] 0B __commits/
[2024-01-15 16:45:04 UTC] 0B __fragments/
[2024-01-15 16:45:04 UTC] 0B __schema/
There you go, getting data into MinIO is that simple. Did you get the same results as earlier? You should have, but if you didn't there are a few things you can check out.
We’ll look at some common errors you might encounter while trying to read/write to MinIO.
If your access key and secret key are incorrect, you should expect to see an error message like below
tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://tiledb-demo/tiledb_minio_demo/__schema/'... The request signature we calculated does not match the signature you provided. Check your key and signing method.
Next, make sure the hostname and port are correct. Without a proper endpoint, these are the errors you would encounter.
Incorrect Hostname:
tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't resolve host name
Incorrect Port:
tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't connect to server
Last but not least, one of the most cryptic errors I’ve seen is the following
tiledb.cc.TileDBError: [TileDB::S3] Error: … [HTTP Response Code: -1] [Remote IP: 98.44.32.5] : curlCode: 56, Failure when receiving data from the peer
After a ton of debugging it turns out that if you are connecting using http but the MinIO server has TLS activated then you will see the above error. Just be sure the connection scheme is set to the right configuration, in this case, config["vfs.s3.scheme"] = "https".
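The errors above can be collected into a small lookup that points at the setting most likely to be wrong. This is a convenience sketch, not an exhaustive or official list: the substrings come from the messages quoted in this section, and the names `HINTS` and `hint_for` are illustrative.

```python
# Substrings taken from the error messages quoted above, each paired with
# the configuration setting to check first.
HINTS = [
    ("request signature we calculated does not match",
     "check vfs.s3.aws_access_key_id and vfs.s3.aws_secret_access_key"),
    ("Couldn't resolve host name",
     "check the hostname in vfs.s3.endpoint_override"),
    ("Couldn't connect to server",
     "check the port in vfs.s3.endpoint_override"),
    ("Failure when receiving data from the peer",
     "check vfs.s3.scheme; the server may expect https"),
]


def hint_for(error_message):
    """Return a troubleshooting hint for an error message, or None."""
    for needle, hint in HINTS:
        if needle in error_message:
            return hint
    return None


example = ("[TileDB::S3] Error: ... [HTTP Response Code: -1] : curlCode: 56, "
           "Failure when receiving data from the peer")
print(hint_for(example))
```

One way to use it is to wrap the `tiledb.open` call in a `try`/`except tiledb.TileDBError as exc` block and pass `str(exc)` to `hint_for`.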
There is a rap song (you can search for it) where they rap about having stacks on stacks on stacks of *cough* cash. But there is another rap song where they claim they have so many stacks of cash that they can’t be called “stacks” anymore, they are now “racks”. Essentially when your stacks get so big and so high you need racks on racks on racks to store your stacks of cash.
This is an apt comparison because your stacks of data mean as much (or more) to you as the stacks of cash they're rapping about. If only there was something like MinIO to keep all your objects – physical or virtual – safe and readily accessible.
With MinIO in the mix, you can easily scale TileDB to
We’ve added the code snippets used in this blog to a