
Supercharge TileDB Engine with MinIO

by MinIO, March 21st, 2024


MinIO makes a powerful primary TileDB backend because both are built for performance and scale. MinIO is a single Go binary that can be launched in many different types of cloud and on-prem environments. It's very lightweight but also feature-packed, with replication and encryption built in, and it integrates with a wide range of applications. MinIO is the perfect companion for TileDB because of its industry-leading performance and scalability: we’ve benchmarked it at 325 GiB/s (349 GB/s) on GETs and 165 GiB/s (177 GB/s) on PUTs with just 32 nodes of off-the-shelf NVMe SSDs, and it is used to build data lakes and lakehouses for analytics and AI/ML workloads.
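The paired figures in that benchmark are the same throughput in binary (GiB) and decimal (GB) units. A quick sketch of the conversion follows; the per-node number it prints is only a rough average that assumes load is spread evenly across the 32 nodes, which the benchmark summary does not state.

```python
GIB = 2**30   # bytes in one gibibyte
GB = 10**9    # bytes in one gigabyte
NODES = 32    # node count quoted in the benchmark


def gib_to_gb(gib_per_s):
    """Convert a throughput from GiB/s to GB/s."""
    return gib_per_s * GIB / GB


for label, gib in (("GET", 325), ("PUT", 165)):
    # The per-node value is a simple average, assuming an even spread.
    print(f"{label}: {gib} GiB/s = {gib_to_gb(gib):.0f} GB/s, "
          f"about {gib / NODES:.1f} GiB/s per node")
```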


TileDB is used to store data in a variety of fields, such as genomics, geospatial, biomedical imaging, finance, machine learning, and more. The power of TileDB stems from the fact that any data can be modeled efficiently as either a dense or a sparse multi-dimensional array, which is the format used internally by most data science tooling. By storing your data and metadata in TileDB arrays, you abstract away the pains of data storage and management while efficiently accessing the data with your favorite programming language or data science tool via TileDB's numerous APIs and integrations.
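To make the dense versus sparse distinction concrete, here is a plain-Python sketch. It is only a conceptual stand-in: the variable names and helper functions are made up for illustration, and it says nothing about TileDB's actual API or on-disk format.

```python
# Dense: every cell in the 3x4 domain holds a value, so a cell is
# identified by its position alone.
dense = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

# Sparse: only non-empty cells are stored, each keyed by its coordinates.
sparse = {(0, 1): 2.5, (2, 3): 7.0}


def read_dense(arr, row_range, col_range):
    """Return the rectangular block covered by the half-open ranges."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [row[c0:c1] for row in arr[r0:r1]]


def read_sparse(cells, row_range, col_range):
    """Return only the non-empty cells that fall inside the ranges."""
    (r0, r1), (c0, c1) = row_range, col_range
    return {
        (r, c): v
        for (r, c), v in cells.items()
        if r0 <= r < r1 and c0 <= c < c1
    }


print(read_dense(dense, (0, 2), (1, 3)))
print(read_sparse(sparse, (0, 2), (0, 4)))
```

The dense read returns a full block, while the sparse read returns just the cells that exist in the requested region.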

Set Up TileDB

Let’s dive in and create some test data using TileDB.


Install the TileDB pip module, which should also install the numpy dependency.


% pip3 install tiledb


Collecting tiledb

  Downloading tiledb-0.25.0-cp311-cp311-macosx_11_0_arm64.whl (10.4 MB)

 	━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.4/10.4 MB 2.7 MB/s eta 0:00:00

Collecting packaging

  Downloading packaging-23.2-py3-none-any.whl (53 kB)

 	━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.0/53.0 kB 643.1 kB/s eta 0:00:00

Collecting numpy>=1.23.2

  Downloading numpy-1.26.3-cp311-cp311-macosx_11_0_arm64.whl (14.0 MB)

 	━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.0/14.0 MB 2.5 MB/s eta 0:00:00

Installing collected packages: packaging, numpy, tiledb

Successfully installed numpy-1.26.3 packaging-23.2 tiledb-0.25.0


Create a test array by saving the Python script below as tiledb-demo.py.


import tiledb

import numpy as np

import os


# Local path

array_local = os.path.expanduser("./tiledb_demo")


# Create a simple 1D array

tiledb.from_numpy(array_local, np.array([1.0, 2.0, 3.0]))


# Read the array

with tiledb.open(array_local) as A:

	print(A[:])


Run the script

% python3 tiledb-demo.py

[1. 2. 3.]


This will create a directory called tiledb_demo to store the actual data.

% ls -l tiledb_demo/

total 0

drwxr-xr-x  3 aj  staff   96 Jan 31 05:27 __commits

drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __fragment_meta

drwxr-xr-x  3 aj  staff   96 Jan 31 05:27 __fragments

drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __labels

drwxr-xr-x  2 aj  staff   64 Jan 31 05:27 __meta

drwxr-xr-x  4 aj  staff  128 Jan 31 05:27 __schema


You can continue using it as is, but it's no bueno if everything is local: if the local disk or node fails, you lose all your data. Let's do something fun, like reading this same data from a MinIO bucket instead.

Migrating Data to MinIO Bucket

We’ll start by pulling the mc image with Docker and then use play.min.io to create the bucket.


Pull mc docker image

% docker pull minio/mc


Test with MinIO Play by listing all the buckets

% docker run minio/mc ls play


[LONG TRUNCATED LIST OF BUCKETS]


Create a bucket to move our local TileDB data to, name it tiledb-demo.

% docker run minio/mc mb play/tiledb-demo


Bucket created successfully `play/tiledb-demo`.


Copy the contents of the tiledb_demo data directory to the MinIO tiledb-demo bucket


% docker run -v $(pwd)/tiledb_demo:/tiledb_demo minio/mc cp --recursive /tiledb_demo play/tiledb-demo


`/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt` -> `play/tiledb-demo/tiledb_demo/__commits/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21.wrt`

`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/a0.tdb`

`/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb` -> `play/tiledb-demo/tiledb_demo/__fragments/__1706696859767_1706696859767_777455531063403b811b2a2bf79d40e7_21/__fragment_metadata.tdb`

`/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108` -> `play/tiledb-demo/tiledb_demo/__schema/__1706696859758_1706696859758_74e7040e138a4cca93e34aca1c587108`


Total: 3.24 KiB, Transferred: 3.24 KiB, Speed: 1.10 KiB/s


List the contents of tiledb-demo to make sure the data has been copied.


% docker run minio/mc ls play/tiledb-demo/tiledb_demo

[2024-01-15 14:15:57 UTC] 	0B __commits/

[2024-01-15 14:15:57 UTC] 	0B __fragments/

[2024-01-15 14:15:57 UTC] 	0B __schema/


Note: The MinIO Client (mc), like any S3-compatible client, only copies non-empty folders. In the object storage world, data is organized by key prefixes within a bucket rather than by real directories, so empty folders have nothing to represent them. That is why you see only these 3 folders and not the others that were in the local directory. In a future blog we’ll dive deeper into how data is organized with prefixes and folders.
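A small sketch of why empty folders vanish: if object keys are derived only from files, a directory with no files produces no keys and therefore no visible "folder". This illustrates the idea and is not how mc is implemented; the function name and the file name demo.wrt are hypothetical.

```python
import os
import tempfile


def object_keys(local_dir, prefix):
    """Map every file under local_dir to an object key below prefix."""
    keys = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), local_dir)
            keys.append(prefix + "/" + rel.replace(os.sep, "/"))
    return sorted(keys)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "__commits"))
    os.makedirs(os.path.join(tmp, "__labels"))  # left empty on purpose
    # Hypothetical file name standing in for a real commit file.
    open(os.path.join(tmp, "__commits", "demo.wrt"), "w").close()
    keys = object_keys(tmp, "tiledb_demo")

# "Folders" are just the distinct path segments that follow the prefix.
folders = sorted({key.split("/")[1] for key in keys})
print(keys)
print(folders)
```

Only `__commits` shows up as a folder, because `__labels` contributed no keys.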


Now let’s read the same data directly from the MinIO bucket using the Python code below. Name the file tiledb-minio-demo.py.


import tiledb

import numpy as np


# MinIO keys

minio_key = "minioadmin"

minio_secret = "minioadmin"


# The configuration object with MinIO keys

config = tiledb.Config()

config["vfs.s3.aws_access_key_id"] = minio_key

config["vfs.s3.aws_secret_access_key"] = minio_secret

config["vfs.s3.scheme"] = "https"

config["vfs.s3.region"] = ""

config["vfs.s3.endpoint_override"] = "play.min.io:9000"

config["vfs.s3.use_virtual_addressing"] = "false"


# Create TileDB config context

ctx = tiledb.Ctx(config)


# The MinIO bucket URI path of tiledb demo

array_minio = "s3://tiledb-demo/tiledb_demo/"


with tiledb.open(array_minio, ctx=ctx) as A:

    print(A[:])


The output should look familiar

% python3 tiledb-minio-demo.py

[1. 2. 3.]
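Because the scheme and the endpoint have to agree with each other, one option is to derive both from a single URL. The sketch below builds the same vfs.s3.* settings used in the script above as a plain dictionary; the helper name `minio_s3_settings` is hypothetical and not part of TileDB or MinIO.

```python
from urllib.parse import urlparse


def minio_s3_settings(endpoint_url, access_key, secret_key):
    """Build the vfs.s3.* settings from one endpoint URL."""
    parsed = urlparse(endpoint_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected http(s)://host[:port], got {endpoint_url!r}")
    return {
        "vfs.s3.aws_access_key_id": access_key,
        "vfs.s3.aws_secret_access_key": secret_key,
        "vfs.s3.scheme": parsed.scheme,
        "vfs.s3.region": "",
        "vfs.s3.endpoint_override": parsed.netloc,  # host:port, no scheme
        "vfs.s3.use_virtual_addressing": "false",
    }


settings = minio_s3_settings("https://play.min.io:9000", "minioadmin", "minioadmin")
for key, value in settings.items():
    if "secret" not in key:  # avoid echoing the secret key
        print(f"{key} = {value!r}")
```

Each entry can then be copied into the tiledb.Config object key by key, exactly as the script above does by hand.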


We've read from MinIO. Next, let's see how to write data directly to a MinIO bucket instead of copying it there from an existing source.

Writing Directly to the MinIO Bucket

So far we’ve shown you how to read data that already exists, either in local storage or an existing bucket. But if you wanted to start fresh by writing directly to MinIO from the get-go, how would that work? Let’s take a look.


The code to write data directly to the MinIO bucket is the same as above, except for two changes.


The path to the MinIO bucket where TileDB data is stored must be updated to tiledb_minio_demo (instead of tiledb_demo).


We’ll use the tiledb.from_numpy function, as we did earlier with local storage, to create the array to store in the MinIO bucket.


[TRUNCATED]


# The MinIO bucket URI path of tiledb demo

array_minio = "s3://tiledb-demo/tiledb_minio_demo/"


tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))


[TRUNCATED]


After making these 2 changes, run the script and you should see the output below

% python3 tiledb-minio-demo.py

[1. 2. 3.]


If you run the script again, it will fail with the error below because tiledb.from_numpy tries to create an array that already exists.

tiledb.cc.TileDBError: [TileDB::StorageManager] Error: Cannot create array; Array 's3://tiledb-demo/tiledb_minio_demo/' already exists


Just comment out the following line and you can re-run it multiple times.

# tiledb.from_numpy(array_minio, np.array([1.0, 2.0, 3.0]), ctx=tiledb.Ctx(config))


% python3 tiledb-minio-demo.py

[1. 2. 3.]


% python3 tiledb-minio-demo.py

[1. 2. 3.]
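An alternative to commenting the line out is to create the array only when nothing exists at the URI yet. The sketch below shows that pattern with stand-in callables so it runs without a MinIO server; `create_once` is a hypothetical helper, not a TileDB function.

```python
def create_once(uri, exists, create):
    """Call create(uri) only if exists(uri) is false; report whether it ran."""
    if exists(uri):
        return False
    create(uri)
    return True


# Stand-ins: a set plays the role of the bucket. In the real script,
# `exists` and `create` would wrap TileDB calls instead.
stored = set()
uri = "s3://tiledb-demo/tiledb_minio_demo/"

first = create_once(uri, stored.__contains__, stored.add)
second = create_once(uri, stored.__contains__, stored.add)
print(first, second)
```

In the real script, `exists` could be something like `lambda u: tiledb.object_type(u, ctx=ctx) == "array"` and `create` a small wrapper around the tiledb.from_numpy call. Treat both as assumptions to verify against your installed TileDB-Py version.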


Check the MinIO Play bucket to make sure the data is in there as expected


% docker run minio/mc ls play/tiledb-demo/tiledb_minio_demo/

[2024-01-15 16:45:04 UTC] 	0B __commits/

[2024-01-15 16:45:04 UTC] 	0B __fragments/

[2024-01-15 16:45:04 UTC] 	0B __schema/


There you go, getting data into MinIO is that simple. Did you get the same results as earlier? You should have, but if you didn't there are a few things you can check out.

Common Pitfalls

We’ll look at some common errors you might encounter while trying to read/write to MinIO.


If your access key and secret key are incorrect, you should expect to see an error message like below


tiledb.cc.TileDBError: [TileDB::S3] Error: Error while listing with prefix 's3://tiledb-demo/tiledb_minio_demo/__schema/'... The request signature we calculated does not match the signature you provided. Check your key and signing method.


Next, ensure the hostname and port are correct. Without a proper endpoint, these are the errors you would encounter.


Incorrect Hostname:

tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't resolve host name


Incorrect Port:

tiledb.cc.TileDBError: [TileDB::S3] Error: … Couldn't connect to server


Last but not least, one of the most cryptic errors I’ve seen is the following

tiledb.cc.TileDBError: [TileDB::S3] Error: … [HTTP Response Code: -1] [Remote IP: 98.44.32.5] : curlCode: 56, Failure when receiving data from the peer


After a ton of debugging, it turns out that you will see the above error if you connect using http while the MinIO server has TLS enabled. Just be sure the connection scheme matches the server, in this case config["vfs.s3.scheme"] = "https".
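The pitfalls above can be condensed into a small lookup that maps a fragment of the error text to the most likely cause. This is only a triage sketch based on the messages quoted in this post: the same errors can have other causes, and `likely_cause` is a hypothetical helper.

```python
# Fragments of the error messages quoted above, with the cause this post
# associates with each one.
KNOWN_CAUSES = [
    ("request signature we calculated does not match",
     "wrong access key or secret key"),
    ("Couldn't resolve host name",
     "wrong hostname in vfs.s3.endpoint_override"),
    ("Couldn't connect to server",
     "wrong port in vfs.s3.endpoint_override"),
    ("curlCode: 56",
     "vfs.s3.scheme is http but the server has TLS enabled"),
]


def likely_cause(message):
    """Return the first known cause whose fragment appears in message."""
    for fragment, cause in KNOWN_CAUSES:
        if fragment in message:
            return cause
    return "unknown; inspect the full error text"


print(likely_cause("[TileDB::S3] Error: Couldn't resolve host name"))
print(likely_cause("[HTTP Response Code: -1] curlCode: 56, Failure when receiving data"))
```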

Racks on Racks on Racks

There is a rap song (you can search for it) where they rap about having stacks on stacks on stacks of *cough* cash. But there is another rap song where they claim they have so many stacks of cash that they can’t be called “stacks” anymore, they are now “racks”. Essentially when your stacks get so big and so high you need racks on racks on racks to store your stacks of cash.


This is an apt comparison because your stacks of data mean as much (or more) to you as the stacks of cash they're rapping about. If only there were something like MinIO to keep all your objects, physical or virtual, safe and readily accessible.


With MinIO in the mix, you can scale TileDB to multiple racks across multiple datacenters with relative ease. You also get all the features that make MinIO great, like Security and Access Control, Tiering, Object Locking and Retention, and Key Encryption Service (KES), among others, right out of the box. Keeping all your data in MinIO reduces storage complexity and therefore storage costs, while running MinIO on commodity hardware provides the best possible performance-to-cost ratio. MinIO supercharges your TileDB engine with industry-leading performance that makes querying a joy.


We’ve added the code snippets used in this blog to a git repository. If you have any questions on how to connect MinIO to TileDB or migrate data into MinIO be sure to reach out to us on Slack!

