Built on Lance, an open-source columnar data format, LanceDB has some interesting features that make it attractive for AI/ML. For example, LanceDB supports explicit and implicit vectorization with the ability to handle various data types. LanceDB is integrated with leading ML frameworks such as
LanceDB can query data directly in S3-compatible object storage. Pairing it with MinIO is well suited to building high-performance, scalable, cloud-native ML data storage and retrieval systems. MinIO brings performance and flexibility across diverse hardware, locations, and cloud environments, making it a natural choice for such deployments.
Upon completion of this tutorial, you will be prepared to use LanceDB and MinIO to joust with any data challenge.
The
One of its
Vector databases like LanceDB offer distinct advantages for AI and machine learning applications, thanks to their efficient
Natural Language Processing (NLP):
Semantic Search: Find documents or passages similar to a query based on meaning, not just keywords. This powers chatbot responses, personalized content recommendations, and knowledge retrieval systems.
Question Answering: Understand and answer complex questions by finding relevant text passages based on semantic similarity.
Topic Modeling: Discover latent topics in large text collections, useful for document clustering and trend analysis.
Computer Vision:
Image and Video Retrieval: Search for similar images or videos based on visual content, crucial for content-based image retrieval, product search, and video analysis.
Object Detection and Classification: Improve the accuracy of object detection and classification models by efficiently retrieving similar training data.
Video Recommendation: Recommend similar videos based on the visual content of previously watched videos.
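All of these use cases rest on the same operation: ranking stored vectors by how close they are to a query vector. The sketch below shows that idea in plain Python with a brute-force scan. The document labels and two-dimensional embeddings are invented for illustration, and a real vector database would use an index rather than scanning every row.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, items, k=2):
    """Return the labels of the k items most similar to the query vector."""
    ranked = sorted(
        items,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical (label, embedding) pairs; real embeddings have hundreds of dimensions.
documents = [
    ("cat care guide", [0.9, 0.1]),
    ("dog training tips", [0.8, 0.3]),
    ("tax filing checklist", [0.1, 0.95]),
]

print(top_k([1.0, 0.2], documents))
```

The query vector points in roughly the same direction as the two pet-related embeddings, so those rank ahead of the unrelated document.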
Among the plethora of vector databases on the market, LanceDB is particularly well suited for AI and machine learning because it supports querying on S3-compatible storage. Your data is everywhere, so your database should be everywhere too.
Using MinIO with LanceDB offers several benefits, including:
The combination of MinIO and LanceDB provides a high-performance, scalable, cloud-native solution for managing and analyzing large-scale ML datasets.
To follow along with this tutorial, you will need to use
Ensure that Docker Compose is installed by running the following command:
docker compose version
You will also need to install Python. You can download Python from
Optionally, create a virtual environment. It is good practice because it isolates this project's dependencies from the rest of your system. To do so, open a terminal and run:
python -m venv venv
To activate the virtual environment:
On Windows:
.\venv\Scripts\activate
On macOS/Linux:
source venv/bin/activate
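If you are unsure whether the activation worked, a quick check from Python can help. This is a small sketch that relies on the standard library only: inside an environment created with venv, sys.prefix points at the environment while sys.base_prefix points at the base installation.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a venv."""
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_environment())
print("interpreter:", sys.executable)
```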
Begin by cloning the project from
docker compose up minio
This will start up the MinIO container. You can navigate to ‘
Log in with the username minioadmin and the password minioadmin.
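You can also confirm from Python that the server is answering before you continue. The sketch below calls MinIO's liveness endpoint, /minio/health/live, and assumes the S3 API is published on http://localhost:9000. Adjust the endpoint if your container maps a different port.

```python
import http.client
import urllib.error
import urllib.request


def minio_is_live(endpoint="http://localhost:9000", timeout=2.0):
    """Return True if the MinIO liveness probe answers with HTTP 200."""
    url = endpoint.rstrip("/") + "/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx answer all count as "not live".
        return False


print("MinIO live:", minio_is_live())
```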
Next, run the following command to create a MinIO bucket called lance.
docker compose up mc
This command performs a series of
Here's a breakdown of each command:
until (/usr/bin/mc config host add minio http://minio:9000 minioadmin minioadmin) do echo '...waiting...' && sleep 1; done;: This command repeatedly attempts to configure a MinIO host named minio with the specified parameters (endpoint, access key, and secret key) until successful. During each attempt, it echoes a waiting message and pauses for 1 second.
/usr/bin/mc rm -r --force minio/lance;: This command forcefully removes (deletes) all contents within the lance bucket in MinIO.
/usr/bin/mc mb minio/lance;: This command creates a new bucket named lance in MinIO.
/usr/bin/mc policy set public minio/lance;: This command sets the policy of the lance bucket to public, allowing anonymous read and write access. That is convenient for a local demo but should not be used for real data.
exit 0;: This command ensures that the script exits with a status code of 0, indicating successful execution.
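The first command in that list is a wait-until-ready loop, a pattern worth reusing in your own scripts whenever one container depends on another. Here is a minimal Python sketch of the same idea. The function name and its parameters are our own, and unlike the shell loop, which retries forever, this version gives up after max_attempts.

```python
import time


def retry_until_success(action, delay_seconds=1.0, max_attempts=30, sleep=time.sleep):
    """Call action() until it returns a truthy value; return the attempt count.

    Raises TimeoutError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if action():
            return attempt
        print("...waiting...")
        sleep(delay_seconds)
    raise TimeoutError(f"gave up after {max_attempts} attempts")


# Stand-in for a real readiness check: fails twice, then succeeds.
outcomes = iter([False, False, True])
attempts_needed = retry_until_success(lambda: next(outcomes), delay_seconds=0)
print("succeeded on attempt", attempts_needed)
```

In practice, action would be a function that tries to reach MinIO and returns True once it responds.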
Unfortunately, LanceDB does not have native S3 support, and as a result, you will have to use something like boto3 to connect to the MinIO container you made. As LanceDB matures we look forward to native S3 support that will make the user experience all the better.
The sample script below will get you started.
Install the required packages using pip. Create a file named requirements.txt with the following content:
lancedb~=0.4.1
boto3~=1.34.9
botocore~=1.34.9
Then run the following command to install the packages:
pip install -r requirements.txt
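To confirm that the install picked up the versions you expect, you can query the package metadata from Python. This sketch uses only the standard library's importlib.metadata; the helper name is our own.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["lancedb", "boto3", "botocore"]).items():
    print(f"{name}: {version or 'not installed'}")
```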
You will need to change your credentials if your method of creating the MinIO container differs from the one outlined above.
Save the script below to a file, e.g., lancedb_script.py.
import lancedb
import os
import boto3
import botocore
import random


def generate_random_data(num_records):
    # Build a list of records, each with a 2-D vector, an item name, and a price
    data = []
    for _ in range(num_records):
        record = {
            "vector": [random.uniform(0, 10), random.uniform(0, 10)],
            "item": f"item_{random.randint(1, 100)}",
            "price": round(random.uniform(5, 100), 2)
        }
        data.append(record)
    return data


def main():
    # Set credentials, endpoint, and region as environment variables
    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
    os.environ["AWS_ENDPOINT"] = "http://localhost:9000"
    os.environ["AWS_DEFAULT_REGION"] = "us-east-1"
    minio_bucket_name = "lance"

    # Create a boto3 session with path-style access
    session = boto3.Session()
    s3_client = session.client("s3", config=botocore.config.Config(s3={'addressing_style': 'path'}))

    # Connect to LanceDB with an S3 URI; the credentials and endpoint
    # are read from the environment variables set above
    db_uri = f"s3://{minio_bucket_name}/"
    db = lancedb.connect(db_uri)

    # Create a table with more interesting data
    table = db.create_table("mytable", data=generate_random_data(100))

    # Search the table for the 5 vectors nearest to [5, 5]
    result = table.search([5, 5]).limit(5).to_pandas()
    print(result)


if __name__ == "__main__":
    main()
This script creates a Lance table from randomly generated data and writes it to your MinIO bucket. If you did not create the bucket with the method in the previous section, you will need to create it before running the script, and you will need to change minio_bucket_name in the script to match the name of your bucket.
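If you expect to switch between buckets or credentials, it can be easier to read the settings from the environment than to edit the script each time. The sketch below is one way to do that. The MINIO_BUCKET variable and the apply_minio_settings helper are hypothetical names of our own, and the defaults mirror the values used in this tutorial.

```python
import os

# Defaults matching the MinIO container from this tutorial.
DEFAULTS = {
    "AWS_ACCESS_KEY_ID": "minioadmin",
    "AWS_SECRET_ACCESS_KEY": "minioadmin",
    "AWS_ENDPOINT": "http://localhost:9000",
    "AWS_DEFAULT_REGION": "us-east-1",
}


def apply_minio_settings(environ, defaults=DEFAULTS):
    """Fill in any missing settings and return the S3 URI for the bucket.

    Values already present in environ win over the defaults.
    MINIO_BUCKET is a hypothetical variable name chosen for this sketch.
    """
    for key, value in defaults.items():
        environ.setdefault(key, value)
    bucket = environ.get("MINIO_BUCKET", "lance")
    return f"s3://{bucket}/"


# Demonstrated on a plain dict; pass os.environ to affect the real process.
example_env = {"MINIO_BUCKET": "my-vectors", "AWS_ACCESS_KEY_ID": "alice"}
print(apply_minio_settings(example_env))
print(example_env["AWS_ACCESS_KEY_ID"], example_env["AWS_ENDPOINT"])
```

The returned URI can then be passed to lancedb.connect in place of the hard-coded db_uri.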
Finally, the script searches the table for the five vectors nearest to [5, 5], without moving the data out of MinIO, and prints the results as a Pandas DataFrame.
The result of the script should look similar to the one below. Remember that the data itself is randomly generated each time.
                   vector      item  price  _distance
0  [5.1022754, 5.1069164]   item_95  50.94   0.021891
1   [4.209107, 5.2760105]  item_100  69.34   0.701694
2     [5.23562, 4.102992]   item_96  99.86   0.860140
3   [5.7922664, 5.867489]   item_47  56.25   1.380223
4    [4.458882, 3.934825]   item_93   9.90   1.427407
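The _distance column is worth a closer look. In the sample output above, the values match the squared Euclidean (L2) distance between each vector and the query [5, 5]. The sketch below recomputes the first three rows by hand to show this. It is an observation about this sample, so check the LanceDB documentation if you change the distance metric.

```python
def squared_l2(a, b):
    """Return the squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = [5, 5]

# (vector, reported _distance) pairs copied from the sample output above.
rows = [
    ([5.1022754, 5.1069164], 0.021891),
    ([4.209107, 5.2760105], 0.701694),
    ([5.23562, 4.102992], 0.860140),
]

for vector, reported in rows:
    computed = squared_l2(query, vector)
    print(f"computed={computed:.6f} reported={reported:.6f}")
```

Because the results are ordered by this value, the first row is the closest match to the query.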
There are many ways to build on this foundation offered in this tutorial to create performant, scalable and future-proofed ML/AI architectures. You have two cutting-edge and open-source building blocks in your arsenal – MinIO object storage and the LanceDB vector database – consider this your winning ticket to the ML/AI
Don’t stop here. LanceDB offers a wide range of
Please show us what you’re building and should you need guidance on your noble quest don’t hesitate to email us at [email protected] or join our round table on