
Vector Databases: Getting Started with ChromaDB and More

by Shyam Ganesh S, 2024/02/13

Too Long; Didn't Read

In this article, we explore another well-known vector store called ChromaDB. ChromaDB is an open-source vector store used for storing and retrieving vector embeddings. We will create a vector store, add a collection, add text to the collection, and run a query search with and without meta-filtering.

Hello, readers; I would like to extend a warm welcome to all of you for reading my blog post. This is my second post on vector databases. In my previous blog post, I covered the concepts of vectors, vector databases, and vector search. I also provided a practical demonstration of the FAISS vector store using Python. Similarly, in this article, we will delve into another popular vector store known as ChromaDB.


Introduction to ChromaDB

In this part of the article, we will explore what ChromaDB is, its features, and how it works in practice.

What is ChromaDB?

ChromaDB is an open-source vector store used for the storage and retrieval of vector embeddings. Its primary purpose is to store embeddings and associated metadata for later use by large language models. It can also be used to build semantic search engines that operate on text data.


What are the features of ChromaDB?

Earlier ChromaDB releases offered several storage backends, such as DuckDB for standalone use or ClickHouse for scalability; more recent releases store data in SQLite instead. ChromaDB provides Python and JavaScript/TypeScript SDKs and emphasizes simplicity, speed, and analysis capabilities.

ChromaDB can store vectors with additional metadata and allows for filtering during the query search on the vector database.

ChromaDB provides three options for where the vector embeddings are stored:

  1. In memory: the database lives only in RAM and is lost when the process exits.
  2. On the local file system: the database is saved to disk and loaded into memory when needed.
  3. On a server: the database is hosted on a server machine, and clients send a request to the server for each query.
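The three options above map onto three client constructors in the Python SDK. Here is a minimal sketch, assuming a ChromaDB release from 0.4 onward; the helper name make_client, the ./chroma_data directory, and the localhost:8000 address are placeholders chosen for illustration and are not part of the library.

```python
def make_client(mode="memory", path="./chroma_data", host="localhost", port=8000):
    """Return a ChromaDB client for one of the three storage options.

    mode="memory": data lives only in the current process.
    mode="disk":   data is persisted under `path` (placeholder directory).
    mode="server": requests go to a Chroma server at host:port (placeholder address).
    """
    if mode not in ("memory", "disk", "server"):
        raise ValueError(f"unknown mode: {mode!r}")

    # Imported here so the mode check above works even without the package.
    import chromadb

    if mode == "memory":
        return chromadb.Client()
    if mode == "disk":
        return chromadb.PersistentClient(path=path)
    return chromadb.HttpClient(host=host, port=port)
```

The rest of this article uses the in-memory option, which corresponds to make_client("memory").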

How does ChromaDB work?

  • The first step is to create a collection, which is similar to a table in a relational database. By default, ChromaDB uses the all-MiniLM-L6-v2 model to generate embeddings. The embedding model can be changed to suit the user's requirements.
  • Documents can be added to the collection, and if they are in text format, ChromaDB automatically converts them into embeddings using the collection's embedding model. Metadata is optional when adding a document, but a unique document ID must be provided for identification.
  • After creating and populating the collection, we can query the vector store to retrieve similar documents.
  • Restricting a query to documents whose metadata matches a condition is referred to as meta-filtering, and ChromaDB supports it as an option.
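To make these four steps concrete, here is a toy, pure-Python sketch of what a collection does conceptually. It is not how ChromaDB is implemented: toy_embed (word counts) stands in for a real embedding model such as all-MiniLM-L6-v2, the cosine distance is an assumption made for the sketch, and ToyCollection is a hypothetical class.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: word counts. A real model returns dense vectors."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors (1.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class ToyCollection:
    def __init__(self):
        self._rows = {}  # id -> (document, embedding, metadata)

    def add(self, documents, ids, metadatas=None):
        """Embed each document and store it under its id with its metadata."""
        metadatas = metadatas or [{} for _ in documents]
        for doc, doc_id, meta in zip(documents, ids, metadatas):
            self._rows[doc_id] = (doc, toy_embed(doc), meta)

    def query(self, query_text, n_results=3, where=None):
        """Return the n_results closest (distance, id, document) tuples.

        `where` is an exact-match metadata filter (meta-filtering).
        """
        query_vec = toy_embed(query_text)
        candidates = [
            (cosine_distance(query_vec, emb), doc_id, doc)
            for doc_id, (doc, emb, meta) in self._rows.items()
            if not where or all(meta.get(k) == v for k, v in where.items())
        ]
        return sorted(candidates)[:n_results]
```

The filter is applied before ranking, so a filtered query only ever returns documents whose metadata matches.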


In the next part, we will use Chroma and all-MiniLM-L6-v2 to create our own vector DB.


Getting started with ChromaDB

In this section, we will create a vector store, add collections, add text to the collection, and perform a query search with and without meta-filtering using in-memory ChromaDB.


As the first step, we install the ChromaDB package.

!pip3 install chromadb


Once the package is installed, try importing the required packages into your working notebook.

import chromadb
import pandas as pd
import numpy as np


After importing the packages, we will have a look at our dataset, which we will index in our vector store.

passage_data = pd.read_csv("/Users/Python_Programs/Text_Similarity/Stanford_Edited.csv")
passage_data


For this tutorial, I have selected the Stanford dataset.

The dataset has 37,792 rows of data and 3 columns named:

Id: Unique id to represent each document in the data store.

Content: The content of each document.

Language: The language in which the content is written (for filtering purpose).


[Figure: the dataset]


For meta-filtering, we use the Language column of the dataset, where en_US denotes English and ru_RU denotes Russian. The data count in each of these labels is as follows:

[Figure: language count in dataset]


Having explored the dataset, it is time to prepare the data in the format that ChromaDB accepts.

To add a set of documents to the Chroma index, we need three lists of equal length:

documents - list of documents (text)

metadatas - list of dictionaries holding the metadata of each document (here, its language)

ids - list of unique ids to represent the documents (ids should be in String format)

documents = passage_data['Content'].tolist()
metadatas = []
ids = []

for idx in range(len(passage_data)):
    metadatas.append({"Language": passage_data['Language'][idx]})
    ids.append(str(passage_data['Id'][idx]))
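Before handing the three lists to ChromaDB, it can help to check the requirements stated above: equal lengths, string ids, and no duplicate ids. The helper below is a small sketch for that purpose; check_add_inputs is a hypothetical name and not part of ChromaDB.

```python
def check_add_inputs(documents, metadatas, ids):
    """Raise early if the three lists do not line up or the ids are unusable."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas and ids must have the same length")

    non_strings = [i for i in ids if not isinstance(i, str)]
    if non_strings:
        raise TypeError(f"ids must be strings, got e.g. {non_strings[0]!r}")

    seen, duplicates = set(), set()
    for doc_id in ids:
        if doc_id in seen:
            duplicates.add(doc_id)
        seen.add(doc_id)
    if duplicates:
        raise ValueError(f"duplicate ids: {sorted(duplicates)[:5]}")
```

Calling check_add_inputs(documents, metadatas, ids) right after the loop above either returns silently or points at the first problem found.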


The dataset and input documents with metadata and IDs are prepared. We will create a client connection to ChromaDB.

client = chromadb.Client()


Once the client is initialized, we create a new collection to hold the vectors (data) as an index.

collection = client.create_collection("Stanford_Collection")
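When a notebook cell is rerun, create_collection typically fails because a collection with that name already exists. A small sketch of a rerun-safe alternative uses the client's get_or_create_collection method; open_collection is a hypothetical wrapper name.

```python
def open_collection(client, name="Stanford_Collection"):
    """Return the named collection, creating it only if it does not exist yet."""
    return client.get_or_create_collection(name=name)
```

With the client created above, collection = open_collection(client) can then be run any number of times.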


It is time to add the documents to the index.

collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)
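With 37,792 rows, a single add call embeds everything in one go, and some ChromaDB versions reject calls above a maximum batch size. A hedged sketch of adding the same data in slices follows; add_in_batches is a hypothetical helper, and the default of 5000 is an arbitrary choice, not a documented limit.

```python
def add_in_batches(collection, documents, metadatas, ids, batch_size=5000):
    """Call collection.add once per slice of at most batch_size items.

    Returns the number of add calls made.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    calls = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        calls += 1
    return calls
```

Usage would be add_in_batches(collection, documents, metadatas, ids) in place of the single call above.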


There are two approaches to querying the index:

  1. Querying without any filters.
  2. Querying with filters (meta-filtering).


We will now try both approaches.

# 1. query without any filters
results = collection.query(
    query_texts=["University of Architecture"],
    n_results=3,
)
print(results)

[Figure: result of query without filters]

The output displays the three documents that are most similar to the query text ("University of Architecture"). It includes the IDs, distances, metadata, and text of these documents.
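The printed dictionary is nested: each of the keys ids, distances, metadatas, and documents holds one inner list per query text. The sketch below flattens the hits for one query into a list of rows that is easier to read; rows_for_query is a hypothetical helper, and it assumes the default result fields are present.

```python
def rows_for_query(results, query_index=0):
    """Turn one query's hits into a list of dicts, closest match first."""
    ids = results["ids"][query_index]
    distances = results["distances"][query_index]
    metadatas = results["metadatas"][query_index]
    documents = results["documents"][query_index]
    return [
        {"id": doc_id, "distance": dist, "metadata": meta, "document": doc}
        for doc_id, dist, meta, doc in zip(ids, distances, metadatas, documents)
    ]
```

For example, looping over rows_for_query(results) and printing row["id"] and row["distance"] gives one line per hit.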


Now, we will try a query with a language filter.

results = collection.query(
    query_texts=["University of Architecture"],
    n_results=3,
    where={"Language": "en_US"}, # optional filter
)
print(results)

[Figure: result of query with filters]

The output displays three documents that are in the English language for the given query. This is how meta-filtering works.
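The where argument is a plain dictionary, so it can be built programmatically. The sketch below assumes a ChromaDB release that supports the $in operator in metadata filters; language_filter is a hypothetical helper name.

```python
def language_filter(*languages):
    """Build a `where` filter on the Language metadata field.

    No language -> None (no filtering); one -> equality; several -> $in.
    """
    if not languages:
        return None
    if len(languages) == 1:
        return {"Language": languages[0]}
    return {"Language": {"$in": list(languages)}}
```

Passing where=language_filter("en_US") reproduces the query above, while where=language_filter("en_US", "ru_RU") would accept documents in either language.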


A few other functionalities

Having seen the main functionalities of ChromaDB, we will now explore a few other functionalities in short.


To update a document in the Chroma Index:

collection.update(
    ids=["2"],
    documents=["This document contains information about medical college"],
    metadatas=[{"Language": "en_US"}],
)

There is a slight difference between update() and upsert() in ChromaDB.

In short, if you try to update a document that does not exist in the index:

update() does not add it (depending on the ChromaDB version, the call raises an exception or logs an error and is ignored), whereas upsert() inserts the document as a new vector into the index.
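To see which of the two cases applies to a given ID, one option is to look the ID up with get() first and then call upsert(). This is a minimal sketch; update_or_insert is a hypothetical helper, and the two calls are not atomic.

```python
def update_or_insert(collection, doc_id, document, metadata):
    """Upsert one document and report whether the id already existed."""
    existing = collection.get(ids=[doc_id])
    already_there = doc_id in existing["ids"]

    collection.upsert(
        ids=[doc_id],
        documents=[document],
        metadatas=[metadata],
    )
    return "updated" if already_there else "inserted"
```

The return value is "updated" when the ID was already present and "inserted" otherwise.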


To delete a document from the Chroma Index:

collection.delete(ids = ['1'])


To get the count of documents present in an index, we use the following:

print(collection.count())


To get the documents present in the collection, we use:

print(collection.get())
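On a collection with tens of thousands of documents, printing everything at once is unwieldy. The sketch below reads the collection page by page using the limit and offset arguments of get(); iter_pages is a hypothetical helper, and the page size of 1000 is arbitrary.

```python
def iter_pages(collection, page_size=1000):
    """Yield the collection's contents in pages of at most page_size documents."""
    offset = 0
    while True:
        page = collection.get(
            limit=page_size,
            offset=offset,
            include=["documents", "metadatas"],
        )
        if not page["ids"]:
            break
        yield page
        offset += len(page["ids"])
```

For a quick look, next(iter_pages(collection, page_size=5)) returns only the first five documents.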


To modify the name of the existing collection:

collection.modify(name="new_collection_name")


To delete or drop an index in ChromaDB, we use:

client.delete_collection(name="new_collection_name")


Conclusion

This article has presented an introduction to Chroma, an open-source vector database. We discussed how to create collections in Chroma and add documents to them. We also examined how ChromaDB generates vector embeddings from the data. Lastly, we learned how to retrieve relevant information from a specific collection in the vector store for a given query. I hope this blog post has helped you explore another vector database and understand the concepts clearly. I look forward to seeing you in my next blog post.


Until then, happy learning 👍🏻.