There’s no artificial intelligence without data. And when your data is scattered all over the place, you’ll spend more time managing the implementation process than focusing on what’s most important: building the application. Many of the world’s most prominent applications already use Apache Cassandra, so improving data efficiency there is an increasingly important goal. AI is all about scale, and bringing vector search (a key component in using AI models) into Cassandra will help organizations cut costs, streamline their data management and get more value from their data.
This cutting-edge feature, recently outlined in a Cassandra enhancement proposal (
The concept of text search has been around for a long time. It involves searching for a particular keyword within documents. But important data can be found in more than just text: audio, images, and video (or some combination) also contain relevant information that requires a search method. That’s where vector search comes in. It’s been in use
Vector search, also known as vector similarity search, requires two parts to elevate your search game.
First, the raw data must be indexed into a vector representation (an array of numbers) that serves as a mathematical description.
Second, the vector data needs to be stored in a way that lets developers ask, “Given one thing, what other things are similar?” It’s simple and powerful for developers but challenging to implement at scale on the server side. This is where Cassandra will really shine by consistently serving data at any scale around the world with resilience that grants peace of mind.
This isn’t meant to be a deep dive into vector search. It’s an explanation of what vector search can do for your application: it creates an entirely new dimension of useful data, which reduces code complexity and helps you get into production faster with features users want.
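To make the two parts concrete, here is a minimal Python sketch. It is not Cassandra code. It assumes the vectors already exist (the three-dimensional values and the `catalog` names are made up for illustration) and answers the “what other things are similar?” question by brute force, using cosine similarity as one common similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, items, k=2):
    """Return the names of the k items whose vectors are most similar to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical 3-dimensional vectors; real embeddings usually have far more dimensions.
catalog = {
    "running shoe": [0.9, 0.1, 0.0],
    "trail shoe": [0.8, 0.2, 0.1],
    "coffee mug": [0.0, 0.1, 0.9],
}

print(most_similar([0.95, 0.05, 0.0], catalog))
```

Comparing the query against every stored vector works for three items, but the cost grows with every row added, which is why doing this at scale on the server side is the hard part.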
Real-world practical examples of vector search include:
Content-based image retrieval, where visually similar images are identified based on their feature vectors. Using a library like
Recommender systems, where products or content are recommended to consumers based on similarity to items they have previously interacted with.
Natural language processing applications, where semantic similarities between textual content can be identified and leveraged for tasks such as sentiment analysis, document clustering, and topic modeling. This is typically done using tools like
Use ChatGPT? Vector search is critical for the Large Language Model (LLM) use case as it enables efficient storage and retrieval of vector embeddings, representing the distilled knowledge gained during the LLM training process. By performing similarity searches, vector search can quickly identify the most relevant embeddings corresponding to a user's prompt.
This helps LLMs generate more accurate and contextually appropriate responses while also providing a form of long-term memory for the models. In essence, vector search is a vital bridge between LLMs and the vast knowledge bases on which they are trained.
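As a rough illustration of that retrieval step, the sketch below picks the stored text most similar to a user's question and places it in front of the prompt. Everything here is an assumption for demonstration: `embed` is a toy bag-of-words stand-in for a real embedding model, and `build_prompt` and the sample documents are hypothetical names and data, not part of any LLM or Cassandra API.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def build_prompt(question, documents, k=1):
    """Prepend the k documents most similar to the question as context."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: similarity(q, embed(d)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}"

documents = [
    "Cassandra stores data across many nodes",
    "Vector search finds similar items",
    "Espresso is brewed under pressure",
]

prompt = build_prompt("how does vector search find similar things", documents)
print(prompt)
```

In a real application the embedding would come from a trained model and the ranking would be done by the database rather than in application code.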
The Cassandra project is on a never-ending quest to make Cassandra the ultimate powerhouse in the database universe. As previously mentioned, after you convert your data into vector embeddings, you’ll need a place to store and use them. Those capabilities are being added to Cassandra, exposed in a simple yet powerful way.
To support the storage of high-dimensional vectors, we’re introducing a new data type, `VECTOR<type, dimension>`. This will enable the handling and storage of
CREATE TABLE product (
id UUID PRIMARY KEY,
name varchar,
description varchar,
item_vector VECTOR<float, 3>
);
We will add a new storage-attached index (SAI) called “VectorMemtableIndex,” which will accommodate the approximate nearest neighbor (ANN) search functionality. This index will work in conjunction with the new data type and Apache Lucene's Hierarchical Navigable Small World (HNSW) library to enable efficient vector search capabilities within Cassandra.
CREATE CUSTOM INDEX item_ann_index ON product(item_vector)
USING 'VectorMemtableIndex';
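For intuition about why a graph-based index gives approximate answers quickly, here is a deliberately simplified Python sketch of greedy search over a proximity graph. It is not Lucene's HNSW implementation: real HNSW adds multiple layers and keeps a list of candidates during the search, while this version uses a single layer, a brute-force graph build, and hypothetical function names and sample points.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_neighbor_graph(points, m=2):
    """Link every point to its m closest points (brute force, illustration only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: distance(p, points[j]),
        )
        graph[i] = others[:m]
    return graph

def greedy_search(points, graph, query, entry=0):
    """Walk the graph, always moving to the neighbor closest to the query.

    The walk stops at a local minimum, so the result is approximate:
    it is usually, but not always, the true nearest neighbor.
    """
    current = entry
    while True:
        best = min(graph[current], key=lambda j: distance(points[j], query))
        if distance(points[best], query) >= distance(points[current], query):
            return current
        current = best

# Hypothetical 2-dimensional points laid out along a line.
points = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
graph = build_neighbor_graph(points)
nearest = greedy_search(points, graph, [4.2, 0.3])
print(nearest, points[nearest])
```

The search only examines the neighbors of the points it visits instead of every stored vector, which is the trade an ANN index makes: far less work per query in exchange for a result that may occasionally miss the exact nearest neighbor.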
To make it easier for users to perform ANN searches on their data, we will introduce a new Cassandra Query Language (CQL) operator, ANN OF. This operator will allow users to efficiently perform ANN searches on their data with a simple and familiar query syntax. Continuing the example, developers can ask the database for something similar to a vector created from a description.
SELECT * FROM product WHERE item_vector ANN OF [3.4, 7.8, 9.1];
When Cassandra 4.0 was released, one of the easily overlooked highlights was the concept of increased pluggability. The new vector search functionality in Cassandra is built as an extension to the existing SAI framework, avoiding a rewrite of the core indexing engine. It uses the well-known and widely used
This addition highlights Cassandra 4's modularity and extensibility. With the integration of Lucene's HNSW library and the expansion of the SAI framework, developers can now access a wide range of production-ready features much faster. Developers have access to numerous vector databases, many of which built a vector indexing engine first and added storage afterward. Cassandra, by contrast, has tackled the challenging issue of data storage at scale for over a decade. We are highly confident that including vector search in Cassandra will provide even more production-ready features.
Cassandra isn’t new to machine learning and AI workloads. Long-time users have relied on Cassandra as a fast and efficient feature store for years. It’s even rumored that OpenAI uses Cassandra heavily in the building of LLMs. These use cases all employ Cassandra’s existing functionality. There will be many ways to use the new vector search. It will be exciting to see what our community comes up with, but the use cases will likely fit into two categories:
If you already have an application built on Cassandra, you can enhance its capabilities by incorporating ANN (“approximate nearest neighbor”) search. For instance, if you have a content recommendation system, you can use ANN search to find similar items and improve the relevance of your recommendations. Product catalogs can denormalize features into embedded vectors stored in the same record. Fraud detection can be further enhanced by mapping behaviors to features. Think of a use case and it is probably relevant.
If you’re starting a new project that requires fast similarity search capabilities, Cassandra's new vector search feature will be an excellent choice for data storage and retrieval. Knowing you can go from gigabytes to petabytes on the same system will let you focus on building your application instead of worrying about tradeoffs. In addition to storing vector embeddings, you’ll have the full power of CQL and the tabular storage of a full-featured database.
However you consume Cassandra, these options will all be available. If it’s your own deployment using open source Cassandra, deployed in Kubernetes using
As we continue to innovate and expand the capabilities of Cassandra, we remain committed to staying at the forefront of what you need in data management. The introduction of vector search is an exciting new use case that will make your data-driven applications even more powerful and versatile. This, with some of the other cutting-edge features like distributed
We are confident that this addition will help not only AI developers but also organizations managing large data sets that can benefit from fast similarity search. So keep an eye out for the alpha release of Cassandra with vector search functionality, slated for sometime in Q3. We look forward to seeing the fantastic applications you'll build with this new feature, and we’d love it if you shared your use cases with the community at
Also published by Patrick McFadin, DataStax here.