Do You Speak Vector? Understanding the Language of LLMs and Generative AI

Written by datastax | Published 2023/07/24
Tech Story Tags: ai | vector-search | llms | generative-ai | apache-cassandra | vector-database | artificial-intelligence | good-company

TL;DR: Read this to get a clearer understanding of vectors, vector search, and the kinds of databases that can store and query vectors.

We’re fond of saying that there’s no artificial intelligence without data. But it can’t be just any kind of data. Take large language models, or LLMs: deep learning models, like OpenAI’s GPT-4, that can generate text that’s quite similar to what a human would write.

For LLMs to “understand” words, those words need to be stored as “vectors”: a way of capturing words’ meanings and usage patterns with numbers. Vectors are, you might say, the lingua franca of AI.

Vectors have been around for a while, but with the popularity and accessibility of generative AI interfaces like ChatGPT, they’ve become a hot topic, particularly because many of the most popular apps organizations will build with these technologies will leverage their own private data for LLMs by composing their own vectors.

But how do they work, how are they stored, how do applications search for them, and how do they help make AI possible? Let’s dig into vectors, vector search, and the kinds of databases that can store and query vectors.

Vectors

A vector is a numeric representation of the attributes of a piece of data. Each data point is represented as a vector with multiple numerical values, where each value corresponds to a specific feature or attribute of the data.

When you transform data like an image or text into a vector representation, it’s known as an “embedding.”
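To make that concrete, here’s a minimal sketch of generating text embeddings with the open-source sentence-transformers library; the model name is just one commonly used choice, not something this article prescribes:

```python
# A minimal sketch: turn text into embedding vectors with sentence-transformers.
# The model name is one common choice, not a requirement.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dimensional vectors

sentences = [
    "How do I care for a monstera?",
    "Watering schedule for tropical houseplants",
]

embeddings = model.encode(sentences)  # one vector per sentence
print(embeddings.shape)               # (2, 384)
```

Each sentence becomes a list of 384 numbers, and sentences with similar meanings end up with similar numbers.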

The choice of image embeddings for vector search, for example, depends on various factors such as the specific use case, the available resources, and the characteristics of the image dataset.

In e-commerce or product image search applications, it can be beneficial to use embeddings trained specifically on product images; so-called instance retrieval, on the other hand, involves searching for instances of objects within a larger scene or image.

Storing data as vector representations enables you to perform various operations and calculations on the data, most importantly search. Choosing which attributes the vector captures determines the types of questions you’ll be able to ask later.

For example, if you only store information about the colors in an image with plants, you can’t then ask about the care requirements. You’ll only be able to find visually similar plants.
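As a toy illustration of why that choice matters (the field names and values below are made up for the example):

```python
# Two ways to vectorize the same plant record. What you encode determines
# what you can ask of the index later. (Toy values, made-up fields.)
plant = {
    "name": "Monstera deliciosa",
    "color_histogram": [0.10, 0.72, 0.18],   # visual features only
    "care_text": "bright indirect light, water weekly, well-draining soil",
}

# A color-only vector supports "find visually similar plants";
# it can't answer anything about care requirements.
color_vector = plant["color_histogram"]

# Embedding the care description instead (with any text-embedding model)
# captures meaning, so "low-maintenance plants" becomes an answerable query.
# care_vector = embedding_model.encode(plant["care_text"])  # embedding_model is a placeholder
```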

Vector Search

By representing data as vectors, you can leverage mathematical techniques to efficiently search and compare very large datasets without requiring an exact match.

Millions of customer profiles, or images, or articles that are represented as vectors – a list of numbers that capture each item’s key characteristics – can be combed through very quickly with vector similarity search (or “nearest neighbor search”).

Unlike traditional keyword-based search, which matches documents based on the occurrence of specific terms, vector search focuses on how similar a query is to the stored items; for instance, whether their semantic meanings are close.

This capability enables finding similar items based on their vector representations. Similarity search algorithms can measure the “distance” or similarity between vectors to determine how closely related they are.
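Here’s a toy sketch of what measuring that “distance” looks like in practice, ranking a few made-up item vectors by cosine similarity to a query vector:

```python
# A toy nearest-neighbor search: rank stored vectors by cosine similarity to a query.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-dimensional "embeddings" for three items.
items = {
    "succulent care guide": np.array([0.90, 0.10, 0.00, 0.30]),
    "orchid watering tips": np.array([0.80, 0.20, 0.10, 0.40]),
    "guitar chord chart":   np.array([0.00, 0.90, 0.80, 0.10]),
}

query = np.array([0.85, 0.15, 0.05, 0.35])  # e.g. "how often to water a cactus"

# Brute-force nearest-neighbor search: score every item, sort by similarity.
ranked = sorted(items.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
for name, _ in ranked:
    print(name)
```

The two plant-related items score highest because their vectors point in nearly the same direction as the query, even though no keywords match.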

In recommendation systems, vector search can be used to find the most similar and dissimilar items or users based on their preferences. In image processing, it enables tasks like object recognition and image retrieval.

For instance, Google, the world’s largest search engine, relies on vector search to power the backend of Google Image Search, YouTube, and other information retrieval services. (Learn more about vector search here.)

Vectors and Databases

There are standalone vector search technologies, including the likes of Elasticsearch. But vectors need to be stored in and retrieved from scalable and fast databases to deliver the responsiveness and scale demanded by AI applications.

There are a handful of databases today that offer vector search as a feature.

The main advantage of a database that enables vector search is speed. Without a vector index, a database has to compare the query vector against every single item it stores.

In contrast, integrated vector search enables a form of indexing and includes search algorithms that vastly speed up the process, making it possible to search massive amounts of data in a fraction of the time it would take a standard database.
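For a sense of what that kind of indexing looks like, here’s a minimal sketch using FAISS, one open-source approximate-nearest-neighbor library; it illustrates the general technique, not the specific mechanism any particular database uses:

```python
# A minimal approximate-nearest-neighbor index with FAISS (illustrative only;
# databases with built-in vector search use their own index implementations).
import numpy as np
import faiss

dim = 128
vectors = np.random.random((100_000, dim)).astype("float32")  # pretend embeddings

# HNSW graph index: answers nearest-neighbor queries without scanning every vector.
index = faiss.IndexHNSWFlat(dim, 32)   # 32 = graph neighbors per node
index.add(vectors)

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)  # top-5 approximate nearest neighbors
print(ids[0])
```

Instead of scanning all 100,000 vectors, the index walks a small neighborhood graph to find close matches, which is where the speedup comes from.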

In a business context, this is extremely valuable when using AI applications to recommend products similar to past purchases, identify fraudulent transactions that resemble known patterns, or flag anomalies that look dissimilar to the norm.

One example of a database that offers vector search is DataStax’s Astra DB, which is built on the highly scalable, high-throughput, open-source Apache Cassandra. Cassandra has already been proven at scale by the likes of Netflix, Uber, and Apple to power AI applications.

The addition of vector search makes Astra DB a one-stop shop for high-scale database operations.

Integrating vector search with a scalable data store like Astra DB enables calculations and ranking directly within the database, eliminating the need to transfer large amounts of data to external systems.

This reduces latency and improves overall query performance. Vector search can be combined with other indexes within Astra DB for even more powerful queries. (Learn more about vector databases here.)
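As a rough sketch of what this looks like in practice (the keyspace and table names here are made up, the connection details are simplified, and the exact CQL vector syntax depends on your Cassandra or Astra DB version), a vector-enabled table and an approximate-nearest-neighbor query via the DataStax Python driver might look something like this:

```python
# Rough sketch of vector search in CQL via the DataStax Python driver.
# Assumptions: a reachable cluster with vector-search support (e.g. Astra DB),
# a keyspace named "demo", and made-up table/column names.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # for Astra DB, connect with a secure connect bundle instead
session = cluster.connect("demo")

session.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id uuid PRIMARY KEY,
        name text,
        embedding vector<float, 3>   -- tiny dimension just for readability
    )
""")
session.execute(
    "CREATE CUSTOM INDEX IF NOT EXISTS products_embedding_idx "
    "ON products (embedding) USING 'StorageAttachedIndex'"
)

# ANN query: rank rows by similarity to the query vector, directly in the database.
rows = session.execute(
    "SELECT name FROM products ORDER BY embedding ANN OF [0.1, 0.2, 0.3] LIMIT 5"
)
for row in rows:
    print(row.name)
```

The ranking happens inside the database, so only the top few matches ever leave it.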

The Growing Importance of Vector Search

Vectors and the databases that store them play a big role in enabling efficient search, similarity calculations, and data exploration in the field of AI.

As organizations scale their generative AI efforts and look to customize the end-user experience with their data, vector representations and the ability to work with scalable, fast databases that are vector-search enabled will become increasingly critical.

By Dr. Charna Parkey, DataStax

