In the rapidly evolving world of AI, large language models (LLMs) have come a long way, boasting impressive knowledge of the world around us. Yet even the most capable models have limits to what they know.
LLMs are computer models capable of comprehending and generating human-like text. They're the AI behind your digital assistant, autocorrect function, and even some of your emails. Their knowledge of the world is often immense, but it isn't perfect. Just like humans, LLMs can reach the limits of their knowledge but, instead of stopping, they tend to make educated guesses or “hallucinate” to complete the task. This can lead to results that contain inaccurate or misleading information.
In an ideal world, the answer would be to provide the model with relevant proprietary information at the exact time it's needed, right when the query is made. But determining what information is "relevant" isn't always straightforward and requires an understanding of what the LLM has been asked to accomplish. This is where retrieval augmented generation (RAG) comes into play.
Embedding models, in the world of AI, act like translators. They transform text documents into a large list of numbers, through a process known as "document encoding." This list represents the LLM’s internal "understanding" of the document's meaning. This string of numbers is known as a vector: a numeric representation of the attributes of a piece of data. Each data point is represented as a vector with many numerical values, where each value corresponds to a specific feature or attribute of the data.
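To make document encoding concrete, here's a minimal sketch. It assumes the open-source sentence-transformers library and the all-MiniLM-L6-v2 model purely as examples; any embedding model or hosted embedding API plays the same role.

```python
from sentence_transformers import SentenceTransformer

# Example model choice; any embedding model or provider works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Vector search finds documents with similar meaning.",
    "LLMs can hallucinate when they lack relevant context.",
]

# encode() maps each document to a fixed-length vector of floats.
vectors = model.encode(documents)
print(vectors.shape)  # (2, 384) for this model: two documents, 384 numbers each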
While a string of numbers might seem meaningless to the average person, these numbers serve as coordinates in a high-dimensional space. In the same way that latitude and longitude can describe a location in a physical space, this string of numbers describes the original text’s location in semantic space, the space of all possible meanings.
Treating these numbers as coordinates enables us to measure the similarity in meaning between two documents. This measurement is taken as the distance between their respective points in the semantic space. A smaller distance would indicate a greater similarity in meaning, while a larger distance suggests a disparity in content. Consequently, information relevant to a query can be discovered by searching for documents "close to" the query in semantic space. This is the magic of vector similarity search.
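As a rough illustration of "distance in semantic space," the sketch below compares vectors with cosine similarity, one common similarity measure (Euclidean distance and dot product are also widely used). The vectors here are tiny, made-up examples.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means the vectors point the same way
    # (similar meaning); close to 0 means the texts are unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
query_vec = np.array([0.9, 0.1, 0.0])
doc_close = np.array([0.8, 0.2, 0.1])  # nearly the same direction as the query
doc_far = np.array([0.0, 0.1, 0.9])    # points elsewhere in semantic space

print(cosine_similarity(query_vec, doc_close))  # higher score -> more similar
print(cosine_similarity(query_vec, doc_far))    # lower score -> less similar
```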
RAG is a generative AI architecture that applies semantic similarity to automatically discover information relevant to a query.
In a RAG system, your documents are stored in a vector database (DB). Each document is indexed based on a semantic vector produced by an embedding model so that finding documents close to a given query vector can be done quickly. This essentially means that each document is assigned a numerical representation (the vector), which indicates its meaning.
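Conceptually, the index is just a mapping from each document to its vector; a production vector database adds approximate-nearest-neighbor indexing, persistence, and scale on top of that idea. Here's a toy, in-memory stand-in, again assuming the sentence-transformers library and a made-up three-document corpus:

```python
from sentence_transformers import SentenceTransformer

# Example embedding model, as in the earlier sketch.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny stand-in corpus of "proprietary" documents (made up for illustration).
documents = [
    "Our premium plan includes 24/7 support and a 99.9% uptime SLA.",
    "Refunds are processed within five business days of a request.",
    "The public API is rate limited to 1,000 requests per minute per key.",
]

# Each document is "indexed" by the semantic vector the model assigns to it.
# normalize_embeddings=True makes a plain dot product equal to cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)
print(doc_vectors.shape)  # one row (vector) per document
```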
When a query comes in, the same embedding model is used to produce a semantic vector for the query.
The system then retrieves similar documents from the DB using vector search, looking for documents whose vectors are close to the vector of the query.
Once the relevant documents have been retrieved, the query, along with these documents, is used to generate a response from the model. This way, the model doesn't have to rely solely on its internal knowledge but can access whatever data you provide it at the right time. The model is therefore better equipped to provide more accurate and contextually appropriate responses, by incorporating proprietary data stored in a database that offers vector search as a feature.
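Putting the steps together, here's a minimal end-to-end sketch of the retrieval side of RAG. The corpus, prompt template, and top_k value are illustrative, and the final generation step is left to whichever LLM client you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# Same made-up corpus as before, repeated so this sketch runs on its own.
documents = [
    "Our premium plan includes 24/7 support and a 99.9% uptime SLA.",
    "Refunds are processed within five business days of a request.",
    "The public API is rate limited to 1,000 requests per minute per key.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Embed the query with the same model, then return the closest documents."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vec          # cosine similarity per document
    best = np.argsort(scores)[::-1][:top_k]   # indices of the most similar docs
    return [documents[i] for i in best]

query = "How long do refunds take?"
context = "\n".join(retrieve(query))

# The retrieved documents are injected into the prompt sent to the LLM.
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)  # pass this prompt to your LLM of choice to generate the response
```

In a production system, the flat dot-product scan would be replaced by a vector search query against the database, but the shape of the flow is the same.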
There are a handful of so-called “vector databases” available, including __DataStax Astra DB__, for which vector search is now generally available.
In a RAG system, fine-tuning can be applied to the query encoder and result generator for optimized performance. Fine-tuning is a process where the model's parameters are slightly adjusted to better adapt it to the specific task at hand.
Fine-tuning offers many benefits for optimizing LLMs, but it also has some limitations. For one, it doesn't allow for dynamic integration of new or proprietary data. The model's knowledge remains static post-training, leading it to hallucinate when asked about data outside of its training set. RAG, on the other hand, dynamically retrieves and incorporates up-to-date and proprietary data from an external database, mitigating the hallucination issue and providing more contextually accurate responses. RAG gives you query-time control over exactly what information is provided to the model, allowing prompts to be tailored to specific users at the exact time a query is made.
RAG is also more computationally efficient and flexible than fine-tuning. Fine-tuning requires the entire model to be retrained for each dataset update, a time-consuming and resource-intensive task. Conversely, RAG only requires updating the document vectors, enabling easier and more efficient information management. RAG's modular approach also allows for the fine-tuning of the retrieval mechanism separately, permitting adaptation to different tasks or domains without altering the base language model.
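To illustrate that point: adding new knowledge to a RAG system means encoding the new document and appending its vector to the index, with no retraining of the model. Here's a toy continuation of the earlier sketch, with made-up document text:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = ["Refunds are processed within five business days of a request."]
doc_vectors = model.encode(documents, normalize_embeddings=True)

# A new (made-up) policy document arrives: no retraining of the LLM or the
# embedding model is needed, just encode the new text and append its vector.
new_doc = "Refunds for annual plans are processed within two business days."
documents.append(new_doc)
new_vec = model.encode([new_doc], normalize_embeddings=True)
doc_vectors = np.vstack([doc_vectors, new_vec])

print(len(documents), doc_vectors.shape)  # the index now covers the new document
```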
RAG enhances the power and accuracy of large language models, making it a compelling alternative to fine-tuning. In practice, enterprises tend to use RAG more often than fine-tuning.
Integrating RAG into LLMs not only improves the accuracy of their responses; it also maximizes their potential. The process enables LLMs to focus on what they excel at: intelligently generating content from a prompt. The model is no longer the sole source of information because RAG provides it with relevant proprietary knowledge when required, and the corpus of knowledge accessible to the model can be expanded and updated without expensive model-training jobs.
In essence, RAG acts as a bridge, connecting the LLM to a reservoir of knowledge that goes beyond its internal capabilities. As a result, it drastically reduces the LLM's tendency to “hallucinate” and provides a more accurate and efficient model for users.
DataStax recently announced the general availability of vector search capability in Astra DB. Learn more about it here.
By Ryan Michael, DataStax