Achieving Relevant LLM Responses By Addressing Common Retrieval Augmented Generation Challenges

Written by datastax | Published 2024/01/16

TL;DR: We look at common problems that can arise with RAG implementations and LLM interactions.

Developing generative AI applications with retrieval augmented generation (RAG) and vector databases can be challenging. You’ll often need to troubleshoot RAG implementations that rely on vector databases to ensure that relevant context is retrieved; this is critical, because that context is then included in a prompt to a large language model to provide more accurate results.

In the previous article in this series, we looked at the embedding pipeline, which populates the vector database with embeddings, and considered three areas that can lead to poor results: suboptimal embedding models, inefficient chunking strategies and lack of metadata filtering.

Here, we’ll look at the actual interaction with the LLM and examine two common problems that can lead to poor results: imprecise prompts and insufficient generation approaches.

Prompt refinement

Even the best embedding and chunking strategies can’t replace the need for quality prompt engineering. This involves making the prompts more explicit, specific and aligned with the intended output. Varied prompt formats, lengths and lexical choices should be tested to fine-tune the RAG process.

Specifically, there are several things you should consider when building your prompts for RAG applications. These include:

Tell the LLM its role: LLM agents like ChatGPT default to acting like a helpful chatbot, but you can change the nature of the responses they generate by instructing the LLM to act in a particular way. Examples include “you are an attorney evaluating whether either party is in breach of an agreement” or “you are a customer service agent for an internet service provider; your job is to help people troubleshoot their internet issues,” or whatever makes sense for your particular situation.
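For example, here’s a minimal sketch of setting a role with a system message using the OpenAI Python client; the model name and the exact role wording are illustrative assumptions, not requirements:

```python
# A minimal sketch of role prompting via a system message.
# Assumptions: the OpenAI Python client is installed, OPENAI_API_KEY is set,
# and the model name is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat-capable model works here
    messages=[
        {
            "role": "system",
            "content": (
                "You are a customer service agent for an internet service provider; "
                "your job is to help people troubleshoot their internet issues."
            ),
        },
        {"role": "user", "content": "My router keeps dropping the connection every few minutes."},
    ],
)
print(response.choices[0].message.content)
```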

Explicitly tell the LLM to use the provided context: Make it clear to the LLM that you are providing context and that you want the generated response to reflect that context. You can do this by saying something like “your response should take into account the following context,” followed by the context.

Use examples: In the just-mentioned scenario of asking the LLM to act as an attorney evaluating contracts, you may want to include several examples in the prompt. For instance, you could provide an example of a contract that says payment is due within 30 days, but the buyer didn’t send their payment until 40 days after the contract was signed and therefore violated the agreement. You could provide additional examples of rights to remedy and possible ways to resolve contract violations.
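A minimal sketch of a prompt that combines both of these tips, pointing the model at the retrieved context and including one worked example; the function name, variable names and example text are hypothetical, based on the contract scenario above:

```python
# A sketch of assembling a RAG prompt that tells the LLM to use the retrieved
# context and includes one worked example. The helper name and example text
# are hypothetical, based on the contract scenario described above.
def build_contract_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)
    example = (
        "Example: The contract requires payment within 30 days of signing. "
        "The buyer paid 40 days after signing, so the buyer violated the agreement."
    )
    return (
        "You are an attorney evaluating whether either party is in breach of an agreement.\n\n"
        "Your response should take into account the following context:\n"
        f"{context}\n\n"
        f"{example}\n\n"
        f"Question: {question}"
    )
```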

Specify the output format: If your use case requires specific output, you can specify the format that the generated output should adhere to. You can combine this technique with the previous tip to make it clear to the LLM both how you want it to respond and which key points of information you’re expecting in the generated response.
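For instance, a sketch of format instructions that could be appended to the prompt built in the earlier sketch; the JSON field names are illustrative assumptions, not a required schema:

```python
# A sketch of output-format instructions; the JSON field names are
# illustrative assumptions, not a required schema.
format_instructions = (
    "Respond only with JSON in exactly this shape:\n"
    '{"in_breach": <true or false>, "clause": "<clause cited>", "explanation": "<one-sentence reason>"}'
)

# Appended to the prompt from the earlier sketch:
# prompt = build_contract_prompt(question, retrieved_chunks) + "\n\n" + format_instructions
```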

Use chain of thought: For use cases where reasoning is required to determine the appropriate response, consider using a technique called chain of thought to spell out the steps you want the LLM to follow to arrive at a generated response. For example, in the legal contracts case, you may want to walk the LLM through the logical steps a person would follow to determine whether a contract term has been violated: first look for clauses that specify payment terms, then identify how much time the buyer had to submit payment, then calculate the number of days between the date the contract was signed and the date payment was received. If the payment took longer than the agreed-upon time frame, the buyer violated the agreement.
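Continuing the contract example, here’s a sketch of spelling those steps out in the prompt; the exact wording is an assumption based on the scenario above, not a prescribed formula:

```python
# A sketch of chain-of-thought instructions for the contract scenario.
# The step wording is an assumption, not a prescribed formula.
chain_of_thought = (
    "Reason through the following steps before answering:\n"
    "1. Find the clauses that specify payment terms.\n"
    "2. Identify how many days the buyer had to submit payment.\n"
    "3. Calculate the number of days between the contract signing date and the date payment was received.\n"
    "4. If that number exceeds the agreed-upon time frame, conclude that the buyer violated the agreement."
)
```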

Leveraging these techniques to improve your prompt engineering can have a significant impact on the quality of the generated results you’re able to produce in your RAG applications. However, sometimes you will need to use techniques that involve multiple exchanges with the LLM to achieve an acceptable response.

Advanced Patterns

FLARE

Forward-looking active retrieval, or FLARE, is an example of a multi-query RAG technique that involves iteratively calling the LLM with custom instructions in your prompt asking the LLM to provide additional questions about key phrases that would help it generate a better answer. Once the LLM has context with no gaps, it terminates with the final response. The process adds a loop between the LLM and the AI agent (the AI chatbot) to facilitate these iterations.

You can see how FLARE works in the LangChain cookbook’s FLARE example.
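Condensed from that cookbook example, the loop looks roughly like this; the class and parameter names may differ across LangChain versions, and the retriever is assumed to already be configured against your vector database:

```python
# A condensed sketch based on the LangChain cookbook FLARE example.
# Assumptions: `retriever` is an existing vector-store retriever, and the
# FlareChain API matches the LangChain version used in the cookbook.
from langchain.chains import FlareChain
from langchain.chat_models import ChatOpenAI

# `retriever` is assumed to come from an existing vector store, e.g.:
# retriever = vector_store.as_retriever()

flare = FlareChain.from_llm(
    ChatOpenAI(temperature=0),
    retriever=retriever,      # retrieves context for the follow-up queries
    max_generation_len=164,   # how much to generate before checking confidence
    min_prob=0.3,             # low-probability spans trigger another retrieval round
)

answer = flare.run("Which party, if any, is in breach of the agreement?")
```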

RAG Fusion

By generating queries similar to the user’s query and retrieving relevant context for both the original query and the generated variations, we increase the likelihood of selecting the most useful context for generating accurate results. This process is called “RAG fusion.”
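A minimal sketch of the first two steps, generating query variations and retrieving context for each; `llm` and `vector_search` are hypothetical stand-ins for your LLM client and vector database, not a specific library API:

```python
# A sketch of the multi-query step in RAG fusion. The llm() and vector_search()
# callables are hypothetical stand-ins for your LLM client and vector database.
def generate_similar_queries(llm, user_query: str, n: int = 3) -> list[str]:
    prompt = (
        f"Write {n} alternative phrasings of the following question, one per line:\n"
        f"{user_query}"
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def retrieve_for_all_queries(vector_search, user_query: str, similar_queries: list[str], top_k: int = 5) -> list[list[str]]:
    # Run an ANN search for the original query and each generated variation,
    # returning one ranked list of document IDs per query.
    return [vector_search(query, top_k=top_k) for query in [user_query, *similar_queries]]
```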

The key step is to use reciprocal rank fusion to combine and re-rank the ANN (approximate nearest neighbor) results from each query, identifying the most relevant context to use when generating a response.
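A sketch of that fusion step, assuming each ANN search returns a best-first list of document IDs; k=60 is a commonly used constant for reciprocal rank fusion:

```python
# A sketch of reciprocal rank fusion (RRF) over the ranked result lists
# produced above. Each list is assumed to be ordered best-first and to
# contain document IDs; k=60 is a commonly used RRF constant.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the top documents become the context
    # passed to the LLM to generate the final response.
    return sorted(scores, key=scores.get, reverse=True)
```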

Conclusion

RAG is an approach that’s simple to get started with, but often leaves developers frustrated with unsatisfactory results for unclear reasons. The relevance of outputs in RAG-enabled generative AI applications can be affected by several factors. By meticulously selecting embedding models, formulating chunking strategies and designing prompts, you can significantly elevate the quality and precision of generated responses in systems powered by LLMs like ChatGPT. Hopefully these tips will help you create more useful RAG applications that deliver the experience and value you’re aiming for.

Try DataStax Astra DB, the only vector database for building production-level AI applications on real-time data.

By Chris Latimer, DataStax


