I built a natural language chatbot reporting system powered by Retrieval Augmented Generation (RAG). It gave users a great experience: they could query the data and get reports sliced and diced by whichever dimensions they wanted. It was intuitive and efficient. It turned tons of data into liquid gold.
Then, the LLM API bill arrived.
That's when I realized that in my quest to build this intelligence, I had created a financial black hole.
In today’s IT world, every team is looking to build an Agentic AI system. In each environment, these systems make hundreds or thousands of LLM calls, starting with development and evaluation, then user acceptance testing, and finally production.
To generate grounded answers, we implement Retrieval Augmented Generation (RAG): we query vector stores for similar content and pass the retrieved data to the LLM along with the user’s question, prompts, few-shot examples, guardrail instructions, structured response schemas, and so on. For a simple question, we therefore add hundreds of extra tokens in the name of context, guardrails, and few-shot prompting. The LLM parses these tokens and generates a response, which may be the answer to our question or a structured output for one step in a multi-agent workflow. Executing a single workflow in which multiple agents invoke multiple tools can consume thousands of tokens. All of this adds up to high token usage and high cost, which raises questions about Return on Investment (ROI).
My system pulls data from a vector DB and a knowledge graph to ground the answer. I observed around ~2,000 input tokens in the context window, after which the LLM generates a thoughtful 500-token response. At first glance, that’s just a few cents. But at 200,000 inquiries a day, those "cents" turn into a monster, and that is before adding the usage from multiple users during the development and evaluation phases. The current cost per million input and output tokens looks cheap because we are in the early stages of LLM API adoption and service providers have deliberately offered deep discounts. As these prices rise, the overall cost of developing, running, and maintaining these systems will increase drastically.
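To make the math concrete, here is a back-of-the-envelope sketch using the token counts above. The per-token prices are illustrative assumptions, not any specific provider’s actual rates:

```python
# Illustrative prices only (actual provider pricing varies over time):
# $3 per 1M input tokens, $15 per 1M output tokens.
INPUT_PRICE_PER_M = 3.0
OUTPUT_PRICE_PER_M = 15.0

input_tokens, output_tokens = 2000, 500
requests_per_day = 200_000

cost_per_request = (input_tokens * INPUT_PRICE_PER_M
                    + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
daily_cost = cost_per_request * requests_per_day

print(f"${cost_per_request:.4f} per request, ${daily_cost:,.0f} per day")
```

A fraction of a cent per request quietly becomes thousands of dollars per day at this volume, which is exactly the monster described above.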
You can also refer to a couple of articles citing the challenges around LLM cost:
- The LLM Pricing War Is Hurting Education—and Startups
- How scaling enterprise AI with the wrong LLM could cost you
I have built RAG-based systems using LangGraph, Neo4j, ChromaDB, and Redis. In this article, I present an architecture pattern named Semantic Cache, along with implementation details, to reduce token burn in production. The pattern cuts LLM usage cost by caching answers and serving them for similar questions instead of invoking the LLM every time. You can apply it across multiple use cases and at multiple touch points in your Agentic AI workflow.
What is a Cache
Caches have been part of IT system architecture for ages. They started in-memory, then became distributed. Initially there were only a few offerings for on-prem or VM-based deployment; with the advent of cloud, many cloud offerings evolved, and today we have quite mature caching systems. However, these systems have always been based on key-value pairs: keys are hashed and stored in a hash-table-like data structure, and caching is an exact, key-based lookup. If your key is present, the cache returns its value; if not, the entry was either never stored or has been evicted. Caches were never meant for similarity-based lookup.
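A minimal sketch of this exact-match behaviour: the slightest rewording of the key is a miss, even when the intent is identical.

```python
# A traditional cache is an exact-key lookup: any change in the key misses.
cache = {}
cache["what is your refund policy?"] = "Refunds are issued within 14 days."

print(cache.get("what is your refund policy?"))  # hit: returns the answer
print(cache.get("how do refunds work?"))         # miss: None, despite the same intent
```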
What is a Semantic Cache
Multi-dimensional vectors, and techniques for finding them using cosine similarity and related measures, have also existed for a long time. With LLMs and RAG, however, vector embeddings and similarity search have become very popular. Vector RAG is one of the primary solution choices for RAG-based systems because it helps find similar content. This is not an exact key lookup; instead, it finds embedding vectors that are semantically similar.
This capability was initially developed in vector databases, but it is now also available in caching systems like Redis, and cloud service providers have introduced it in their cache offerings as well.
We can define a semantic cache as a cache that fetches entries based on the meaning of the content instead of the hash of the key.
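A minimal sketch of meaning-based matching with cosine similarity. The four-dimensional vectors here are toy values for illustration only; a real system would use embeddings from a model such as a sentence transformer.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three questions (illustrative values only).
emb_reset_pw  = [0.9, 0.1, 0.0, 0.2]  # "How do I reset my password?"
emb_change_pw = [0.8, 0.2, 0.1, 0.2]  # "How can I change my password?"
emb_refund    = [0.1, 0.9, 0.7, 0.0]  # "What is your refund policy?"

print(cosine_similarity(emb_reset_pw, emb_change_pw))  # high: similar meaning
print(cosine_similarity(emb_reset_pw, emb_refund))     # low: unrelated meaning
```

Two questions with different wording but the same intent land close together in the vector space, which is exactly what a semantic cache exploits.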
How a Semantic Cache is different
Let’s quickly differentiate between a traditional cache, a semantic cache, and RAG.
The following table helps clarify the differences between a Traditional Cache and a Semantic Cache.
| Factors | Traditional Cache | Semantic Cache |
|---|---|---|
| Approach | Key based | Meaning based |
| Key Search | Exact match | Similar match |
| Best For | APIs | LLMs |
| Example | Key, Value (Student_Id, Student_Obj) | Statements |
The following table helps clarify the differences between RAG and a Semantic Cache.
| Factors | RAG | Semantic Cache |
|---|---|---|
| Use | To generate a new response | To return a previously generated and cached answer |
| Data | External data sources like vector databases | Previously generated questions and answers |
| Cost | Medium to High | Low |
| Best For | Grounded response generation | Repeated but similar questions |
The difference between RAG and a semantic cache is simple, but the financial impact is big. RAG retrieves raw information and generates responses. If your questions are repetitive, caching answers wins because it bypasses retrieval, tool orchestration, and generation.
Semantic Cache based RAG Architecture
I’ve designed the following architecture for a semantic cache on top of the standard RAG architecture.
Components of the architecture
- User: From any client, like a browser or mobile application, a Q&A API is invoked once a user enters a question in natural language.
- API Gateway: The API request is authenticated and authorised at the Gateway.
- GraphRAG + VectorRAG Agent System (FastAPI)
  - A LangGraph ReAct agent orchestrates the workflow. It decides whether to check the cache, query the graph store or vector store, or respond to the user.
  - Guardrails: Implemented via system prompts and specific instruction sets to ensure safe and accurate responses.
- MCP (Model Context Protocol):
  - Semantic Cache MCP Client: Connects to the cache server using Server-Sent Events (SSE).
  - Semantic Cache MCP Server: A FastMCP server that exposes tools (cache_lookup, cache_store) and manages cache server (Redis) interactions.
- Semantic Cache Server (Redis): Acts as the semantic cache. It stores vector embeddings of questions and their corresponding answers to provide fast retrieval for similar queries.
- LLM: The reasoning engine for the agent; it generates Cypher queries for GraphRAG and the natural language responses.
- Observability (LangSmith): Provides tracing, monitoring, and debugging for the agent's execution steps.
- Knowledge Graph (Neo4j): Stores structured relationships (e.g., (Person)-[:WORKS_AT]->(Organization)) and is queried via Cypher generated by the LLM.
- Vector Store (ChromaDB): Stores the vector embeddings generated from the knowledge documents.
To implement semantic caching, the main addition to the standard RAG-based agent system is the new Semantic Cache MCP component.
- The agent first does a cache lookup.
- The Semantic Cache MCP client uses the cache_lookup tool to check whether the question has already been answered and cached.
- The Semantic Cache MCP server uses a CACHE_SIMILARITY_THRESHOLD of >95% to decide whether a similar question is cached. I tested similarity thresholds ranging from 80% to 99%. Below 90%, false positives increased and I started getting incorrect cached answers for loosely similar phrases. Above 97%, cache hit rates dropped significantly. In my use case, 95% provided the best trade-off between precision and reuse. The right balance, however, depends on your specific use case and data distribution.
- On a cache hit, the answer is returned without making any RAG call: no vector DB call, no knowledge graph call, and, most importantly, no LLM call.
- On a cache miss, the standard RAG flow executes and generates the response; the question and answer are then stored in the cache server for subsequent calls.
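The cache-first flow above can be sketched end to end in plain Python. This is a deliberate simplification of the MCP-based design: an in-memory list stands in for Redis, a character-histogram embed function stands in for a real embedding model, and rag_pipeline is a hypothetical placeholder for the GraphRAG/VectorRAG call.

```python
import math

CACHE_SIMILARITY_THRESHOLD = 0.95
_cache = []  # list of (embedding, question, answer); stands in for Redis

def embed(text):
    # Stand-in embedding: a letter-frequency histogram.
    # A real system would use a proper embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def rag_pipeline(question):
    # Hypothetical placeholder for the real GraphRAG/VectorRAG + LLM call.
    return f"Generated answer for: {question}"

def answer(question):
    q_emb = embed(question)
    # Cache lookup: return the cached answer on a sufficiently close match.
    for emb, _, cached_answer in _cache:
        if cosine(q_emb, emb) >= CACHE_SIMILARITY_THRESHOLD:
            return cached_answer, "cache_hit"
    # Cache miss: run the full pipeline, then store for subsequent calls.
    result = rag_pipeline(question)
    _cache.append((q_emb, question, result))
    return result, "cache_miss"

a1, status1 = answer("What is the refund policy?")   # miss: full RAG runs
a2, status2 = answer("What's the refund policy?")    # hit: slight rewording, same intent
print(status1, status2)
```

The second call returns the first call’s answer without touching the pipeline, which is exactly the behaviour the MCP tools implement against Redis.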
The following code snippet shows the MCP server code for cache lookup. I’ve used a KNN-based similarity search that returns only the single closest match. Redis returns the distance as score, so I use (1 - score) as the similarity. One more important point is the use of DIALECT 2, which is a must for vector queries in Redis; otherwise the query will fail.
```python
@mcp.tool()
def cache_lookup(question: str) -> str:
    """
    Look up a semantically similar question in the cache.

    Args:
        question: The question to look up

    Returns:
        JSON with cached answer if found (similarity >= 95%), or cache miss indication
    """
    global _hits, _misses
    client = get_redis_client()
    ensure_index_exists(client)

    # Compute embedding for the query
    query_embedding = compute_embedding(question)
    query_vector = np.array(query_embedding, dtype=np.float32).tobytes()

    # Vector similarity search
    try:
        q = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .return_fields("question", "answer", "expires_at", "score")
            .sort_by("score")
            .dialect(2)
        )
        results = client.ft(INDEX_NAME).search(
            q,
            query_params={"vec": query_vector}
        )

        if results.total > 0:
            doc = results.docs[0]
            score = float(doc.score)
            similarity = 1 - score  # Convert distance to similarity
            expires_at = float(doc.expires_at)

            # Check expiration
            if time.time() > expires_at:
                logger.info(f"Cache entry expired for: '{question[:50]}...'")
                # Delete expired entry
                client.delete(doc.id)
                _misses += 1
                return json.dumps({
                    "found": False,
                    "reason": "expired"
                })

            # Check similarity threshold
            if similarity >= CACHE_SIMILARITY_THRESHOLD:
                _hits += 1
                logger.info(f"Cache HIT: similarity={similarity:.4f} for '{question[:50]}...'")
                return json.dumps({
                    "found": True,
                    "answer": doc.answer.decode() if isinstance(doc.answer, bytes) else doc.answer,
                    "similarity": round(similarity, 4),
                    "original_question": doc.question.decode() if isinstance(doc.question, bytes) else doc.question
                })

        _misses += 1
        logger.info(f"Cache MISS for: '{question[:50]}...'")
        return json.dumps({
            "found": False,
            "reason": "no_similar_question"
        })
    except Exception as e:
        logger.error(f"Cache lookup error: {e}")
        _misses += 1
        return json.dumps({
            "found": False,
            "reason": f"error: {str(e)}"
        })
```
The following code snippet shows the MCP server code for cache store. I first compute the embedding of the question, which is stored along with the raw question and answer as the value; the key is a standard MD5 hash of the question.
```python
@mcp.tool()
def cache_store(question: str, answer: str) -> str:
    """
    Store a question-answer pair in the semantic cache.

    Args:
        question: The original question
        answer: The answer to cache

    Returns:
        JSON confirmation of storage
    """
    client = get_redis_client()
    ensure_index_exists(client)
    try:
        # Compute embedding
        embedding = compute_embedding(question)
        embedding_bytes = np.array(embedding, dtype=np.float32).tobytes()

        # Generate unique key
        key_hash = hashlib.md5(question.encode()).hexdigest()[:12]
        cache_key = f"cache:{key_hash}"
        current_time = time.time()

        # Store in Redis hash
        client.hset(cache_key, mapping={
            "question": question,
            "answer": answer,
            "embedding": embedding_bytes,
            "created_at": current_time,
            "expires_at": current_time + CACHE_TTL_SECONDS
        })

        # Set TTL on the key
        client.expire(cache_key, CACHE_TTL_SECONDS)

        logger.info(f"Cached answer for: '{question[:50]}...' (TTL: {CACHE_TTL_SECONDS}s)")
        return json.dumps({
            "stored": True,
            "key": cache_key,
            "ttl_seconds": CACHE_TTL_SECONDS
        })
    except Exception as e:
        logger.error(f"Cache store error: {e}")
        return json.dumps({
            "stored": False,
            "error": str(e)
        })
```
The following snapshot shows the caching of questions and answers, with their embeddings, in Redis.
The following snippet shows the MCP client code to invoke cache_lookup:

```python
def cache_lookup(self, question: str) -> dict:
    return _run_async(self._req_with_connection("cache_lookup", question))
```
The following snippet shows the MCP client code to invoke cache_store:

```python
def cache_store(self, question: str, answer: str) -> dict:
    return _run_async(self._req_with_connection("cache_store", question, answer))
```
The following trace shows us the execution flow.
The trace is highlighted at the cache_lookup call, where you can see cache_hit as false with the reason no_similar_question. In this flow, the standard RAG executes: you can see the graph_rag_query node in the trace, followed by an LLM call.
Now for the moment of truth: a similar question is asked again. In the following snapshot, cache_hit is true. There is a slight change in the wording of the question, but it is similar to the previous one, so the similarity score is 0.9866 as shown below. The answer is therefore retrieved directly from the semantic cache, and you will not see the RAG flow execute.
Result
Adding a semantic cache to a GraphRAG or VectorRAG system provided measurable performance and cost benefits. Previously, the system took 5 to 6 seconds per request, which included the vector search, the knowledge graph Cypher query generation and execution, and the final answer generation by the LLM. On a cache hit, answers were on the user's screen in 900 ms to 1.2 seconds. On average, this provided a 24% reduction in daily LLM calls, which slashed the LLM API bills.
Semantic Cache Strategy
Where Semantic Cache Wins
When a system handles high-volume, predictable traffic, a cache layer adds a clear performance benefit. Semantic caching wins when the same knowledge is asked for by many users:
- Customer support and helpdesks, with questions like "How do I reset my password?" or "What is your refund policy?"
- Product documentation, where users query the same product manuals, and developers query the same technical documentation and API docs.
- Internal HR and onboarding, where questions like "How do I enroll in health insurance?" and "What is the holiday schedule?" pop up from each employee.
- Compliance and policy Q&A, where the system provides standardized, vetted answers to regulatory or company policy inquiries.
When to Avoid Semantic Caching
It fails specifically in systems that provide hyper-personalization. If two users ask the same question but require different answers based on their medical history, profile, or account balance, a semantic cache can return incorrect answers. It also fails when your documents, knowledge base, or data are volatile or real-time: when you need answers about stock prices or live inventory, a cache is your enemy, because you need the RAG pipeline to see the current state of the world.
Semantic Cache Invalidation Strategy
A cache needs to be refreshed, and a semantic cache needs an additional invalidation strategy. Every cache entry must be tagged with a created_at timestamp to manage its time-to-live (TTL). In a semantic cache, the document_version and the embedding_model used during generation should also be tagged. This is critical because if you upgrade the embedding model, your vector space shifts, making the old cache entries obsolete. Tracking these variables lets us trigger cache invalidation whenever a source document is updated or the embedding model is changed, which ensures the system does not respond with inaccurate answers.
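A minimal sketch of this tagging and validity check. The field names follow the ones above; the model name and version string are illustrative assumptions.

```python
import time

# Illustrative values: your deployment would supply its own.
EMBEDDING_MODEL = "all-MiniLM-L6-v2"
DOC_VERSION = "2024-06-01"

def make_entry(question, answer, ttl_seconds=3600):
    # Tag each entry with TTL, source document version, and embedding model.
    now = time.time()
    return {
        "question": question,
        "answer": answer,
        "created_at": now,
        "expires_at": now + ttl_seconds,
        "document_version": DOC_VERSION,
        "embedding_model": EMBEDDING_MODEL,
    }

def is_valid(entry, current_doc_version, current_model, now=None):
    if now is None:
        now = time.time()
    if now > entry["expires_at"]:
        return False  # TTL expired
    if entry["document_version"] != current_doc_version:
        return False  # source document was updated
    if entry["embedding_model"] != current_model:
        return False  # embedding model changed: vector space shifted
    return True
```

On a hit, the lookup path checks is_valid before returning the answer; any mismatch is treated as a miss, and the stale entry can be evicted.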
Authorisation and Security Considerations
Any caching system must authorise the user before it returns data, and it must secure cached data at rest. A semantic cache implementation must adhere to these principles; it should not be seen as just a layer for performance or cost benefits.
In a multi-tenant system, tenant isolation keys must be implemented in the cache. Similarly, in a hyper-personalised system, user-level namespace partitioning must be applied. This guarantees that the vector similarity search is restricted to the bucket belonging to that specific user of the corresponding organisation.
The most important point about authorisation is the creation of permission-aware cache keys, where the key hash includes the user's roles or access levels. This ensures that a less privileged user does not get a cached answer generated for a more privileged one. A cache hit must not become an unauthorized data fetch mechanism: if a user cannot fetch a particular document or database row via the RAG pipeline, they must not be able to pull the generated answer from the cache.
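A minimal sketch of permission-aware key construction, following the MD5 key scheme used in cache_store; the tenant and role inputs are hypothetical.

```python
import hashlib

def permission_aware_key(tenant_id, user_roles, question):
    # Fold the tenant and the sorted role set into the hash so users with
    # different privileges never share a cache bucket (names are illustrative).
    scope = f"{tenant_id}:{','.join(sorted(user_roles))}"
    digest = hashlib.md5(f"{scope}:{question}".encode()).hexdigest()[:12]
    # Prefixing with the tenant also enables tenant-scoped index searches.
    return f"cache:{tenant_id}:{digest}"

k_admin = permission_aware_key("acme", ["admin"], "What is Q3 revenue?")
k_viewer = permission_aware_key("acme", ["viewer"], "What is Q3 revenue?")
print(k_admin != k_viewer)  # different privileges, different cache entries
```

For the vector lookup side, the same scoping has to be enforced in the similarity search itself, for example by filtering on a tenant field in the index, so that a KNN query never ranks another tenant's embeddings.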
Finally, to meet compliance requirements like HIPAA or SOC 2, answers must be encrypted before they are cached. This is not negotiable. The semantic cache must follow the same cryptographic practices as your primary data store.
Conclusion
The architecture blueprint, code snippets, and snapshots above show that it is possible to semantically cache questions and run a vector-embedding similarity search to return previously generated answers, avoiding RAG-based LLM calls for similar questions.
The example shown above is simple, but this architectural choice can be extended to the many LLM touch points in an Agentic AI system, not only to end-user questions and answers.
Semantic caching is a simple way to lower token costs while maintaining response quality. If your system answers similar queries repeatedly, adding a cache layer before the retrieval and generation steps is an advantage. The best part is that you don't have to overhaul your RAG architecture; you just optimize the flow by handling repetitive intent earlier in the chain.
