Graph Theory-Based Semantic Caching: Scaling LLM Applications

Written by mnjkshrm_86h0lvqo | Published 2025/08/05
Tech Story Tags: llms | similarity-search | ai | graph-theory | redis | prompt-caching | semantic-search | llm-applications

TL;DR: A unique approach combining graph algorithms with Redis-native storage to perform efficient LLM response caching and retrieval.

While analyzing query patterns in LLM-powered applications, I discovered a troubling trend that many developers overlook. Applications process thousands of semantically identical questions with completely different phrasings. "How do I reset my password?" appears alongside "What's the password recovery process?" and "I forgot my password, help me recover it?" - all essentially the same request, but each triggering separate LLM API calls. The math quickly becomes painful as each query costs money.

Traditional string-based caching systems fail to recognize these semantic similarities. Every slightly different phrasing results in a fresh API call, causing costs to spiral upward with what should be cached responses. This is where the real challenge lies - how do you cache responses for questions that mean the same thing but are worded differently?

This challenge led me to explore an unconventional approach - combining graph theory with semantic similarity search. Instead of treating cached responses as isolated entries, what if they formed a connected graph where semantically similar queries could reference each other? Early exploration proved promising.

Graph-based semantic caching could theoretically reduce both API costs and response latency while scaling more efficiently than traditional approaches. This article dives into the technical implementation using Redis and the surprising performance characteristics that emerged from testing.

Why String Matching Kills Your Cache Hit Rate

Many simple caching mechanisms in use today look like this:

import hashlib

class SimpleCache:
    def __init__(self):
        self.cache = {}
    
    def get(self, query: str):
        key = hashlib.md5(query.encode()).hexdigest()
        return self.cache.get(key)
    
    def set(self, query: str, response: str):
        key = hashlib.md5(query.encode()).hexdigest()
        self.cache[key] = response

# Usage
cache = SimpleCache()
cache.set("How do I reset my password?", "Go to Settings > Security...")

# These all miss the cache:
cache.get("What's the password recovery process?")  # None
cache.get("I forgot my password, help me recover it?")  # None
cache.get("Password reset procedure please")  # None

The brutal reality: These queries are asking for identical information, but your cache treats them as completely different requests. Each cache miss costs you an LLM API call.

The fundamental problem? String matching ignores meaning. We needed something that understands that "reset password" and "password recovery" are semantically equivalent.
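
Embeddings capture exactly that kind of equivalence. A quick sketch of the idea - assuming the sentence-transformers library and its all-MiniLM-L6-v2 model, neither of which is prescribed by the original setup - shows paraphrases scoring close together while unrelated queries score far apart:

from sentence_transformers import SentenceTransformer, util

# Assumed embedding model; any model that maps text to vectors works
model = SentenceTransformer("all-MiniLM-L6-v2")

queries = [
    "How do I reset my password?",
    "What's the password recovery process?",
    "What's the weather like today?",
]
embeddings = model.encode(queries)

# Paraphrases land close together; the unrelated query does not
print(util.cos_sim(embeddings[0], embeddings[1]))  # high similarity
print(util.cos_sim(embeddings[0], embeddings[2]))  # low similarity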

The Breakthrough: Graph Theory Meets Redis

The breakthrough came from treating queries as a connected graph. Each query becomes a node, and edges connect semantically similar queries with similarity scores as weights. Instead of checking 100 cached items linearly, you might only check 12 strategically chosen nodes. Graph traversal scales sub-linearly while maintaining high accuracy.

Why this works: Similar queries naturally cluster together. If "reset password" is similar to your query, its neighbors like "password recovery" probably are too.

Redis as Your Graph Engine

Redis's native sorted sets and hashes make it surprisingly well suited for graph operations without needing a specialized graph database. Here's how the data structures map:

# Nodes: Store query data as Redis hashes
# Key: "node:abc123"
redis.hset("node:abc123", mapping={
    "query": "How do I reset my password?",
    "response": "Go to Settings > Security...",
    "embedding": "[0.1, 0.4, -0.2, ...]",  # JSON-encoded array
    "timestamp": "1642534800"
})

# Edges: Store neighbors as Redis sorted sets (score = similarity)
# Key: "edges:abc123" 
redis.zadd("edges:abc123", {
    "def456": 0.85,  # "password recovery" query with 0.85 similarity
    "ghi789": 0.72,  # "forgot password" query with 0.72 similarity
    "jkl012": 0.68   # "login help" query with 0.68 similarity
})
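
Reading the structures back is just as direct. A quick sketch, assuming a standard redis-py client named redis created with decode_responses=True:

# Fetch the full node record (query, response, embedding, timestamp)
node = redis.hgetall("node:abc123")

# Fetch the top 2 most similar neighbors, highest similarity first,
# along with their similarity scores
top_neighbors = redis.zrevrange("edges:abc123", 0, 1, withscores=True)
# e.g. [("def456", 0.85), ("ghi789", 0.72)]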

Strategic Graph Building

Here's the key insight: don't connect every node to every other node. That's O(n²) and kills performance. Instead, be selective about which connections to create - connect each new node to a small, strategic subset of existing nodes.

import hashlib
import json
import random

def add_query_to_graph(new_query, response):
    # Stable key for the node (consistent with the md5 keys used earlier)
    query_hash = hashlib.md5(new_query.encode()).hexdigest()
    embedding = get_embedding(new_query)
    
    # Strategy 1: Connect to recent nodes (likely relevant)
    recent_nodes = redis.lrange("recent_nodes", 0, 9)  # Last 10 nodes
    
    # Strategy 2: Sample random nodes for diversity
    all_nodes = list(redis.smembers("all_nodes"))
    random_sample = random.sample(all_nodes, 10) if len(all_nodes) > 20 else all_nodes
    
    candidates = set(recent_nodes) | set(random_sample)
    candidates.discard(query_hash)  # avoid a self-edge if the query was cached before
    
    # Only calculate similarity for this strategic subset
    for existing_hash in candidates:
        existing_data = redis.hgetall(f"node:{existing_hash}")
        if not existing_data:
            continue
        similarity = cosine_similarity(embedding, json.loads(existing_data["embedding"]))
        
        if similarity > 0.1:
            # Create bidirectional edges with similarity weights
            redis.zadd(f"edges:{query_hash}", {existing_hash: similarity})
            redis.zadd(f"edges:{existing_hash}", {query_hash: similarity})
    
    # Store the new node
    redis.hset(f"node:{query_hash}", mapping={
        "query": new_query,
        "response": response,
        "embedding": json.dumps(embedding.tolist())
    })
    
    # Keep the candidate pools current so future inserts can find this node
    redis.lpush("recent_nodes", query_hash)
    redis.ltrim("recent_nodes", 0, 9)  # retain only the last 10
    redis.sadd("all_nodes", query_hash)

[Figure: the semantic graph after several queries have been added - semantically similar queries cluster together as connected nodes.]

Smart Graph Traversal

Search becomes intelligent graph walking. The algorithm leverages pre-computed edge weights to prioritize node exploration, turning an O(n) linear search into a guided traversal that follows similarity gradients. This lets you skip entire regions of semantically unrelated cached queries.

def find_similar_cached(query, threshold=0.7):
    query_embedding = get_embedding(query)
    
    def check_similarity(node_hash):
        # Compare the incoming query against a cached node's stored embedding
        node_data = redis.hgetall(f"node:{node_hash}")
        if not node_data:
            return None, 0.0
        cached_embedding = json.loads(node_data["embedding"])
        return node_data["response"], cosine_similarity(query_embedding, cached_embedding)
    
    # Start with promising candidates
    recent_nodes = redis.lrange("recent_nodes", 0, 2)  # Last 3 nodes
    
    for start_node in recent_nodes:
        # Check the starting node
        response, similarity = check_similarity(start_node)
        if similarity > threshold:
            return response
        
        # Follow the strongest edges (highest-similarity neighbors)
        neighbors = redis.zrevrange(f"edges:{start_node}", 0, 1)  # Top 2 neighbors
        
        for neighbor in neighbors:
            response, similarity = check_similarity(neighbor)
            if similarity > threshold:
                return response
    
    return None  # Cache miss
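
Putting the two pieces together, the cache-or-call flow looks roughly like this - call_llm is a hypothetical stand-in for whatever LLM client the application already uses:

def answer(query: str) -> str:
    # 1. Try the semantic cache first
    cached = find_similar_cached(query, threshold=0.7)
    if cached is not None:
        return cached  # cache hit: no API call needed
    
    # 2. Cache miss: pay for one LLM call, then remember the answer
    response = call_llm(query)  # hypothetical LLM client
    add_query_to_graph(query, response)
    return response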

The magic: Instead of checking 100+ cached queries linearly, you check ~12 strategically chosen nodes. The graph structure guides you to relevant clusters while skipping unrelated areas.

Redis advantages: Sorted sets give you instant access to top-k neighbors. Hash storage keeps node data together. Everything scales horizontally.

Performance Results

Testing this approach with 200 diverse queries revealed the power of graph-based semantic caching. The results were striking:

  • Search Efficiency: The graph algorithm checked an average of 12.1 nodes compared to 42 nodes required for linear search - a 71.3% reduction in computational overhead. This translates to a 3.5x speedup in cache lookup operations.

  • Cost Impact: With a 44.8% cache hit rate on semantic matches, LLM API costs dropped significantly. Instead of 210 expensive API calls, only 116 were needed - saving 44.8% in operational costs.

  • Scalability: Most importantly, this efficiency improves as your cache grows. Linear search gets slower with more cached items, but graph traversal maintains consistent performance by intelligently skipping irrelevant regions.

Production Considerations

When to use graph-based caching:

  • High query volume with semantic variations (customer support, documentation, FAQs)
  • Cost-sensitive applications where LLM API bills matter
  • Scenarios where slight response delays (50-100ms) are acceptable for huge cost savings

Configuration tips (collected into a small config sketch after this list):

  • Similarity threshold of 0.7 works well for most use cases
  • Connect each node to 10-15 neighbors for optimal graph connectivity
  • Use embeddings with 512 dimensions to balance accuracy and storage
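
A minimal way to capture those recommendations in code - the names and defaults below are illustrative, not taken from the original repository:

from dataclasses import dataclass

@dataclass
class SemanticCacheConfig:
    similarity_threshold: float = 0.7   # cache-hit cutoff that works well for most use cases
    max_neighbors: int = 15             # connect each node to 10-15 neighbors
    embedding_dimensions: int = 512     # balances accuracy and storage
    recent_start_nodes: int = 3         # traversal starting points (last N cached queries)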

Conclusion

The bottom line: Graph theory transforms semantic caching from a brute-force problem into an intelligent search challenge. By treating similar queries as connected neighbors rather than isolated strings, you can dramatically reduce both costs and latency while maintaining high accuracy.

This approach opens new possibilities for efficient semantic search at scale, proving that sometimes the best solution isn't better algorithms - it's better data structures.

Repository: https://bitbucket.org/agent-projects/semcache/

