RAG Systems in Five Levels of Difficulty (With Full Code Examples)

Written by paoloap | Published 2026/01/24

TL;DR: Most RAG systems fail in production because “semantic similarity” isn’t the same as relevance, and retrieval breaks under real queries. This article outlines five escalating levels—naive RAG, smarter chunking with overlap and metadata, hybrid semantic + BM25 retrieval, cross-encoder reranking, and production guardrails that refuse or clarify when confidence is low—plus a testing approach to measure retrieval precision and answer accuracy. The core lesson: build, break, diagnose the failure mode, and level up until the system reliably grounds answers and knows when not to answer.

I’ve built RAG systems for internal docs, customer support, and compliance workflows. Most of them failed the first time. Some failed the second time, too.


Three months ago, I built one for our policy documentation. Embedded everything, wired up retrieval, connected it to GPT-4. Demo went great. Then, legal asked about our data retention policy. The system retrieved three chunks. Two were from a 2019 policy we’d replaced years ago. One was from an HR onboarding doc that mentioned “retention” in the context of employee retention. Completely different meaning, same word.


The answer it generated mixed outdated policy with irrelevant context. Confidently wrong. In front of the legal team. I rebuilt that system four times. Each rebuild fixed a specific failure mode I hadn’t anticipated. The gap between “works in demo” and “works in production” isn’t one thing. It’s five.

The Five Levels

  1. Naive RAG: The tutorial version. Breaks immediately on real queries.
  2. Smart Chunking: How you split documents determines what you can retrieve.
  3. Hybrid Search: When “semantically similar” isn’t the same as “actually relevant.”
  4. Reranking: A second pass that catches what the retrieval missed.
  5. Production RAG: What happens when retrieval fails? Don’t let the LLM improvise.

Alright, let’s dive in.

Level 1: Naive RAG

Embed your documents. Store vectors. Retrieve top-k by similarity. Generate.

from openai import OpenAI
import chromadb
client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")
def index_document(doc_id: str, text: str):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    collection.add(
        ids=[doc_id],
        embeddings=[response.data[0].embedding],
        documents=[text]
    )
def naive_rag(query: str, k: int = 3) -> str:
    # Embed query
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding
    
    # Retrieve
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=k
    )
    
    # Generate
    context = "\n\n".join(results["documents"][0])
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content


This is every RAG tutorial. It’s also where most RAG systems stop (ask me how I know).


Where it breaks: Semantic similarity isn’t relevance. Query “data retention policy” and you’ll retrieve chunks about “employee retention programs” because embeddings see the word overlap. The concepts are unrelated but the vectors are close.


You’ll also retrieve chunks that are about the right topic but don’t answer the question. Three chunks about data retention, none of which mention the specific policy you need.


If your demo works, it’s because you tested queries you already knew the answer to.

Level 2: Smart Chunking

Most RAG failures look like retrieval failures. They’re actually chunking failures.


If you split every 500 tokens, you’ll cut a policy statement in half. Question on one chunk, answer on another. You’ll separate context from conclusion. You’ll create chunks that can’t stand alone.


Chunk size matters more than you think:

  • Too small (100–200 tokens): Chunks lack context. “Delete after 90 days” means nothing without knowing what gets deleted.
  • Too large (1000+ tokens): Chunks contain multiple topics. Retrieval pulls in noise alongside signal.
  • Sweet spot (300–500 tokens): Enough context to be useful, focused enough to be relevant.


But size isn’t the real trick. Overlap is.

from langchain.text_splitter import RecursiveCharacterTextSplitter
# Note: by default this splitter measures chunk_size/chunk_overlap in characters,
# not tokens. For token-based sizing, LangChain also offers
# RecursiveCharacterTextSplitter.from_tiktoken_encoder(...).
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=100,  # This is the key
    separators=["\n\n", "\n", ". ", " ", ""]
)


The overlap means that if a sentence gets split, both chunks contain it. The answer that landed at a chunk boundary is now retrievable from either side.
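A quick way to see it (illustrative; policy_text stands in for whatever raw document string you’re splitting):

chunks = splitter.split_text(policy_text)  # policy_text: your raw document string

# With overlap, the tail of one chunk reappears at the head of the next,
# so a sentence that straddles a boundary lives in both chunks.
print(chunks[0][-120:])
print(chunks[1][:120])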


The metadata trick: Don’t just store text. Store where it came from.

def chunk_with_metadata(doc: str, source: str, doc_date: str) -> list[dict]:
    chunks = splitter.split_text(doc)
    return [
        {
            "text": chunk,
            "source": source,
            "date": doc_date,
            "section": extract_section_header(chunk),
        }
        for chunk in chunks
    ]


Now, when you retrieve chunks from 2019 alongside chunks from 2024, you can see it. Your prompt can say “prefer recent sources,” or your code can filter before generation.
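As a sketch of that filtering option, assuming dates are stored as ISO strings like "2024-03-01" (the cutoff below is arbitrary):

def prefer_recent(chunks: list[dict], cutoff: str = "2023-01-01") -> list[dict]:
    # Drop chunks older than the cutoff; ISO date strings compare correctly as text.
    recent = [c for c in chunks if c["date"] >= cutoff]
    # Fall back to everything rather than sending the model an empty context.
    return recent if recent else chunks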


(I covered data source selection for retrieval systems in more depth here.)

This alone fixed about 40% of our retrieval failures. Garbage in, garbage out. Better chunks meant better retrieval.

Level 3: Hybrid Search

Query: “What’s our PTO policy for employees with 5+ years tenure?”


Semantic search finds chunks about time-off policies generally. Conceptually similar. Keyword search finds chunks containing “5+ years” and “tenure.” Exact matches. Neither alone finds the right answer. Together they do.

from rank_bm25 import BM25Okapi
import numpy as np
class HybridRetriever:
    def __init__(self, documents: list[str]):
        self.documents = documents
        self.embeddings = self._embed_all(documents)
        
        # BM25 for keyword matching
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
    
    def _embed_all(self, docs: list[str]) -> list[list[float]]:
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=docs
        )
        return [d.embedding for d in response.data]
    
    def search(self, query: str, k: int = 5, alpha: float = 0.5) -> list[str]:
        # Semantic scores (normalized)
        q_emb = client.embeddings.create(
            model="text-embedding-3-small",
            input=query
        ).data[0].embedding
        
        sem_scores = np.dot(self.embeddings, q_emb)
        sem_scores = (sem_scores - sem_scores.min()) / (sem_scores.max() - sem_scores.min() + 1e-8)
        
        # BM25 scores (normalized)
        bm25_scores = np.array(self.bm25.get_scores(query.lower().split()))
        if bm25_scores.max() > 0:
            bm25_scores = bm25_scores / bm25_scores.max()
        
        # Combine: alpha controls semantic vs keyword weight
        combined = alpha * sem_scores + (1 - alpha) * bm25_scores
        
        top_k = np.argsort(combined)[::-1][:k]
        return [self.documents[i] for i in top_k]


Tuning alpha:

  • Domain-specific jargon (legal, medical, internal acronyms) → lower alpha, more BM25
  • Natural language questions → higher alpha, more semantic
  • Start at 0.5, adjust based on what queries fail


This isn’t glamorous. No one writes blog posts about BM25 anymore. But it catches retrieval failures that pure vector search misses, especially when your users type exact phrases they expect to find.
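As a rough illustration of how you might bias the knob per query type (the queries and alpha values below are made up, not tuned numbers):

retriever = HybridRetriever(documents)  # documents: your chunk texts

# Natural-language question: lean on the embeddings
results = retriever.search("How long do we keep customer records?", alpha=0.7)

# Exact phrase / internal jargon: lean on BM25
results = retriever.search("SOC 2 Type II attestation window", alpha=0.3)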

Level 4: Reranking

You’ve retrieved 5 chunks. They’re all about the topic. But which ones actually answer the question?

Embedding similarity is computed independently. Each document scored against the query in isolation. A reranker looks at query and document together and asks: “Does this document answer this question?”

from sentence_transformers import CrossEncoder
class RerankedRetriever:
    def __init__(self, documents: list[str]):
        self.hybrid = HybridRetriever(documents)
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    
    def search(self, query: str, k: int = 3) -> list[str]:
        # Get 20 candidates (cheap, fast)
        candidates = self.hybrid.search(query, k=20)
        
        # Rerank with cross-encoder (expensive, accurate)
        pairs = [(query, doc) for doc in candidates]
        scores = self.reranker.predict(pairs)
        
        # Return top k after reranking
        reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [doc for doc, _ in reranked[:k]]


Why this works: Cross-encoders can’t pre-compute document embeddings. They need query and document together. That makes them slow for initial retrieval (you can’t score 10,000 documents). But for reranking 20 candidates to pick 3? Perfect trade-off.

In my testing on internal docs, reranking improved “correct chunk in top 3” from 68% to 89%. Retrieval was finding relevant chunks, just not ranking them first.
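If you want to track that number yourself, a simple hit-rate check is enough. The expected_snippet field here is a hypothetical label you’d attach to each test query; it isn’t part of any class above:

def top_k_hit_rate(retriever, labeled_queries: list[dict], k: int = 3) -> float:
    # Fraction of queries whose expected snippet shows up in the top-k retrieved chunks.
    hits = 0
    for case in labeled_queries:
        retrieved = retriever.search(case["query"], k=k)
        if any(case["expected_snippet"] in chunk for chunk in retrieved):
            hits += 1
    return hits / len(labeled_queries)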


Worth noting: reranking won’t save bad retrieval. If the right chunk isn’t in your 20 candidates, reranking can’t surface it. Fix Levels 2 and 3 first.

Level 5: Production RAG

Everything above improves retrieval quality. Production RAG handles what happens when retrieval fails anyway.

Because it will fail. A user will ask something your documents don’t cover. Or your chunking will miss the relevant section. Or the question will be ambiguous enough that retrieved chunks conflict.


The question isn’t “how do I prevent retrieval failures?” It’s “what does my system do when retrieval fails?”

Guardrails


Don’t let the LLM improvise when it doesn’t have good context.


Air Canada learned this the hard way — they lost a lawsuit because their chatbot hallucinated a refund policy that didn’t exist (Full breakdown here).

def guarded_rag(query: str, retriever, min_score: float = 0.6) -> str:
    # Assumes the retriever exposes search_with_scores(), returning
    # (chunk_metadata, relevance_score) pairs sorted by score, highest first.
    results = retriever.search_with_scores(query, k=3)
    
    # Check: Do we have ANY confident results?
    top_score = results[0][1] if results else 0
    if top_score < min_score:
        return (
            "I don't have enough information to answer that confidently. "
            "Could you rephrase, or is there a specific document I should look at?"
        )
    
    # Check: Are sources from different time periods?
    dates = [r["date"] for r, _ in results]
    date_warning = ""
    if len(set(dates)) > 1:
        newest = max(dates)
        if any(d < newest for d in dates):
            date_warning = "\n\n[Note: Some sources are older. The most recent policy takes precedence.]"
    
    # Generate with explicit grounding instruction
    context = "\n\n---\n\n".join([r["text"] for r, _ in results])
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": f"""Answer based ONLY on the provided context.
If the context doesn't contain enough information, say so explicitly.
Never infer or make up information not directly stated.

Context:
{context}"""
            },
            {"role": "user", "content": query}
        ]
    )
    
    return response.choices[0].message.content + date_warning

Evaluation

You can’t improve what you can’t measure. Build a test set of queries with known-good answers.

test_cases = [
    {
        "query": "What's our data retention policy for customer records?",
        "must_retrieve": ["data-retention-policy-2024.md"],
        "answer_must_contain": ["7 years", "deletion request"],
        "answer_must_not_contain": ["2019", "employee retention"]
    },
    # ... 50+ more cases covering your actual use cases
]


Run this on every change. Track retrieval precision (did we get the right docs?) and answer accuracy (did the answer contain the right facts?). When a metric drops, you know exactly what broke.
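Here’s a minimal sketch of a harness over that test set. It assumes your retriever can return chunk metadata with a "source" field (the search_with_metadata call and answer_rag function are placeholders for whatever your pipeline exposes):

def evaluate(answer_rag, retriever, test_cases: list[dict]) -> dict:
    retrieval_hits, answer_hits = 0, 0
    for case in test_cases:
        # Hypothetical: returns chunk dicts carrying the metadata added in Level 2
        chunks = retriever.search_with_metadata(case["query"], k=3)
        sources = {c["source"] for c in chunks}
        if all(doc in sources for doc in case["must_retrieve"]):
            retrieval_hits += 1

        answer = answer_rag(case["query"])
        ok = all(s in answer for s in case["answer_must_contain"])
        ok = ok and not any(s in answer for s in case["answer_must_not_contain"])
        answer_hits += ok

    n = len(test_cases)
    return {"retrieval_precision": retrieval_hits / n, "answer_accuracy": answer_hits / n}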


Even with all this, you’ll get edge cases. Users phrase things in ways you didn’t anticipate. Documents contain contradictions you didn’t know about.


You’re going to miss edge cases. Make sure the system admits it instead of making things up.




Remember that legal team demo? The one where my RAG system confidently mixed outdated policy with irrelevant context?


They use the system every day now. It refuses to answer when it’s not confident. It flags when sources conflict or are outdated. It handles ambiguous queries by asking for clarification instead of guessing.


It took four rebuilds. Each one fixed a failure mode I didn’t anticipate the first time. That’s the process: build, break, understand why, level up.

Start at Level 1. Watch where it breaks. Move up only when you understand why it broke.


That’s how you build RAG systems that actually work.


Building something with RAG?

If you’re wiring retrieval into agents, my LangGraph guide covers the orchestration side: The Complete Guide to Building Your First AI Agent with LangGraph. It’s been read over 100K times and covers everything from basic graphs to production deployment.
