The promise was elegant. Feed your LLM a curated stream of external documents—invoices, support tickets, internal wikis—and watch it trade hallucination for grounded fact. Retrieval-Augmented Generation would transform the fuzzy oracle into something you could actually deploy in production. Except it doesn't work that way. Not really.
I've watched RAG pipelines fail in three distinct modes, and rarely is the model itself the culprit. The failure surface lives upstream: in the corpus, the chunking heuristic, the index that hasn't been rebuilt since Q2, the embedding model someone picked because it was fastest to onboard. When a RAG system gives you garbage, that garbage usually entered at ingest or retrieval. The LLM is just the messenger.
What we mean when we say "retrieval"
RAG grafts a single intelligent layer—the generator—onto what is fundamentally a search problem. You index documents (PDFs, Markdown, scraped HTML), convert them to vectors, store those vectors in Pinecone or Weaviate or a Postgres extension, then at query time retrieve the top-k semantically similar chunks and inject them into the prompt. The model synthesizes an answer from that context window.
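Concretely, the whole loop fits in a short sketch. This is illustrative, not a reference implementation: embed() and the in-memory CHUNKS and VECTORS are stand-ins for whatever embedding model and vector store you actually run.

```python
import numpy as np

# Placeholders for the real pieces: swap in your embedding model and vector store.
def embed(text: str) -> np.ndarray:
    raise NotImplementedError("call your embedding model here")

CHUNKS: list[str] = []           # chunk text, in index order
VECTORS = np.zeros((0, 768))     # one row per chunk, L2-normalized at ingest time

def retrieve(query: str, k: int = 5) -> list[str]:
    """Top-k chunks by cosine similarity (dot product of normalized vectors)."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = VECTORS @ q
    top = np.argsort(scores)[::-1][:k]
    return [CHUNKS[i] for i in top]

def build_prompt(query: str, k: int = 5) -> str:
    """Inject the retrieved chunks into the prompt the generator sees."""
    context = "\n\n".join(retrieve(query, k))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
```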
Simple on a whiteboard. Fragile in practice.
Because retrieval is not a solved problem. It's the same challenge Google spent two decades refining: given a query, surface the right documents. Except now you're doing it with 512-token chunks of your internal Confluence, embedded with a model that has never seen your domain vocabulary, ranked by cosine similarity in a 768-dimensional space you can't visualize. And if the retriever misses the relevant chunk—maybe it's worded differently, maybe it got split mid-sentence, maybe it's three versions stale—the LLM hallucinates with confidence. It has to. It doesn't know what it doesn't have.
Where it fractures
Chunking is where good intent meets reality. You can't feed an entire 200-page manual into a context window, so you split it. But how? Sentence boundaries? Paragraph breaks? Fixed 500-token windows with 10% overlap? I've seen pipelines that chunk on double newlines, which works until someone pastes a bulleted list. I've seen semantic chunking that tries to detect topic shifts, which works until the detector confuses a subordinate clause for a transition.
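For reference, the two most common heuristics are only a few lines each. A sketch, with illustrative numbers; the blank-line splitter is exactly the one that shatters bulleted lists.

```python
def chunk_on_blank_lines(text: str) -> list[str]:
    # The "double newline" heuristic: fine for prose paragraphs,
    # turns a bulleted list into a spray of one-line chunks.
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_fixed(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    # Fixed windows with overlap, so a sentence cut at one boundary
    # still appears whole in the neighboring chunk.
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```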
Here's the thing: bad chunking orphans meaning. A compliance answer might span two chunks, and if you retrieve chunk A but not chunk B, you get half a rule. The model doesn't know there's a continuation. It completes the thought with plausible-sounding nonsense, and your user accepts it because it's formatted like truth.
Metadata is the other half of the equation. You need timestamps, authors, document types, version hashes—anything that lets you filter before the semantic search happens. Otherwise you're ranking a 2019 policy document equal to this morning's Slack thread, and the old one might rank higher because it's longer, more formal, denser with keywords. I once debugged a system that kept citing a deprecated API endpoint because the old docs had better SEO juice in embedding space. The current docs were terse. Terser is often correct, but embeddings don't know that.
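What "metadata at ingest" can look like, as a minimal sketch; the field names are illustrative, not a schema any particular vector store requires.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChunkRecord:
    text: str
    doc_id: str
    doc_type: str          # e.g. "policy", "runbook", "slack"
    author: str
    updated_at: datetime   # the source document's last-modified time, not ingest time
    version_hash: str      # lets you tell a current copy from a stale one

def make_record(text: str, doc_id: str, doc_type: str, author: str,
                updated_at: datetime, source_bytes: bytes) -> ChunkRecord:
    # Hash the source document so edits produce a new version_hash downstream.
    return ChunkRecord(
        text=text, doc_id=doc_id, doc_type=doc_type, author=author,
        updated_at=updated_at,
        version_hash=hashlib.sha256(source_bytes).hexdigest()[:16],
    )
```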
Stale indexes are the silent killer. Your corpus changes—new support articles, updated schemas, deleted obsolete guides—but the vector index doesn't rebuild automatically. Not unless you've engineered that loop. I've seen production systems running on three-month-old embeddings, answering questions about a feature that shipped two weeks ago with "that's not available yet" because the ingestion cron is broken and nobody's monitoring it. The model is working perfectly. The pipeline is not.
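One cheap safeguard is a freshness check that compares the source system's last-modified times against when each document was last embedded. A sketch, assuming you can export both sets of timestamps:

```python
from datetime import datetime, timedelta

def stale_docs(source_updated_at: dict[str, datetime],
               indexed_at: dict[str, datetime],
               max_lag: timedelta = timedelta(days=1)) -> list[str]:
    """Doc IDs that were edited in the source but not re-embedded within max_lag."""
    stale = []
    for doc_id, updated in source_updated_at.items():
        embedded = indexed_at.get(doc_id)
        if embedded is None or updated - embedded > max_lag:
            stale.append(doc_id)   # never ingested, or the ingestion loop fell behind
    return stale
```

Wire the output into whatever alerting you already have; the point is that someone finds out before a user does.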
The illusion of intelligence
RAG feels like AI because the final step is generative. The model writes fluent prose, and fluency is persuasive. But fluency is not correctness. If the retriever handed it four irrelevant chunks and one barely relevant one, the model does its best. It extrapolates. It smooths over gaps. It produces an answer that sounds authoritative because that's what it's been trained to do.
This is why I treat RAG like a search engine first and a language model second. When recall drops—when relevant documents stop surfacing in the top-5—hallucination risk spikes. You can't monitor this by reading model outputs. You need retrieval metrics: precision at k, mean reciprocal rank, coverage over known question-document pairs. If your click-through rate (or in this case, retrieval-to-answer coherence) is falling, the model's confidence score won't tell you. The search layer is degrading quietly.
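Given a labeled set of question-document pairs, both metrics are a few lines each. A sketch, where results holds the ranked doc IDs your retriever returned per query and relevant holds the IDs a human marked as correct:

```python
def recall_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for docs, rel in zip(results, relevant) if rel & set(docs[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Average of 1/rank of the first relevant doc; contributes 0 when none surfaces."""
    total = 0.0
    for docs, rel in zip(results, relevant):
        for rank, doc_id in enumerate(docs, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)
```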
What you fix on Monday morning
Start with the corpus. Is it complete? I mean actually complete, not "we onboarded the public docs" complete. Are there internal runbooks, email threads, or Slack archives that contain ground truth? If the knowledge doesn't exist in the index, RAG can't retrieve it. This is not an AI problem. This is a knowledge management problem, which means it's political and organizational and boring, but it's also the highest-leverage fix.
Then audit your chunking. Take ten queries you care about, find the documents that should answer them, and manually inspect which chunks get created. Are the relevant sentences split across chunks? Are there orphaned bullet points? Is critical context in a header that got chunked separately from the body? You might need overlapping windows. You might need to preserve section headers as metadata. You might need to chunk semantically, using a small model to detect topic boundaries. There's no universal answer. It depends on your document structure.
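The audit itself can be a throwaway script. A sketch, assuming you've hand-built a map from each query to the sentence that should answer it:

```python
def audit_chunks(expected_spans: dict[str, str], chunks: list[str]) -> None:
    """Report whether each answer-bearing span survived chunking intact."""
    for query, span in expected_spans.items():
        intact = any(span in chunk for chunk in chunks)
        # Crude split detection: the first half of the span is present, the whole span isn't.
        partial = any(span[: len(span) // 2] in chunk for chunk in chunks)
        if intact:
            status = "intact"
        elif partial:
            status = "SPLIT across chunks"
        else:
            status = "MISSING from the index"
        print(f"{query!r}: {status}")
```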
Next: embedding drift. If you're using text-embedding-ada-002 because it was the default when you started, consider whether a domain-tuned model (or even a newer general model) would improve semantic matching. Measure this. Take an eval set of query-document pairs, swap in the new embedder, re-rank, and check if recall improves. If it doesn't, don't switch. If it does, the re-embedding job is probably worth the infra cost.
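The comparison harness is the same for any pair of embedders. A sketch, where embed is whatever callable wraps the model under test and relevant marks the correct doc indices per query:

```python
import numpy as np

def recall_for_embedder(embed, queries: list[str], docs: list[str],
                        relevant: list[set[int]], k: int = 5) -> float:
    """Embed docs and queries with one model, rank by cosine, return recall@k."""
    D = np.array([embed(d) for d in docs], dtype=float)
    D = D / np.linalg.norm(D, axis=1, keepdims=True)
    hits = 0
    for q_text, rel in zip(queries, relevant):
        q = np.asarray(embed(q_text), dtype=float)
        q = q / np.linalg.norm(q)
        top = set(np.argsort(D @ q)[::-1][:k].tolist())
        hits += bool(top & rel)
    return hits / len(queries)

# Run it once with the incumbent embedder, once with the candidate,
# and only pay for the re-embedding job if the second number is clearly higher.
```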
Filters matter more than people think. If you can tag documents by category, recency, author trust level, or internal-vs-external, you can shrink the search space before semantic ranking happens. A query about "current pricing" should filter to docs.created_after(today - 90 days) before it ever hits the vector store. A question about internal process should exclude public marketing pages. These are crude heuristics, but they prevent the model from wasting context window on wrong-but-confident answers.
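A sketch of that pre-filter step, assuming each record carries the metadata attached at ingest; the field names are illustrative.

```python
from datetime import datetime, timedelta, timezone

def prefilter(records: list[dict], doc_types: set[str] | None = None,
              max_age: timedelta | None = None,
              internal_only: bool = False) -> list[dict]:
    """Shrink the candidate set with cheap metadata checks before any vector math."""
    now = datetime.now(timezone.utc)
    kept = []
    for r in records:
        if doc_types and r["doc_type"] not in doc_types:
            continue
        if max_age and now - r["updated_at"] > max_age:   # updated_at assumed tz-aware
            continue
        if internal_only and not r.get("internal", False):
            continue
        kept.append(r)
    return kept

# "Current pricing" query: only docs touched in the last 90 days, nothing external.
# candidates = prefilter(records, doc_types={"pricing"}, max_age=timedelta(days=90))
```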
Finally: refresh cadence. How often does your index rebuild? Daily? Weekly? On-demand when a doc changes? If you're doing batch re-indexing, you're accumulating staleness debt. If you're doing event-driven incremental updates, you need tombstones for deletions and versioning for edits. I've shipped both. The incremental approach is harder to build but better in production, assuming you actually log and monitor the update queue. If you don't, you'll end up with phantom documents—deleted from the source, still live in the index—and no easy way to debug which answer came from where.
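A sketch of the incremental path, with tombstones for deletions and version checks for out-of-order events; the event shape is illustrative.

```python
from dataclasses import dataclass

@dataclass
class DocEvent:
    doc_id: str
    version: int
    deleted: bool = False
    text: str = ""

def apply_event(index: dict[str, dict], event: DocEvent) -> None:
    """Event-driven incremental update with tombstones and version checks."""
    current = index.get(event.doc_id)
    if current and current["version"] >= event.version:
        return  # out-of-order or duplicate event; the index already has newer data
    if event.deleted:
        # Tombstone instead of a silent delete, so you can audit what disappeared and when.
        index[event.doc_id] = {"version": event.version, "tombstone": True}
    else:
        index[event.doc_id] = {
            "version": event.version,
            "tombstone": False,
            "text": event.text,   # re-embedding happens downstream; omitted here
        }
```

Retrieval then skips anything flagged as a tombstone, which is what keeps phantom documents out of answers.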
The model is the last 10%
This isn't an argument against LLMs. Generation quality matters. A stronger model can synthesize better answers from mediocre chunks, can handle ambiguity, can refuse when context is insufficient. But I've seen teams throw GPT-4 Turbo at a RAG pipeline and still get poor results because the retriever is handing it irrelevant context. I've also seen teams get step-function improvements by switching from naive keyword search to even a modest embedding model, with no change to the generator.
The ceiling is low when the data is bad. If your corpus is incomplete, no amount of prompt engineering will summon facts that aren't there. If your chunks are malformed, the model will stitch together fragments into plausible-sounding lies. If your index is stale, you'll answer yesterday's questions with confidence today.
RAG is a search problem with a generative interface. Treat it as such. Monitor retrieval quality. Version your embeddings. Rebuild your index before it rots. And when the system fails—not if, when—check the data pipeline first. The model is probably fine. The model is almost always fine.
