RAG Is a Data Problem Pretending to Be AI

Written by davidiyanu | Published 2026/04/09
Tech Story Tags: rag | cicd | security | data-engineering | chunking | rag-systems | rag-and-ai | semantic-chunking

TL;DR: RAG fails at the pipeline, not the model. Fix your chunks, freshen your index, rerank before you generate, and actually instrument retrieval separately from generation. You can't prompt your way out of a data problem.

There's a particular kind of post-mortem that happens inside AI teams around month three of a RAG deployment. The model is GPT-4 or Claude or something equally capable. The embeddings came from a well-regarded provider. The vector store is the one with the best benchmark numbers. And yet — users are getting confidently wrong answers, support tickets are multiplying, and someone in a leadership meeting has started using the phrase "hallucination problem."


It isn't a hallucination problem. It was never a hallucination problem.


What it is, almost invariably, is a retrieval problem dressed in a lab coat.


RAG — Retrieval-Augmented Generation — operates on a deceptively simple premise: instead of asking a language model to recall facts from parametric memory, you retrieve relevant documents at inference time and hand them to the model as context. The model's job narrows: read what you're given, synthesize an answer. This should, in theory, make the system more grounded, more accurate, more auditable. And it does — when the retrieval works. The brutal part is that retrieval is far harder to get right than generation, and most teams discover this in the wrong order.


The marketing around these systems focuses on the language model because the language model is the impressive thing. It speaks fluently, reasons visibly, and adapts to tone. The retrieval pipeline — the chunking, the embedding, the indexing, the ranking — is infrastructure. Boring infrastructure. The kind that doesn't get demo'd at conferences because watching a cosine similarity score update in real time is nobody's idea of a keynote moment. But that infrastructure is where RAG either earns its keep or quietly collapses.


Start with chunking, because chunking is where most systems first go wrong, and it's where the wrongness is most invisible.


When you ingest a document corpus, you have to break it into pieces. The embedding model has a context window — typically somewhere between 512 and 8,192 tokens, depending on the model — and you need to fit retrievable units inside that window. The naive approach, and the one that ships in most quickstart tutorials, is fixed-size chunking: split every 512 tokens, maybe with a 50-token overlap, call it done. This is fast. It is also, in many domains, genuinely terrible.
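The quickstart version is short enough to write out in full. This is a minimal sketch of fixed-size chunking with overlap; in a real pipeline `tokens` would come from the embedding model's tokenizer, but here it can be any list.

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token sequence into fixed-size chunks with overlap.

    The overlap means each chunk repeats the tail of the previous
    one, which partially mitigates (but does not solve) boundary
    splits in the middle of a claim.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```

Note that the boundaries fall wherever the arithmetic says they fall, with no regard for sentences, paragraphs, or meaning. That indifference is exactly the failure mode described next.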


Consider a technical specification document. The relevant claim — say, the maximum operating temperature of a component — might appear in sentence eight of a paragraph. Sentence seven establishes the testing conditions. Sentence nine qualifies the measurement. A 512-token chunk starting at sentence five and ending at sentence twelve probably captures all three.


But if the chunk boundary falls between sentences seven and eight, you've separated the claim from its operating conditions. The embedding of the truncated chunk will be semantically close enough to a query about operating temperature to get retrieved. The model then reads a fragment that says the maximum temperature is 85°C, without understanding that this figure applies only under a specific load condition, and it generates a confident, complete, wrong answer.


This is not a hallucination. This is a retrieval artifact. The model read exactly what you gave it and did exactly what you asked.


Semantic chunking — where splits are determined by topic boundaries rather than token counts — helps, but it's slower and requires either a secondary model call or heuristics that are themselves wrong in edge cases. Hierarchical chunking, where you maintain parent-child relationships between document sections and chunks, helps more. The systems that actually perform well in production tend to use a combination: semantic splits for structure-heavy documents, smaller dense chunks for factual reference material, and explicit metadata fields that let the retriever filter before it ranks. None of this is easy to implement. All of it is worth the effort.
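To make the contrast concrete, here is a heuristic sketch of semantic-ish chunking: split on paragraph boundaries, then greedily pack paragraphs into chunks up to a token budget. The word-count `token_len` is an assumption standing in for a real tokenizer, and paragraph breaks are a crude proxy for true topic boundaries.

```python
def semantic_chunks(text, max_tokens=512, token_len=None):
    """Split on paragraph boundaries, then pack paragraphs into
    chunks up to max_tokens, so a claim and its qualifying
    sentences in the same paragraph stay together.
    """
    token_len = token_len or (lambda s: len(s.split()))  # crude proxy
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        plen = token_len(para)
        # Flush the current chunk if adding this paragraph would overflow.
        if current and current_len + plen > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += plen
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A hierarchical variant would additionally record which section each chunk came from, so the retriever can pull the parent context when a child chunk matches.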


The embedding problem is subtler and, honestly, more insidious, because it tends to manifest as a vague, hard-to-reproduce quality degradation rather than an obvious failure mode.


Embedding models are trained on corpora with particular statistical properties. When you embed a query against a corpus, you're relying on the assumption that the model's learned representations of semantic similarity match the similarity that matters for your use case. That assumption is often wrong in domain-specific contexts. A general-purpose embedding model trained on web text will have learned that "cardiac arrest" and "heart attack" are similar, but it may not have learned that, in a specific clinical protocol document, "arrest" has a narrower technical definition than "attack," and that the distinction carries clinical significance. The retrieval will surface documents that look semantically relevant to a layperson's query but are subtly wrong for the clinical context.


Domain-specific fine-tuning of embedding models addresses this, but most teams don't do it because the data collection, fine-tuning pipeline, and evaluation harness are all non-trivial investments. The shortcut — and it is a shortcut — is retrieval augmented with a reranking step. Retrieve a larger candidate set using the general-purpose embeddings, then pass those candidates through a cross-encoder reranker that scores query-document relevance more accurately by attending to both simultaneously rather than encoding them independently. The cross-encoder is slower and can't be used for the initial retrieval (you can't run it against a million documents), but over a candidate set of 20–50 documents, it's fast enough and meaningfully more accurate. This two-stage architecture is the current practical optimum for most production systems — not elegant, but it works.
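The two-stage shape is simple enough to sketch in a few lines. Here `rerank_fn` is a placeholder for whatever cross-encoder your stack provides (a Sentence Transformers cross-encoder, a hosted rerank API, or an LLM call), and the index is a plain list of `(doc_id, embedding)` pairs rather than a real vector store.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_vec, index, rerank_fn,
                       k_candidates=50, k_final=5):
    """Stage 1: cheap dense retrieval over the full index.
    Stage 2: expensive, more accurate reranking over the small
    candidate set only — the cross-encoder never sees the
    million-document corpus.
    """
    candidates = sorted(index,
                        key=lambda d: cosine(query_vec, d[1]),
                        reverse=True)[:k_candidates]
    reranked = sorted(candidates,
                      key=lambda d: rerank_fn(d[0]),
                      reverse=True)
    return [doc_id for doc_id, _ in reranked[:k_final]]
```

The important property is that the expensive scorer's cost scales with `k_candidates`, not with corpus size, which is what makes the architecture viable in production.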


Stale embeddings are a different flavor of the same problem. Your corpus changes. Documents get updated. New information supersedes old. If your embedding index isn't kept current, the retriever is operating on a representation of the world that no longer exists. The model reads stale context and generates an answer that was accurate six months ago. This is particularly punishing in domains like financial regulation, drug interactions, or software versioning, where the delta between last month's truth and this month's truth can be both large and consequential.


The fix is incremental indexing with invalidation — keeping track of document modification timestamps, flagging changed or deleted documents for re-embedding, running differential index updates rather than full rebuilds. Again: not glamorous. Pure data engineering. And genuinely necessary.
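The bookkeeping at the heart of that fix can be sketched as a diff between what the index recorded and what the corpus now contains. The actual embedding and vector-store calls are elided; this shows only the invalidation logic.

```python
def plan_reindex(index_state, corpus):
    """Compute a differential index update.

    index_state: {doc_id: timestamp when it was last embedded}
    corpus:      {doc_id: current modification timestamp}

    Returns (to_embed, to_delete): new or modified docs that need
    (re-)embedding, and deleted docs to purge from the index.
    """
    to_embed = [doc for doc, ts in corpus.items()
                if doc not in index_state or ts > index_state[doc]]
    to_delete = [doc for doc in index_state if doc not in corpus]
    return to_embed, to_delete
```

Run on a schedule, this replaces the full-rebuild-every-week pattern with an update whose cost is proportional to how much actually changed.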


Ranking deserves its own examination because it's the step that most directly determines what the model actually reads, and most systems treat it as an afterthought.


The default retrieval mechanism in virtually every vector store tutorial is approximate nearest neighbor search over embeddings — find the k vectors closest to the query vector in the embedding space. This is useful, but it captures semantic similarity, not relevance in the sense that matters for answering a specific question. A document might be semantically close to a query while containing information that's tangential, outdated, or scoped to a different population than the one the user is asking about.


Hybrid retrieval — combining dense retrieval (embeddings) with sparse retrieval (BM25 or similar term-frequency-based methods) — often outperforms either alone, particularly for queries that include rare terms, proper nouns, or domain-specific jargon that the embedding model may have underrepresented in its learned space. The combination is not guaranteed to be better in every case. There are domains and query distributions where pure dense retrieval wins. But the general guidance, supported by the literature and by practical experience, is to default to hybrid and measure from there.
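One common way to combine the two rankings is reciprocal rank fusion (RRF), which merges ranked lists without needing the dense and sparse scores to be on comparable scales. A minimal sketch, where each input is just a list of doc IDs in rank order:

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_n=5):
    """Fuse multiple ranked lists (e.g. dense ANN results and BM25
    results) into one. Each document's fused score is the sum of
    1 / (k + rank) across the lists it appears in; k=60 is the
    constant from the original RRF formulation.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because RRF only looks at ranks, a document that both retrievers place near the top beats a document that only one retriever loves, which is usually the behavior you want from a hybrid default.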


What you should measure, specifically, is retrieval precision and recall before you touch generation quality metrics. This is where many teams make a critical evaluation mistake: they measure end-to-end answer quality — does the final generated answer match the ground truth? — without decomposing the failure into retrieval failure versus generation failure. If retrieval is returning the right documents 40% of the time, improving the prompt won't fix the remaining 60%. But if you're not instrumenting the retrieval layer independently, you won't know that. You'll just know the answers are wrong, and you'll keep tweaking the prompt.


Instrument everything. Log which documents were retrieved for each query. Log the relevance scores. Build an evaluation set that includes expected retrievals, not just expected answers. This is the only way to know whether you have a retrieval problem or a generation problem, and getting that distinction right is the difference between spending three weeks on prompt engineering that doesn't move the needle and three days refactoring your chunking strategy that does.


Metadata is the piece that the literature mentions in passing, and practitioners learn to respect through pain.


Documents don't exist in isolation. They have provenance — when they were written, by whom, under what authority, for what audience. They have scope — this document applies to product version 3.2, this one to the European regulatory regime, this one to enterprise customers only. They have recency — this is the current specification, that's the deprecated one. When none of this metadata is surfaced in the retrieval layer, the vector store treats all documents as equally plausible sources for any query. A user asking about the current return policy might retrieve a document from 2019. A user asking about a feature in version 4.0 might retrieve a document scoped to version 2.x.


The solution isn't complicated — attach structured metadata to every chunk at ingestion time, expose that metadata as filterable attributes in the retrieval layer, and build query routing logic that applies the appropriate filters before semantic search. What's hard is the data work upstream: standardizing metadata across document sources, establishing taxonomies, building pipelines that correctly propagate document-level metadata to the chunk level. This is, again, data engineering. It's also exactly the kind of work that determines whether a RAG system is actually useful or merely impressive in demos.
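The filter-then-rank shape looks roughly like this. The exact-match filter syntax is a simplified stand-in for whatever your vector store actually exposes, and the index is again a plain list rather than a real store.

```python
import math

def filtered_search(query_vec, index, filters, top_k=5):
    """Apply structural metadata filters before semantic ranking.

    index:   list of {"id", "embedding", "meta"} dicts
    filters: {field: required_value}, applied as exact matches
             before any similarity computation happens.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    eligible = [d for d in index
                if all(d["meta"].get(f) == v for f, v in filters.items())]
    ranked = sorted(eligible,
                    key=lambda d: cosine(query_vec, d["embedding"]),
                    reverse=True)
    return [d["id"] for d in ranked[:top_k]]
```

The point is that the 2019 return policy or the version-2.x document never enters the similarity competition at all: scope is enforced structurally, not hoped for semantically.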


A careful builder — someone who's been burned by a few of these failure modes — would walk into Monday morning with a specific checklist.


First: audit the chunking strategy against the actual document structure of the corpus. Are there semantic boundaries being crossed? Are claims being separated from their qualifying conditions? Is there enough overlap to handle context that spans chunk boundaries?


Second: profile the retrieval layer independently. Build a small evaluation set of 50–100 query-document pairs where the expected retrieved document is known. Measure recall@k. If recall@5 is below 0.7, the retrieval problem is severe enough that nothing downstream will save you.
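The metric itself is trivial to compute once the evaluation set exists, which is rather the point: the hard part is labeling the expected retrievals, not the arithmetic. A minimal sketch, assuming one expected document per query:

```python
def recall_at_k(results, expected, k=5):
    """Fraction of queries whose expected document appears in the
    top-k retrieved list.

    results:  {query: [retrieved doc_ids, in rank order]}
    expected: {query: the doc_id that should have been retrieved}
    """
    if not expected:
        return 0.0
    hits = sum(1 for query, doc in expected.items()
               if doc in results.get(query, [])[:k])
    return hits / len(expected)
```

Queries with multiple acceptable documents need a set-valued variant, but even this single-label version is enough to tell you whether you're above or below the 0.7 line.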


Third: add a reranking stage if there isn't one. A hosted reranker like Cohere's, a cross-encoder from the Sentence Transformers library, or even a secondary LLM call scoring candidate relevance — any of these will outperform naive cosine similarity ranking for most non-trivial corpora.


Fourth: check the freshness of the index. When was it last updated? Are there mechanisms for incremental re-embedding of changed documents, or is this a full-rebuild-every-week situation? The answer shapes how stale your system's world-model is, and therefore how much you can trust its answers on time-sensitive topics.


Fifth: look at the metadata schema and ask honestly whether the retrieval layer is using it or ignoring it. If every query hits the full corpus regardless of scope, version, or recency filters, you're relying on the embedding model to figure out relevance that you could have enforced structurally.


None of this is AI. All of it is necessary.


The framing of RAG as an AI problem is understandable — AI is where the interest is, where the funding is, where the narrative gravity pulls. But it's a framing that leads teams to optimize the wrong layer. The language model in a RAG system is, most of the time, not the bottleneck. It will faithfully synthesize whatever you hand it. The question is whether what you hand it is any good.


Build the retrieval pipeline like it matters. It does.

