RAG went from research paper acronym to “every slide deck ever” in about a year. The pattern is simple on paper:
Embed documents → embed question → find similar chunks → feed to LLM.
But if you’ve tried building something non-trivial (say, a domain expert assistant or an internal knowledge bot), you already know the bad news:
- Chunking is either too small (you lose context) or too big (you drag in noise)
- Vector DB queries return “something vaguely related” instead of exactly what you need
- The model makes stuff up when retrieval fails, but does it confidently
- Every blog post explains the concept, almost none show how to tune the system
Spring AI actually ships a pretty solid set of primitives for doing RAG properly. The goal of this article is to walk through those pieces as an end-to-end workflow, and show where you can twist the knobs in real projects.
We’ll follow the real life-cycle of a RAG system:
- Indexing / ETL – document ingestion, cleaning, chunking, metadata
- Vectorization & storage – embeddings, vector DBs, batching
- Retrieval – pre-retrieval query shaping, semantic search, filters, merging
- Generation – query + context orchestration, error handling, and advisors
- Tuning & advanced patterns – thresholds, chunk sizes, hybrid retrieval, and more
All examples are in Java/Spring, but the ideas carry over to any stack.
Step 1: ETL – Turning Messy Docs into AI-Native Knowledge
A lot of RAG failures are caused before the first token hits the LLM: the documents themselves are a mess.
Make “AI-native” documents first
If you’re serious about RAG, stop thinking of your PDFs as ground truth. Instead, think in terms of AI-native documents:
- Clear hierarchical structure (headings, subheadings)
- Long lists split into logical groups
- Consistent terminology
- Minimal noise: watermarks, giant tables, screenshots of code, etc.
You can absolutely use an LLM offline to normalize docs into a clean Markdown or HTML format before they ever hit Spring AI.
Spring AI’s document model
Spring AI wraps content as a Document:
- content: the text that will eventually be embedded
- metadata: a Map<String, Object> with arbitrary keys (tags like type, year, status, source…)
- optional attachments (for multi-modal scenarios)
The ETL pipeline is built around three interfaces:
- DocumentReader – Extract: load raw docs from somewhere
- DocumentTransformer – Transform: split, enrich, format
- DocumentWriter – Load: write into a vector DB, file, Redis, etc.
Extract: DocumentReader
Spring AI ships readers for JSON, text, Markdown, PDF, HTML, and more. For example, a JSON reader that pulls specific paths with JSON Pointer:
@Component
public class ProductJsonReader {
private final Resource resource;
public ProductJsonReader(@Value("classpath:products.json") Resource resource) {
this.resource = resource;
}
public List<Document> read() {
JsonDocumentReaderConfig config = JsonDocumentReaderConfig.builder()
.withPointers("/products/*/description")
.withAdditionalMetadata("source", "product-catalog")
.build();
JsonDocumentReader reader = new JsonDocumentReader(resource, config);
return reader.get();
}
}
The same pattern applies for Markdown, PDFs, emails, videos, GitHub docs, databases, etc. Each reader turns your “source world” into List<Document>.
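For instance, here is a minimal sketch of the same Extract step with the plain-text reader (the notes.txt resource and the source tag are placeholders for illustration):

```java
// Sketch: plain-text variant of the Extract step.
Resource resource = new ClassPathResource("notes.txt"); // placeholder file
TextReader textReader = new TextReader(resource);
textReader.getCustomMetadata().put("source", "notes.txt"); // tag every Document it produces
List<Document> docs = textReader.get(); // DocumentReader is a Supplier<List<Document>>
```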
Transform: chunking, enrichment, formatting
This is where most of the interesting tuning happens.
1. Chunking with TokenTextSplitter
TokenTextSplitter is the workhorse splitter based on token counts + simple heuristics (sentence boundaries, newlines, etc.). It’s a DocumentTransformer, so you can stack it with others:
@Component
public class SmartChunkTransformer {
public List<Document> split(List<Document> docs) {
// slightly smaller chunks than the defaults
TokenTextSplitter splitter = new TokenTextSplitter(
700, // target tokens per chunk
280, // min chars before we try to break
8, // min chunk length to embed
8000, // max number of chunks
true // keep separators like newlines
);
return splitter.apply(docs);
}
}
Play with chunk size per use-case:
- Legal or design specs → longer chunks, you need intact reasoning blocks
- Chatty Q&A or support docs → shorter chunks, you want sharp topical slices
2. Metadata enrichment with an LLM
Don’t rely only on raw text similarity. Use an LLM once during ingestion to extract semantic features into metadata, then filter on them cheaply at query time.
Spring AI has KeywordMetadataEnricher and SummaryMetadataEnricher, both built on a ChatModel:
@Component
public class KeywordEnricher {
private final ChatModel chatModel;
public KeywordEnricher(ChatModel chatModel) {
this.chatModel = chatModel;
}
public List<Document> enrich(List<Document> docs) {
KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5); // 5 keywords per chunk, stored under "excerpt_keywords"
return enricher.apply(docs);
}
}
You can chain it right after chunking: split → add keywords → maybe add summaries.
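A minimal sketch of that chain (the keyword count and the CURRENT-only summary setting are assumptions, tune them per corpus):

```java
// Sketch: chunk first, then layer metadata enrichers on the chunks.
List<Document> chunks = new TokenTextSplitter().apply(rawDocs);

KeywordMetadataEnricher keywords = new KeywordMetadataEnricher(chatModel, 5); // 5 keywords per chunk
SummaryMetadataEnricher summaries = new SummaryMetadataEnricher(chatModel,
        List.of(SummaryMetadataEnricher.SummaryType.CURRENT)); // summarize each chunk itself

List<Document> enriched = summaries.apply(keywords.apply(chunks));
```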
3. DefaultContentFormatter – underused but powerful
DefaultContentFormatter lets you control exactly how metadata + content are stitched into the final string that goes into the embedding or the prompt.
Example:
DefaultContentFormatter formatter = DefaultContentFormatter.builder()
    .withMetadataTemplate("{key}: {value}")
    .withMetadataSeparator("\n")
    .withTextTemplate("{metadata_string}\n{content}")
    // Don’t leak embedding IDs into prompts
    .withExcludedInferenceMetadataKeys("embedding_id", "vector_id")
    .build();
Use it when you want the LLM to see things like:
type: love_advice
status: single
year: 2025
How can I make myself more attractive when dating? ...
instead of a bare blob of text.
Load: DocumentWriter & ETL wiring
DocumentWriter is just a Consumer<List<Document>>. Spring AI ships FileDocumentWriter for plain files, and every VectorStore doubles as a writer (it extends DocumentWriter) for embedding + vector-DB storage.
A minimal ETL could look like this:
@Component
public class KnowledgeBaseIndexer {
private final ProductJsonReader reader;
private final SmartChunkTransformer splitter;
private final KeywordEnricher enricher;
private final VectorStore vectorStore;
public KnowledgeBaseIndexer(ProductJsonReader reader,
SmartChunkTransformer splitter,
KeywordEnricher enricher,
VectorStore vectorStore) {
this.reader = reader;
this.splitter = splitter;
this.enricher = enricher;
this.vectorStore = vectorStore;
}
public void rebuildIndex() {
List<Document> raw = reader.read();
List<Document> chunks = splitter.split(raw);
List<Document> enriched = enricher.enrich(chunks);
vectorStore.add(enriched);
}
}
This pipeline alone already puts you ahead of most “we just embedded everything once” demos.
Step 2: Vectorization & Storage – Choosing Your Retrieval Backbone
Once you have clean, chunked, enriched documents, you need a place to put them.
Spring AI’s VectorStore interface is intentionally simple:
public interface VectorStore extends DocumentWriter {
void add(List<Document> documents);
void delete(List<String> ids);
void delete(Filter.Expression filterExpression);
List<Document> similaritySearch(SearchRequest request);
default String getName() {
return getClass().getSimpleName();
}
}
The important bit for tuning is SearchRequest:
SearchRequest request = SearchRequest.builder()
.query("How does Spring AI handle RAG?")
.topK(5)
.similarityThreshold(0.75)
.filterExpression("category == 'spring-ai' && year >= '2024'")
.build();
List<Document> docs = vectorStore.similaritySearch(request);
Which vector store?
Spring AI ships starters for many backends: in-memory, Redis, Elasticsearch, PGVector, Qdrant, and more. Spring AI Alibaba adds cloud-native options on top of these, such as DashScopeCloudStore backed by DashScope.
For backend-heavy Java shops, PGVector on PostgreSQL is incredibly pragmatic:
- You probably already run Postgres
- You keep relational + vector data in one place
- No extra infrastructure team to convince
PGVector with Spring AI (hand-rolled config)
Instead of relying on auto-config, you can wire PgVectorStore yourself and pick exactly which EmbeddingModel you want:
<!-- pom.xml -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pgvector-store</artifactId>
<version>1.0.0-M7</version>
</dependency>
# application.yml
spring:
datasource:
url: jdbc:postgresql://YOUR_HOST:5432/rag_demo
username: rag_user
password: super_secret
ai:
vectorstore:
pgvector:
index-type: HNSW
distance-type: COSINE_DISTANCE
# dimensions: 1536 # omit to let it derive from the embedding model
Now the config class:
@Configuration
public class PgVectorConfig {
@Bean
public VectorStore pgVectorStore(JdbcTemplate jdbcTemplate,
@Qualifier("dashscopeEmbeddingModel")
EmbeddingModel embeddingModel) {
return PgVectorStore.builder(jdbcTemplate, embeddingModel)
.dimensions(1536) // match your embedding model
.distanceType(PgDistanceType.COSINE_DISTANCE)
.build();
}
}
A common gotcha: if you use multiple EmbeddingModel beans (e.g., Ollama + DashScope), make sure you qualify the one you actually want for this store.
BatchingStrategy – don’t blow up your embed API
Embedding thousands of chunks in a single call will eventually hit context window or rate limits. Spring AI’s BatchingStrategy lets you split documents into sane batches before embedding:
@Configuration
public class EmbeddingBatchConfig {
@Bean
public BatchingStrategy batchingStrategy() {
return new TokenCountBatchingStrategy(
EncodingType.CL100K_BASE,
8192, // max tokens per batch
0.15 // leave some safety headroom
);
}
}
You can also implement your own BatchingStrategy if your vector DB has, for example, hard throughput limits and you want to throttle inserts explicitly.
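A sketch of what a hand-rolled strategy could look like, simply capping batch size by document count (the 64-document cap is an arbitrary assumption):

```java
// Sketch: fixed-size batches instead of token counting.
public class FixedSizeBatchingStrategy implements BatchingStrategy {

    private static final int BATCH_SIZE = 64; // arbitrary cap, tune for your backend

    @Override
    public List<List<Document>> batch(List<Document> documents) {
        List<List<Document>> batches = new ArrayList<>();
        for (int i = 0; i < documents.size(); i += BATCH_SIZE) {
            batches.add(documents.subList(i, Math.min(i + BATCH_SIZE, documents.size())));
        }
        return batches;
    }
}
```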
Step 3: Retrieval – Query Shaping, Filters, and Result Merging
Once your store is populated, you still can’t just do similaritySearch(userText) and call it a day.
Spring AI breaks retrieval into pre-retrieval, retrieval, and post-retrieval stages.
Pre-retrieval: shaping the query
RewriteQueryTransformer – clean up messy user queries
Users don’t speak like search queries. RewriteQueryTransformer uses an LLM to rewrite a noisy query into something more explicit and model-friendly.
@Component
public class QueryRewriter {
private final QueryTransformer transformer;
public QueryRewriter(ChatModel chatModel) {
ChatClient.Builder builder = ChatClient.builder(chatModel);
this.transformer = RewriteQueryTransformer.builder()
.chatClientBuilder(builder)
.build();
}
public Query rewrite(String text) {
return transformer.transform(new Query(text));
}
}
Plug this into your RAG pipeline right before you call the retriever.
TranslationQueryTransformer – cross-language users, single-language embeddings
If your embedding model is English-only but your users speak Chinese, Spanish, etc., you can stick a TranslationQueryTransformer in front. It’s literally “LLM-as-translation-layer” – simple but not cheap. For production, many teams prefer a dedicated translation API + custom transformer.
CompressionQueryTransformer – distill long chat history
Multi-turn chats tend to accumulate context. CompressionQueryTransformer compresses history + latest user message into one standalone query. Perfect when you use conversation history but your vector search only sees the final “intent”.
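A minimal sketch (the conversation messages are made up; Query's builder with text and history follows recent Spring AI versions):

```java
// Sketch: collapse chat history + latest message into one standalone query.
QueryTransformer compressor = CompressionQueryTransformer.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .build();

Query conversational = Query.builder()
        .text("And what about the second option?")
        .history(new UserMessage("Which vector stores does Spring AI support?"),
                 new AssistantMessage("PGVector, Redis, Elasticsearch, Qdrant, ..."))
        .build();

Query standalone = compressor.transform(conversational); // self-contained query, ready for retrieval
```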
Retrieval: VectorStoreDocumentRetriever
The main entry point for document retrieval is DocumentRetriever. For vector-based RAG, you usually use VectorStoreDocumentRetriever:
FilterExpressionBuilder b = new FilterExpressionBuilder();

DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
    .vectorStore(vectorStore)
    .similarityThreshold(0.6)
    .topK(4)
    .filterExpression(b.and(
            b.eq("type", "love_advice"),
            b.eq("status", "single"))
        .build())
    .build();

List<Document> docs = retriever.retrieve(new Query("How can I make myself more attractive?"));
Filters here are metadata filters, not semantic. This is why earlier enrichment pays off: you can reduce the search space to just “love_advice + single” before even running similarity.
You can also pass the filter expression dynamically via Query.context if you need per-request logic.
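For example, a sketch of a per-request filter, passing the expression as a string in the query context (you could equally pass a Filter.Expression built with FilterExpressionBuilder):

```java
// Sketch: override the retriever's default filter for a single request.
Query query = Query.builder()
        .text("How can I make a better first impression?")
        .context(Map.of(VectorStoreDocumentRetriever.FILTER_EXPRESSION,
                        "type == 'love_advice' && status == 'dating'"))
        .build();

List<Document> docs = retriever.retrieve(query);
```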
Document merging: ConcatenationDocumentJoiner
If you use multiple retrievers (multi-query, multi-source, hybrid search), you’ll end up with many document lists. ConcatenationDocumentJoiner deduplicates and flattens them:
Map<Query, List<List<Document>>> docsPerQuery = ...;
DocumentJoiner joiner = new ConcatenationDocumentJoiner();
List<Document> merged = joiner.join(docsPerQuery);
Under the hood it’s basically:
- Flatten the Map<Query, List<List<Document>>> into a single list
- Deduplicate by document ID
- Preserve scores
It’s simple but exactly what you want right before handing everything to the generation stage.
Step 4: Generation – Advisors, Context, and Error Handling
Spring AI’s RAG story really becomes ergonomic when you wire it into ChatClient advisors.
QuestionAnswerAdvisor – good default, minimal ceremony
QuestionAnswerAdvisor is the fastest way to get RAG working:
Advisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.searchRequest(SearchRequest.builder()
.similarityThreshold(0.7)
.topK(5)
.build())
.build();
String answer = chatClient.prompt()
.user("How can I improve my social charm within three months?")
.advisors(qaAdvisor)
.call()
.content();
The advisor:
- Takes the user message
- Runs a vector search
- Stitches docs + question into a prompt
- Calls the model
You can override the prompt template if you want strict instructions (“only answer using the context above, otherwise say you don’t know”).
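For example, a sketch of a stricter template (the builder's promptTemplate(...) hook and the placeholder names follow recent Spring AI versions; double-check them against the version you run):

```java
// Sketch: force grounded answers. Keep the placeholders the advisor expects.
PromptTemplate strictTemplate = new PromptTemplate("""
        {query}

        Answer using ONLY the context below. If the context does not contain
        enough information, say you don't know.

        ---------------------
        {question_answer_context}
        ---------------------
        """);

Advisor strictQaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
        .promptTemplate(strictTemplate)
        .build();
```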
RetrievalAugmentationAdvisor – full modular RAG graph
For more control you switch to RetrievalAugmentationAdvisor. It lets you explicitly plug in:
- query transformers (rewrite, compress, translate)
- document retriever (vector, hybrid, custom)
- query augmenter (how to inject context into the final prompt)
Example with a query rewriter + vector retriever:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.queryTransformers(RewriteQueryTransformer.builder()
.chatClientBuilder(ChatClient.builder(chatModel))
.build())
.documentRetriever(VectorStoreDocumentRetriever.builder()
.vectorStore(vectorStore)
.similarityThreshold(0.55)
.topK(6)
.build())
.build();
Then:
String reply = chatClient.prompt()
.user("what's the advice of career?")
.advisors(ragAdvisor)
.call()
.content();
ContextualQueryAugmenter – what to do when retrieval finds nothing
By default, RetrievalAugmentationAdvisor is conservative: if retrieval returns no documents, it replaces your user query with an "out of knowledge base, please refuse" style prompt.
You can customize this using ContextualQueryAugmenter:
PromptTemplate emptyContextTemplate = new PromptTemplate("""
You are a relationship advice assistant.
The current question is outside your knowledge base.
Please respond briefly and politely in English, telling the user:
You can only answer relationship-related questions,
and invite them to describe their situation more specifically.
""");
ContextualQueryAugmenter augmenter = ContextualQueryAugmenter.builder()
    // false (the default) routes empty retrievals to the template above;
    // true would pass the original question through untouched instead
    .allowEmptyContext(false)
    .emptyContextPromptTemplate(emptyContextTemplate)
    .build();
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
    .documentRetriever(retriever)
    .queryAugmenter(augmenter)
    .build();
This gives you graceful degradation instead of raw hallucination.
### A custom advisor factory for a “love coach” bot
Putting it all together, you can hide the complexity behind a small factory:
```java
public final class LoveCoachAdvisorFactory {
private LoveCoachAdvisorFactory() {}
public static Advisor forStatus(VectorStore store, String status) {
DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
.vectorStore(store)
.similarityThreshold(0.55)
.topK(4)
.filterExpression(new FilterExpressionBuilder()
.eq("type", "love_advice")
.eq("status", status)
.build())
.build();
ContextualQueryAugmenter augmenter =
LoveAppContextualQueryAugmenterFactory.createInstance();
return RetrievalAugmentationAdvisor.builder()
.documentRetriever(retriever)
.queryAugmenter(augmenter)
.build();
}
}
```
Your chat layer doesn’t need to know anything about vector stores anymore; it just picks an advisor based on the user profile.
Step 5: Tuning Playbook – Making RAG Not Suck
Now to the parts you usually end up rediscovering the hard way.
1. Document strategy first, everything else later
If your knowledge base is incomplete or badly structured, no amount of thresholds or LLM trickery will save you.
Checklist:
- Do we actually have all the content users will ask about?
- Are docs written in the same language users will query with?
- Are critical terms consistently named?
- Did we strip junk (watermarks, repeated headers, screenshot-only pages)?
When in doubt, run offline retrieval tests: generate 50–100 realistic questions and see what the retriever actually surfaces.
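Even a crude loop like this sketch tells you a lot (the questions are whatever you and your domain experts consider realistic):

```java
// Sketch: eyeball what the retriever actually surfaces for realistic questions.
List<String> testQuestions = List.of(
        "How do I rebuild the index after adding new documents?",
        "What does the similarity threshold actually control?");

for (String question : testQuestions) {
    List<Document> hits = vectorStore.similaritySearch(SearchRequest.builder()
            .query(question)
            .topK(5)
            .build());
    System.out.println("Q: " + question);
    // getText() in current Spring AI; older milestones call it getContent()
    hits.forEach(doc -> System.out.println("  -> " + doc.getMetadata().get("source")
            + " | " + doc.getText().substring(0, Math.min(80, doc.getText().length()))));
}
```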
2. Chunking: avoid both over- and under-splitting
Bad chunking shows up as:
- “Half-answer” – the retrieved chunk hints at the right area but cuts off the crucial line
- “Wall of text” – the chunk is huge, includes multiple topics, and confuses the model
Practical patterns:
- Start with something like 500–800 tokens per chunk + overlaps only if necessary
- Prefer semantic or structural boundaries (headings, sentences) to “every N tokens”
- Let cloud vendors’ “smart split” do the heavy lifting when available, but still spot-check
If you use Cloud Model Studio, enabling intelligent chunking on the knowledge base will apply a similar strategy: first split by sentence markers, then adapt chunk boundaries by semantic coherence instead of length alone. You still need to manually fix any obvious mis-splits in the console.
3. Metadata: design it like an index, not an afterthought
Good metadata makes filtering trivial:
- type: faq, tutorial, policy, love_advice, interior_design...
- status: single, dating, married, divorced...
- year / version: so you can prefer newer material
- source: doc name, system, product
Implement metadata as close to the source of truth as possible (e.g. in your CMS or docs repo), then enrich with AI-only fields such as keywords or summary during ingestion.
4. Tuning similarityThreshold and topK
This is where a lot of “RAG feels off” comes from.
- If results feel incomplete (missing relevant chunks) → lower similarityThreshold and/or raise topK
- If results feel noisy (irrelevant chunks) → raise similarityThreshold and/or lower topK
Rules of thumb:
- Start around similarityThreshold = 0.6–0.7, topK = 4–6
- For narrow, highly structured domains, go higher (0.75+)
- For very heterogeneous, noisy docs, go lower but compensate with stronger metadata filters
Always test with a fixed set of labeled queries so you can see whether tuning helps or hurts.
5. Hallucination and refusal behavior
Even with perfect retrieval you’ll get edge cases. Mitigation options:
- In the RAG prompt, explicitly instruct: “If the context doesn’t contain enough information to answer, say you don’t know.”
- Use ContextualQueryAugmenter with a custom empty-context template that refuses out-of-domain questions gracefully
- For high-risk domains (legal/medical), consider separating general LLM chat from knowledge-base-backed Q&A and forcing a refusal when retrieval is empty or very low-score
6. Multi-query expansion – use, but don’t abuse
MultiQueryExpander can boost recall by generating paraphrased queries, but:
- Every extra query costs more retrieval + more LLM time
- It’s often overkill if your docs and metadata are well-structured
If you use it, limit to 3–5 variants, deduplicate aggressively, and monitor cost and latency.
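If you do reach for it, a minimal sketch looks like this (three variants; the example question is arbitrary):

```java
// Sketch: paraphrase the query a few times to boost recall, at extra cost.
MultiQueryExpander expander = MultiQueryExpander.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .numberOfQueries(3) // keep this small; every variant is another retrieval
        .build();

List<Query> variants = expander.expand(new Query("How do I tune the similarity threshold?"));
```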
Step 6: Beyond Basics – Hybrid Retrieval & Higher-Level Architectures
Once the basics are solid, you can start layering in more advanced patterns.
Hybrid retrieval: vector + lexical + structured
No single retrieval method is perfect:
- Vector search understands meaning, but can miss strict keyword or year constraints
- Full-text / lexical search (think Elasticsearch) nails exact matches and numeric filters, but doesn’t get synonyms
- Structured search via metadata is cheap and precise, but only as good as your schema
A robust system typically combines them:
- Filter by metadata (type, status, year)
- Run vector search on the reduced candidate set
- Optionally mix in keyword search for exact matches on IDs, names, etc.
- Merge and re-rank
Spring AI doesn’t force a single pattern—DocumentRetriever is just an interface. You can write your own “hybrid retriever” that fans out to both vector store and, say, Elasticsearch, then uses ConcatenationDocumentJoiner + custom scoring.
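A minimal sketch of that idea, assuming you already have some lexicalRetriever of your own (e.g. an Elasticsearch-backed DocumentRetriever):

```java
// Sketch: fan out to vector + lexical retrieval, then merge with the joiner.
public class HybridDocumentRetriever implements DocumentRetriever {

    private final DocumentRetriever vectorRetriever;
    private final DocumentRetriever lexicalRetriever; // assumption: your own ES-backed retriever
    private final DocumentJoiner joiner = new ConcatenationDocumentJoiner();

    public HybridDocumentRetriever(DocumentRetriever vectorRetriever,
                                   DocumentRetriever lexicalRetriever) {
        this.vectorRetriever = vectorRetriever;
        this.lexicalRetriever = lexicalRetriever;
    }

    @Override
    public List<Document> retrieve(Query query) {
        List<List<Document>> resultSets = List.of(
                vectorRetriever.retrieve(query),
                lexicalRetriever.retrieve(query));
        // ConcatenationDocumentJoiner flattens and deduplicates by document ID.
        return joiner.join(Map.of(query, resultSets));
    }
}
```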
Re-ranking and multi-stage pipelines
For large corpora, you may want a two-stage retrieval:
- Fast, approximate search (HNSW index, low-dimensional embeddings)
- Slow, precise re-ranking with a cross-encoder model (e.g., a reranker that scores each (query, chunk) pair)
The first stage optimizes recall, the second optimizes precision. Spring AI’s modular design makes it straightforward to put the reranker into the post-retrieval step before sending docs to the LLM.
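A minimal second-stage sketch, with a stand-in scorer (CrossEncoderScorer is hypothetical; swap in whatever reranker you actually call):

```java
// Sketch: re-rank first-stage candidates with a cross-encoder-style scorer.
interface CrossEncoderScorer {
    double score(String query, String passage); // hypothetical: local model, HTTP API, ...
}

public class CrossEncoderReranker {

    private final CrossEncoderScorer scorer;
    private final int keepTopN;

    public CrossEncoderReranker(CrossEncoderScorer scorer, int keepTopN) {
        this.scorer = scorer;
        this.keepTopN = keepTopN;
    }

    public List<Document> rerank(Query query, List<Document> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document doc) -> scorer.score(query.text(), doc.getText())).reversed())
                .limit(keepTopN)
                .toList();
    }
}
```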
Architecture patterns: central knowledge service
At system level, a neat pattern is to isolate RAG into its own service:
- A knowledge service owns ETL, vector stores, filters, and advisors
- Other microservices call it via a simple API:
answerQuestion(userId, query) - The knowledge service picks the right advisor (e.g., based on tenant, domain, user status)
This gives you:
- One place to tune thresholds and test recall
- One place to monitor cost and latency
- One place to upgrade embedding models or vector DBs
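A minimal sketch of such a facade (UserProfileRepository and its findStatus method are placeholders for however you look up the user; the advisor factory is the one from Step 4):

```java
// Sketch: one service owns RAG internals; callers only see answerQuestion().
@Service
public class KnowledgeService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final UserProfileRepository profiles; // placeholder for your user lookup

    public KnowledgeService(ChatClient.Builder chatClientBuilder,
                            VectorStore vectorStore,
                            UserProfileRepository profiles) {
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
        this.profiles = profiles;
    }

    public String answerQuestion(String userId, String query) {
        String status = profiles.findStatus(userId); // e.g. "single", "dating"
        Advisor advisor = LoveCoachAdvisorFactory.forStatus(vectorStore, status);
        return chatClient.prompt()
                .user(query)
                .advisors(advisor)
                .call()
                .content();
    }
}
```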
Wrap-up
RAG is not just “add embeddings, stir with LLM, ship.” It’s a pipeline:
- Document strategy decides your ceiling
- ETL decides whether you actually reach that ceiling
- Vector store + retrieval decide how much of the right stuff you bring back
- Advisors + prompts decide how safely your model uses that stuff
Spring AI gives Java developers real building blocks instead of a giant black box: DocumentReader / DocumentTransformer / DocumentWriter, VectorStore, DocumentRetriever, Advisor, and a bunch of utilities around them.
If you treat these as serious, tunable components instead of “just configuration,” your RAG system will stop feeling like a fragile demo and start feeling like a real product.
