RAG went from research paper acronym to “every slide deck ever” in about a year. The pattern is simple on paper:
Embed documents → embed question → find similar chunks → feed to LLM.
But if you’ve tried building something non-trivial (say, a domain expert assistant or an internal knowledge bot), you already know the bad news:
- Chunking is either too small (you lose context) or too big (you drag in noise)
- Vector DB queries return “something vaguely related” instead of exactly what you need
- The model makes stuff up when retrieval fails, but does it confidently
- Every blog post explains the concept, almost none show how to tune the system
Spring AI actually ships a pretty solid set of primitives for doing RAG properly. The goal of this article is to walk through those pieces as an end-to-end workflow, and show where you can twist the knobs in real projects.
We’ll follow the real life-cycle of a RAG system:
- Indexing / ETL – document ingestion, cleaning, chunking, metadata
- Vectorization & storage – embeddings, vector DBs, batching
- Retrieval – pre-retrieval query shaping, semantic search, filters, merging
- Generation – query + context orchestration, error handling, and advisors
- Tuning & advanced patterns – thresholds, chunk sizes, hybrid retrieval, and more
All examples are in Java/Spring, but the ideas carry over to any stack.
Step 1: ETL – Turning Messy Docs into AI-Native Knowledge
A lot of RAG failures are caused before the first token hits the LLM: the documents themselves are a mess.
Make “AI-native” documents first
If you’re serious about RAG, stop thinking of your PDFs as ground truth. Instead, think in terms of AI-native documents:
- Clear hierarchical structure (headings, subheadings)
- Long lists split into logical groups
- Consistent terminology
- Minimal noise: watermarks, giant tables, screenshots of code, etc.
You can absolutely use an LLM offline to normalize docs into a clean Markdown or HTML format before they ever hit Spring AI.
Spring AI’s document model
Spring AI wraps content as a Document:
- content: the text that will eventually be embedded
- metadata: a Map<String, Object> with arbitrary keys (tags like type, year, status, source…)
- optional attachments (for multi-modal scenarios)
The ETL pipeline is built around three interfaces:
- DocumentReader – Extract: load raw docs from somewhere
- DocumentTransformer – Transform: split, enrich, format
- DocumentWriter – Load: write into a vector DB, file, Redis, etc.
Extract: DocumentReader
Spring AI ships readers for JSON, text, Markdown, PDF, HTML, and more. For example, a JSON reader that pulls specific paths with JSON Pointer:
@Component
public class ProductJsonReader {
private final Resource resource;
public ProductJsonReader(@Value("classpath:products.json") Resource resource) {
this.resource = resource;
}
public List<Document> read() {
JsonDocumentReaderConfig config = JsonDocumentReaderConfig.builder()
.withPointers("/products/*/description")
.withAdditionalMetadata("source", "product-catalog")
.build();
JsonDocumentReader reader = new JsonDocumentReader(resource, config);
return reader.get();
}
}
The same pattern applies for Markdown, PDFs, emails, videos, GitHub docs, databases, etc. Each reader turns your “source world” into List<Document>.
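For instance, here is a minimal sketch of the same Extract step with the plain-text reader (the notes.txt resource and the source tag are placeholders for illustration):

```java
// Sketch: plain-text variant of the Extract step.
Resource resource = new ClassPathResource("notes.txt"); // placeholder file
TextReader textReader = new TextReader(resource);
textReader.getCustomMetadata().put("source", "notes.txt"); // tag every Document it produces
List<Document> docs = textReader.get(); // DocumentReader is a Supplier<List<Document>>
```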
Transform: chunking, enrichment, formatting
This is where most of the interesting tuning happens.
1. Chunking with TokenTextSplitter
TokenTextSplitter is the workhorse splitter based on token counts + simple heuristics (sentence boundaries, newlines, etc.). It’s a DocumentTransformer, so you can stack it with others:
@Component
public class SmartChunkTransformer {
public List<Document> split(List<Document> docs) {
// slightly smaller chunks than the defaults
TokenTextSplitter splitter = new TokenTextSplitter(
700, // target tokens per chunk
280, // min chars before we try to break
8, // min chunk length to embed
8000, // max number of chunks
true // keep separators like newlines
);
return splitter.apply(docs);
}
}
Play with chunk size per use-case:
- Legal or design specs → longer chunks, you need intact reasoning blocks
- Chatty Q&A or support docs → shorter chunks, you want sharp topical slices
2. Metadata enrichment with an LLM
Don’t rely only on raw text similarity. Use an LLM once during ingestion to extract semantic features into metadata, then filter on them cheaply at query time.
Spring AI has KeywordMetadataEnricher and SummaryMetadataEnricher, both built on a ChatModel:
@Component
public class KeywordEnricher {
private final ChatModel chatModel;
public KeywordEnricher(ChatModel chatModel) {
this.chatModel = chatModel;
}
public List<Document> enrich(List<Document> docs) {
KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5); // 5 keywords per chunk, stored under "excerpt_keywords"
return enricher.apply(docs);
}
}
You can chain it right after chunking: split → add keywords → maybe add summaries.
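A minimal sketch of that chain (the keyword count and the CURRENT-only summary setting are assumptions, tune them per corpus):

```java
// Sketch: chunk first, then layer metadata enrichers on the chunks.
List<Document> chunks = new TokenTextSplitter().apply(rawDocs);

KeywordMetadataEnricher keywords = new KeywordMetadataEnricher(chatModel, 5); // 5 keywords per chunk
SummaryMetadataEnricher summaries = new SummaryMetadataEnricher(chatModel,
        List.of(SummaryMetadataEnricher.SummaryType.CURRENT)); // summarize each chunk itself

List<Document> enriched = summaries.apply(keywords.apply(chunks));
```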
3. DefaultContentFormatter – underused but powerful
DefaultContentFormatter lets you control exactly how metadata + content are stitched into the final string that goes into the embedding or the prompt.
Example:
DefaultContentFormatter formatter = DefaultContentFormatter.builder()
    .withMetadataTemplate("{key}: {value}")
    .withMetadataSeparator("\n")
    .withTextTemplate("{metadata_string}\n{content}")
    // Don’t leak embedding IDs into prompts
    .withExcludedInferenceMetadataKeys("embedding_id", "vector_id")
    .build();
Use it when you want the LLM to see things like:
type: love_advice
status: single
year: 2025
How can I make myself more attractive when dating? ...
instead of a bare blob of text.
Load: DocumentWriter & ETL wiring
DocumentWriter is just a Consumer<List<Document>>. Spring AI ships FileDocumentWriter for plain files, and every VectorStore doubles as a writer (it extends DocumentWriter) for embedding + vector-DB storage.
A minimal ETL could look like this:
@Component
public class KnowledgeBaseIndexer {
private final ProductJsonReader reader;
private final SmartChunkTransformer splitter;
private final KeywordEnricher enricher;
private final VectorStore vectorStore;
public KnowledgeBaseIndexer(ProductJsonReader reader,
SmartChunkTransformer splitter,
KeywordEnricher enricher,
VectorStore vectorStore) {
this.reader = reader;
this.splitter = splitter;
this.enricher = enricher;
this.vectorStore = vectorStore;
}
public void rebuildIndex() {
List<Document> raw = reader.read();
List<Document> chunks = splitter.split(raw);
List<Document> enriched = enricher.enrich(chunks);
vectorStore.add(enriched);
}
}
This pipeline alone already puts you ahead of most “we just embedded everything once” demos.
Step 2: Vectorization & Storage – Choosing Your Retrieval Backbone
Once you have clean, chunked, enriched documents, you need a place to put them.
Spring AI’s VectorStore interface is intentionally simple:
public interface VectorStore extends DocumentWriter {
void add(List<Document> documents);
void delete(List<String> ids);
void delete(Filter.Expression filterExpression);
List<Document> similaritySearch(SearchRequest request);
default String getName() {
return getClass().getSimpleName();
}
}
The important bit for tuning is SearchRequest:
SearchRequest request = SearchRequest.builder()
.query("How does Spring AI handle RAG?")
.topK(5)
.similarityThreshold(0.75)
.filterExpression("category == 'spring-ai' && year >= '2024'")
.build();
List<Document> docs = vectorStore.similaritySearch(request);
Which vector store?
Spring AI ships starters for many backends: in-memory, Redis, Elasticsearch, PGVector, Qdrant, and more. Spring AI Alibaba adds cloud-native options on top of these, such as DashScopeCloudStore backed by DashScope.
For backend-heavy Java shops, PGVector on PostgreSQL is incredibly pragmatic:
- You probably already run Postgres
- You keep relational + vector data in one place
- No extra infrastructure team to convince
PGVector with Spring AI (hand-rolled config)
Instead of relying on auto-config, you can wire PgVectorStore yourself and pick exactly which EmbeddingModel you want:
<!-- pom.xml -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pgvector-store</artifactId>
<version>1.0.0-M7</version>
</dependency>
# application.yml
spring:
datasource:
url: jdbc:postgresql://YOUR_HOST:5432/rag_demo
username: rag_user
password: super_secret
ai:
vectorstore:
pgvector:
index-type: HNSW
distance-type: COSINE_DISTANCE
# dimensions: 1536 # omit to let it derive from the embedding model
Now the config class:
@Configuration
public class PgVectorConfig {
@Bean
public VectorStore pgVectorStore(JdbcTemplate jdbcTemplate,
@Qualifier("dashscopeEmbeddingModel")
EmbeddingModel embeddingModel) {
return PgVectorStore.builder(jdbcTemplate, embeddingModel)
.dimensions(1536) // match your embedding model
.distanceType(PgDistanceType.COSINE_DISTANCE)
.build();
}
}
A common gotcha: if you use multiple EmbeddingModel beans (e.g., Ollama + DashScope), make sure you qualify the one you actually want for this store.
BatchingStrategy – don’t blow up your embed API
Embedding thousands of chunks in a single call will eventually hit context window or rate limits. Spring AI’s BatchingStrategy lets you split documents into sane batches before embedding:
@Configuration
public class EmbeddingBatchConfig {
@Bean
public BatchingStrategy batchingStrategy() {
return new TokenCountBatchingStrategy(
EncodingType.CL100K_BASE,
8192, // max tokens per batch
0.15 // leave some safety headroom
);
}
}
You can also implement your own BatchingStrategy if your vector DB has, for example, hard throughput limits and you want to throttle inserts explicitly.
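A sketch of what a hand-rolled strategy could look like, simply capping batch size by document count (the 64-document cap is an arbitrary assumption):

```java
// Sketch: fixed-size batches instead of token counting.
public class FixedSizeBatchingStrategy implements BatchingStrategy {

    private static final int BATCH_SIZE = 64; // arbitrary cap, tune for your backend

    @Override
    public List<List<Document>> batch(List<Document> documents) {
        List<List<Document>> batches = new ArrayList<>();
        for (int i = 0; i < documents.size(); i += BATCH_SIZE) {
            batches.add(documents.subList(i, Math.min(i + BATCH_SIZE, documents.size())));
        }
        return batches;
    }
}
```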
Step 3: Retrieval – Query Shaping, Filters, and Result Merging
Once your store is populated, you still can’t just do similaritySearch(userText) and call it a day.
Spring AI breaks retrieval into pre-retrieval, retrieval, and post-retrieval stages.
Pre-retrieval: shaping the query
RewriteQueryTransformer – clean up messy user queries
Users don’t speak like search queries. RewriteQueryTransformer uses an LLM to rewrite a noisy query into something more explicit and model-friendly.
@Component
public class QueryRewriter {
private final QueryTransformer transformer;
public QueryRewriter(ChatModel chatModel) {
ChatClient.Builder builder = ChatClient.builder(chatModel);
this.transformer = RewriteQueryTransformer.builder()
.chatClientBuilder(builder)
.build();
}
public Query rewrite(String text) {
return transformer.transform(new Query(text));
}
}
Plug this into your RAG pipeline right before you call the retriever.
TranslationQueryTransformer – cross-language users, single-language embeddings
If your embedding model is English-only but your users speak Chinese, Spanish, etc., you can stick a TranslationQueryTransformer in front. It’s literally “LLM-as-translation-layer” – simple but not cheap. For production, many teams prefer a dedicated translation API + custom transformer.
CompressionQueryTransformer – distill long chat history
Multi-turn chats tend to accumulate context. CompressionQueryTransformer compresses history + latest user message into one standalone query. Perfect when you use conversation history but your vector search only sees the final “intent”.
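A minimal sketch (the conversation messages are made up; Query's builder with text and history follows recent Spring AI versions):

```java
// Sketch: collapse chat history + latest message into one standalone query.
QueryTransformer compressor = CompressionQueryTransformer.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .build();

Query conversational = Query.builder()
        .text("And what about the second option?")
        .history(new UserMessage("Which vector stores does Spring AI support?"),
                 new AssistantMessage("PGVector, Redis, Elasticsearch, Qdrant, ..."))
        .build();

Query standalone = compressor.transform(conversational); // self-contained query, ready for retrieval
```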
Retrieval: VectorStoreDocumentRetriever
The main entry point for document retrieval is DocumentRetriever. For vector-based RAG, you usually use VectorStoreDocumentRetriever:
FilterExpressionBuilder b = new FilterExpressionBuilder();

DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
    .vectorStore(vectorStore)
    .similarityThreshold(0.6)
    .topK(4)
    .filterExpression(b.and(
            b.eq("type", "love_advice"),
            b.eq("status", "single"))
        .build())
    .build();

List<Document> docs = retriever.retrieve(new Query("How can I make myself more attractive?"));
Filters here are metadata filters, not semantic. This is why earlier enrichment pays off: you can reduce the search space to just “love_advice + single” before even running similarity.
You can also pass the filter expression dynamically via Query.context if you need per-request logic.
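For example, a sketch of a per-request filter, passing the expression as a string in the query context (you could equally pass a Filter.Expression built with FilterExpressionBuilder):

```java
// Sketch: override the retriever's default filter for a single request.
Query query = Query.builder()
        .text("How can I make a better first impression?")
        .context(Map.of(VectorStoreDocumentRetriever.FILTER_EXPRESSION,
                        "type == 'love_advice' && status == 'dating'"))
        .build();

List<Document> docs = retriever.retrieve(query);
```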
Document merging: ConcatenationDocumentJoiner
If you use multiple retrievers (multi-query, multi-source, hybrid search), you’ll end up with many document lists. ConcatenationDocumentJoiner deduplicates and flattens them:
Map<Query, List<List<Document>>> docsPerQuery = ...;
DocumentJoiner joiner = new ConcatenationDocumentJoiner();
List<Document> merged = joiner.join(docsPerQuery);
Under the hood it’s basically:
- Flatten the Map<Query, List<List<Document>>> into a single list
- Deduplicate by document ID
- Preserve scores
It’s simple but exactly what you want right before handing everything to the generation stage.
Step 4: Generation – Advisors, Context, and Error Handling
Spring AI’s RAG story really becomes ergonomic when you wire it into ChatClient advisors.
QuestionAnswerAdvisor – good default, minimal ceremony
QuestionAnswerAdvisor is the fastest way to get RAG working:
Advisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
.searchRequest(SearchRequest.builder()
.similarityThreshold(0.7)
.topK(5)
.build())
.build();
String answer = chatClient.prompt()
.user("How can I improve my social charm within three months?")
.advisors(qaAdvisor)
.call()
.content();
The advisor:
- Takes the user message
- Runs a vector search
- Stitches docs + question into a prompt
- Calls the model
You can override the prompt template if you want strict instructions (“only answer using the context above, otherwise say you don’t know”).
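For example, a sketch of a stricter template (the builder's promptTemplate(...) hook and the placeholder names follow recent Spring AI versions; double-check them against the version you run):

```java
// Sketch: force grounded answers. Keep the placeholders the advisor expects.
PromptTemplate strictTemplate = new PromptTemplate("""
        {query}

        Answer using ONLY the context below. If the context does not contain
        enough information, say you don't know.

        ---------------------
        {question_answer_context}
        ---------------------
        """);

Advisor strictQaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
        .promptTemplate(strictTemplate)
        .build();
```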
RetrievalAugmentationAdvisor – full modular RAG graph
For more control you switch to RetrievalAugmentationAdvisor. It lets you explicitly plug in:
- query transformers (rewrite, compress, translate)
- document retriever (vector, hybrid, custom)
- query augmenter (how to inject context into the final prompt)
Example with a query rewriter + vector retriever:
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
.queryTransformers(RewriteQueryTransformer.builder()
.chatClientBuilder(ChatClient.builder(chatModel))
.build())
.documentRetriever(VectorStoreDocumentRetriever.builder()
.vectorStore(vectorStore)
.similarityThreshold(0.55)
.topK(6)
.build())
.build();
Then:
String reply = chatClient.prompt()
.user("what's the advice of career?")
.advisors(ragAdvisor)
.call()
.content();
ContextualQueryAugmenter – what to do when retrieval finds nothing
By default, RetrievalAugmentationAdvisor is conservative: if retrieval returns no documents, it replaces your user query with an "out of knowledge base, please refuse" style prompt.
You can customize this using ContextualQueryAugmenter:
PromptTemplate emptyContextTemplate = new PromptTemplate("""
You are a relationship advice assistant.
The current question is outside your knowledge base.
Please respond briefly and politely in English, telling the user:
You can only answer relationship-related questions,
and invite them to describe their situation more specifically.
""");
ContextualQueryAugmenter augmenter = ContextualQueryAugmenter.builder()
    // false (the default) routes empty retrievals to the template above;
    // true would pass the original question through untouched instead
    .allowEmptyContext(false)
    .emptyContextPromptTemplate(emptyContextTemplate)
    .build();
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
    .documentRetriever(retriever)
    .queryAugmenter(augmenter)
    .build();
This gives you graceful degradation instead of raw hallucination.
### A custom advisor factory for a “love coach” bot
Putting it all together, you can hide the complexity behind a small factory:
```java
public final class LoveCoachAdvisorFactory {
private LoveCoachAdvisorFactory() {}
public static Advisor forStatus(VectorStore store, String status) {
DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
.vectorStore(store)
.similarityThreshold(0.55)
.topK(4)
.filterExpression(new FilterExpressionBuilder()
.eq("type", "love_advice")
.eq("status", status)
.build())
.build();
ContextualQueryAugmenter augmenter =
LoveAppContextualQueryAugmenterFactory.createInstance();
return RetrievalAugmentationAdvisor.builder()
.documentRetriever(retriever)
.queryAugmenter(augmenter)
.build();
}
}
```
Your chat layer doesn’t need to know anything about vector stores anymore; it just picks an advisor based on the user profile.
Step 5: Tuning Playbook – Making RAG Not Suck
Now to the parts you usually end up rediscovering the hard way.
1. Document strategy first, everything else later
If your knowledge base is incomplete or badly structured, no amount of thresholds or LLM trickery will save you.
Checklist:
- Do we actually have all the content users will ask about?
- Are docs written in the same language users will query with?
- Are critical terms consistently named?
- Did we strip junk (watermarks, repeated headers, screenshot-only pages)?
When in doubt, run offline retrieval tests: generate 50–100 realistic questions and see what the retriever actually surfaces.
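Even a crude loop like this sketch tells you a lot (the questions are whatever you and your domain experts consider realistic):

```java
// Sketch: eyeball what the retriever actually surfaces for realistic questions.
List<String> testQuestions = List.of(
        "How do I rebuild the index after adding new documents?",
        "What does the similarity threshold actually control?");

for (String question : testQuestions) {
    List<Document> hits = vectorStore.similaritySearch(SearchRequest.builder()
            .query(question)
            .topK(5)
            .build());
    System.out.println("Q: " + question);
    // getText() in current Spring AI; older milestones call it getContent()
    hits.forEach(doc -> System.out.println("  -> " + doc.getMetadata().get("source")
            + " | " + doc.getText().substring(0, Math.min(80, doc.getText().length()))));
}
```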
2. Chunking: avoid both over- and under-splitting
Bad chunking shows up as:
- “Half-answer” – the retrieved chunk hints at the right area but cuts off the crucial line
- “Wall of text” – the chunk is huge, includes multiple topics, and confuses the model
Practical patterns:
- Start with something like 500–800 tokens per chunk + overlaps only if necessary
- Prefer semantic or structural boundaries (headings, sentences) to “every N tokens”
- Let cloud vendors’ “smart split” do the heavy lifting when available, but still spot-check
If you use Cloud Model Studio, enabling intelligent chunking on the knowledge base will apply a similar strategy: first split by sentence markers, then adapt chunk boundaries by semantic coherence instead of length alone. You still need to manually fix any obvious mis-splits in the console.
3. Metadata: design it like an index, not an afterthought
Good metadata makes filtering trivial:
- type: faq, tutorial, policy, love_advice, interior_design...
- status: single, dating, married, divorced...
- year / version: so you can prefer newer material
- source: doc name, system, product
Implement metadata as close to the source of truth as possible (e.g. in your CMS or docs repo), then enrich with AI-only fields such as keywords or summary during ingestion.
4. Tuning similarityThreshold and topK
This is where a lot of “RAG feels off” comes from.
- If results feel incomplete (missing relevant chunks) → lower similarityThreshold and/or raise topK
- If results feel noisy (irrelevant chunks) → raise similarityThreshold and/or lower topK
Rules of thumb:
- Start around similarityThreshold = 0.6–0.7, topK = 4–6
- For narrow, highly structured domains, go higher (0.75+)
- For very heterogeneous, noisy docs, go lower but compensate with stronger metadata filters
Always test with a fixed set of labeled queries so you can see whether tuning helps or hurts.
5. Hallucination and refusal behavior
Even with perfect retrieval you’ll get edge cases. Mitigation options:
- In the RAG prompt, explicitly instruct: “If the context doesn’t contain enough information to answer, say you don’t know.”
- Use ContextualQueryAugmenter with a custom empty-context template that refuses out-of-domain questions gracefully
- For high-risk domains (legal/medical), consider separating general LLM chat from knowledge-base-backed Q&A and forcing a refusal when retrieval is empty or very low-score
6. Multi-query expansion – use, but don’t abuse
MultiQueryExpander can boost recall by generating paraphrased queries, but:
- Every extra query costs more retrieval + more LLM time
- It’s often overkill if your docs and metadata are well-structured
If you use it, limit to 3–5 variants, deduplicate aggressively, and monitor cost and latency.
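If you do reach for it, a minimal sketch looks like this (three variants; the example question is arbitrary):

```java
// Sketch: paraphrase the query a few times to boost recall, at extra cost.
MultiQueryExpander expander = MultiQueryExpander.builder()
        .chatClientBuilder(ChatClient.builder(chatModel))
        .numberOfQueries(3) // keep this small; every variant is another retrieval
        .build();

List<Query> variants = expander.expand(new Query("How do I tune the similarity threshold?"));
```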
Step 6: Beyond Basics – Hybrid Retrieval & Higher-Level Architectures
Once the basics are solid, you can start layering in more advanced patterns.
Hybrid retrieval: vector + lexical + structured
No single retrieval method is perfect:
- Vector search understands meaning, but can miss strict keyword or year constraints
- Full-text / lexical search (think Elasticsearch) nails exact matches and numeric filters, but doesn’t get synonyms
- Structured search via metadata is cheap and precise, but only as good as your schema
A robust system typically combines them:
- Filter by metadata (type, status, year)
- Run vector search on the reduced candidate set
- Optionally mix in keyword search for exact matches on IDs, names, etc.
- Merge and re-rank
Spring AI doesn’t force a single pattern—DocumentRetriever is just an interface. You can write your own “hybrid retriever” that fans out to both vector store and, say, Elasticsearch, then uses ConcatenationDocumentJoiner + custom scoring.
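A minimal sketch of that idea, assuming you already have some lexicalRetriever of your own (e.g. an Elasticsearch-backed DocumentRetriever):

```java
// Sketch: fan out to vector + lexical retrieval, then merge with the joiner.
public class HybridDocumentRetriever implements DocumentRetriever {

    private final DocumentRetriever vectorRetriever;
    private final DocumentRetriever lexicalRetriever; // assumption: your own ES-backed retriever
    private final DocumentJoiner joiner = new ConcatenationDocumentJoiner();

    public HybridDocumentRetriever(DocumentRetriever vectorRetriever,
                                   DocumentRetriever lexicalRetriever) {
        this.vectorRetriever = vectorRetriever;
        this.lexicalRetriever = lexicalRetriever;
    }

    @Override
    public List<Document> retrieve(Query query) {
        List<List<Document>> resultSets = List.of(
                vectorRetriever.retrieve(query),
                lexicalRetriever.retrieve(query));
        // ConcatenationDocumentJoiner flattens and deduplicates by document ID.
        return joiner.join(Map.of(query, resultSets));
    }
}
```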
Re-ranking and multi-stage pipelines
For large corpora, you may want a two-stage retrieval:
- Fast, approximate search (HNSW index, low-dimensional embeddings)
- Slow, precise re-ranking with a cross-encoder model (e.g., a reranker that scores each (query, chunk) pair)
The first stage optimizes recall, the second optimizes precision. Spring AI’s modular design makes it straightforward to put the reranker into the post-retrieval step before sending docs to the LLM.
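A minimal second-stage sketch, with a stand-in scorer (CrossEncoderScorer is hypothetical; swap in whatever reranker you actually call):

```java
// Sketch: re-rank first-stage candidates with a cross-encoder-style scorer.
interface CrossEncoderScorer {
    double score(String query, String passage); // hypothetical: local model, HTTP API, ...
}

public class CrossEncoderReranker {

    private final CrossEncoderScorer scorer;
    private final int keepTopN;

    public CrossEncoderReranker(CrossEncoderScorer scorer, int keepTopN) {
        this.scorer = scorer;
        this.keepTopN = keepTopN;
    }

    public List<Document> rerank(Query query, List<Document> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document doc) -> scorer.score(query.text(), doc.getText())).reversed())
                .limit(keepTopN)
                .toList();
    }
}
```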
Architecture patterns: central knowledge service
At system level, a neat pattern is to isolate RAG into its own service:
- A knowledge service owns ETL, vector stores, filters, and advisors
- Other microservices call it via a simple API:
answerQuestion(userId, query) - The knowledge service picks the right advisor (e.g., based on tenant, domain, user status)
This gives you:
- One place to tune thresholds and test recall
- One place to monitor cost and latency
- One place to upgrade embedding models or vector DBs
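A minimal sketch of such a facade (UserProfileRepository and its findStatus method are placeholders for however you look up the user; the advisor factory is the one from Step 4):

```java
// Sketch: one service owns RAG internals; callers only see answerQuestion().
@Service
public class KnowledgeService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final UserProfileRepository profiles; // placeholder for your user lookup

    public KnowledgeService(ChatClient.Builder chatClientBuilder,
                            VectorStore vectorStore,
                            UserProfileRepository profiles) {
        this.chatClient = chatClientBuilder.build();
        this.vectorStore = vectorStore;
        this.profiles = profiles;
    }

    public String answerQuestion(String userId, String query) {
        String status = profiles.findStatus(userId); // e.g. "single", "dating"
        Advisor advisor = LoveCoachAdvisorFactory.forStatus(vectorStore, status);
        return chatClient.prompt()
                .user(query)
                .advisors(advisor)
                .call()
                .content();
    }
}
```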
Wrap-up
RAG is not just “add embeddings, stir with LLM, ship.” It’s a pipeline:
- Document strategy decides your ceiling
- ETL decides whether you actually reach that ceiling
- Vector store + retrieval decide how much of the right stuff you bring back
- Advisors + prompts decide how safely your model uses that stuff
Spring AI gives Java developers real building blocks instead of a giant black box: DocumentReader / DocumentTransformer / DocumentWriter, VectorStore, DocumentRetriever, Advisor, and a bunch of utilities around them.
If you treat these as serious, tunable components instead of “just configuration,” your RAG system will stop feeling like a fragile demo and start feeling like a real product.
