RAG went from research paper acronym to "every slide deck ever" in about a year. The pattern is simple on paper: embed documents → embed question → find similar chunks → feed to LLM.

But if you've tried building something non-trivial (say, a domain expert assistant or an internal knowledge bot), you already know the bad news:

- Chunking is either **too small** (you lose context) or **too big** (you drag in noise)
- Vector DB queries return "something vaguely related" instead of exactly what you need
- The model makes stuff up when retrieval fails, but does it **confidently**
- Every blog post explains the concept, almost none show how to **tune** the system

Spring AI actually ships a pretty solid set of primitives for doing RAG **properly**. The goal of this article is to walk through those pieces as an end-to-end workflow, and show where you can twist the knobs in real projects.

We'll follow the real life-cycle of a RAG system:

1. **Indexing / ETL** – document ingestion, cleaning, chunking, metadata
2. **Vectorization & storage** – embeddings, vector DBs, batching
3. **Retrieval** – pre-retrieval query shaping, semantic search, filters, merging
4. **Generation** – query + context orchestration, error handling, and advisors
5. **Tuning & advanced patterns** – thresholds, chunk sizes, hybrid retrieval, and more

All examples are in Java/Spring, but the ideas carry over to any stack.

## Step 1: ETL – Turning Messy Docs into AI-Native Knowledge

A lot of RAG failures are caused before the first token hits the LLM: the documents themselves are a mess.

### Make "AI-native" documents first

If you're serious about RAG, stop thinking of your PDFs as ground truth. Instead, think in terms of **AI-native documents**:

- Clear **hierarchical structure** (headings, subheadings)
- Long lists split into logical groups
- Consistent **terminology**
- Minimal noise: watermarks, giant tables, screenshots of code, etc.

You can absolutely use an LLM offline to normalize docs into a clean Markdown or HTML format before they ever hit Spring AI.
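That normalization pass doesn't have to be manual. A minimal sketch of an offline cleanup step with `ChatClient` (the prompt wording and bean wiring here are assumptions to adapt, not a prescribed API):

```java
@Component
public class DocumentNormalizer {

    private final ChatClient chatClient;

    public DocumentNormalizer(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Turn messy source text into clean, structured Markdown before it enters the ETL pipeline.
    public String toCleanMarkdown(String rawText) {
        return chatClient.prompt()
                .system("Rewrite the following content as clean, well-structured Markdown. "
                        + "Keep the heading hierarchy, merge broken fragments, "
                        + "and drop watermarks, page numbers, and other boilerplate.")
                .user(rawText)
                .call()
                .content();
    }
}
```

Run it once per source document when the corpus changes, not at query time, so you pay the LLM cost exactly once.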
### Spring AI's document model

Spring AI wraps content as a `Document`:

- `content`: the text that will eventually be embedded
- `metadata`: a `Map<String, Object>` with arbitrary keys (tags like `type`, `year`, `status`, `source`…)
- optional `attachments` (for multi-modal scenarios)

The ETL pipeline is built around three interfaces:

- `DocumentReader` – **Extract**: load raw docs from somewhere
- `DocumentTransformer` – **Transform**: split, enrich, format
- `DocumentWriter` – **Load**: write into a vector DB, file, Redis, etc.

### Extract: DocumentReader

Spring AI ships readers for JSON, text, Markdown, PDF, HTML, and more. For example, a JSON reader that pulls specific paths with JSON Pointer:

```java
@Component
public class ProductJsonReader {

    private final Resource resource;

    public ProductJsonReader(@Value("classpath:products.json") Resource resource) {
        this.resource = resource;
    }

    public List<Document> read() {
        JsonDocumentReaderConfig config = JsonDocumentReaderConfig.builder()
                .withPointers("/products/*/description")
                .withAdditionalMetadata("source", "product-catalog")
                .build();
        JsonDocumentReader reader = new JsonDocumentReader(resource, config);
        return reader.get();
    }
}
```

The same pattern applies for Markdown, PDFs, emails, videos, GitHub docs, databases, etc. Each reader turns your "source world" into `List<Document>`.
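If a source has no ready-made reader, you can also construct `Document` instances by hand. A minimal sketch (the metadata keys here are just examples, not a required schema):

```java
// A hand-built document: raw text plus whatever metadata you want to filter on later.
Document doc = new Document(
        "How to improve your dating profile in three steps ...",
        Map.of("type", "love_advice", "status", "single", "year", 2025));
```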
### Transform: chunking, enrichment, formatting

This is where most of the interesting tuning happens.

#### 1. Chunking with TokenTextSplitter

`TokenTextSplitter` is the workhorse splitter, based on token counts plus simple heuristics (sentence boundaries, newlines, etc.). It's a `DocumentTransformer`, so you can stack it with others:

```java
@Component
public class SmartChunkTransformer {

    public List<Document> split(List<Document> docs) {
        // slightly smaller chunks than the defaults
        TokenTextSplitter splitter = new TokenTextSplitter(
                700,   // target tokens per chunk
                280,   // min chars before we try to break
                8,     // min chunk length to embed
                8000,  // max number of chunks
                true   // keep separators like newlines
        );
        return splitter.apply(docs);
    }
}
```

Play with chunk size **per use-case**:

- Legal or design specs → longer chunks; you need intact reasoning blocks
- Chatty Q&A or support docs → shorter chunks; you want sharp topical slices

#### 2. Metadata enrichment with an LLM

Don't rely only on raw text similarity. Use an LLM once during ingestion to extract semantic features into metadata, then filter on them cheaply at query time. Spring AI has `KeywordMetadataEnricher` and `SummaryMetadataEnricher`, both built on a `ChatModel`:

```java
@Component
public class KeywordEnricher {

    private final ChatModel chatModel;

    public KeywordEnricher(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public List<Document> enrich(List<Document> docs) {
        KeywordMetadataEnricher enricher =
                new KeywordMetadataEnricher(chatModel, List.of("keywords", "category"));
        return enricher.apply(docs);
    }
}
```

You can chain it right after chunking: split → add keywords → maybe add summaries.

#### 3. DefaultContentFormatter – underused but powerful

`DefaultContentFormatter` lets you control **exactly** how metadata and content are stitched into the final string that goes into the embedding or the prompt. Example:

```java
DefaultContentFormatter formatter = DefaultContentFormatter.builder()
        .withMetadataTemplate("{key}: {value}")
        .withMetadataSeparator(" ")
        .withTextTemplate("{metadata_string} {content}")
        // Don't leak embedding IDs into prompts
        .withExcludedInferenceMetadataKeys("embedding_id", "vector_id")
        .build();
```

Use it when you want the LLM to see something like:

```text
type: love_advice status: single year: 2025 How can I make myself more attractive when dating?...
```

instead of a bare blob of text.

### Load: DocumentWriter & ETL wiring

`DocumentWriter` is just a `Consumer<List<Document>>`.
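Which means the simplest possible writer is a lambda, handy for eyeballing chunks before you commit to a vector store. A sketch (it assumes `DocumentWriter` adds no extra abstract methods on top of `Consumer`, and `chunks` is whatever your splitter just produced):

```java
// Dump chunks to stdout instead of writing them anywhere.
DocumentWriter debugWriter = docs -> docs.forEach(System.out::println);

debugWriter.accept(chunks);
```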
Out of the box there's `FileDocumentWriter` (for plain files) and `VectorStoreWriter` (for embedding + vector DB).

A minimal ETL could look like this:

```java
@Component
public class KnowledgeBaseIndexer {

    private final ProductJsonReader reader;
    private final SmartChunkTransformer splitter;
    private final KeywordEnricher enricher;
    private final VectorStore vectorStore;

    public KnowledgeBaseIndexer(ProductJsonReader reader,
                                SmartChunkTransformer splitter,
                                KeywordEnricher enricher,
                                VectorStore vectorStore) {
        this.reader = reader;
        this.splitter = splitter;
        this.enricher = enricher;
        this.vectorStore = vectorStore;
    }

    public void rebuildIndex() {
        List<Document> raw = reader.read();
        List<Document> chunks = splitter.split(raw);
        List<Document> enriched = enricher.enrich(chunks);
        vectorStore.add(enriched);
    }
}
```

This pipeline alone already puts you ahead of most "we just embedded everything once" demos.

## Step 2: Vectorization & Storage – Choosing Your Retrieval Backbone

Once you have clean, chunked, enriched documents, you need a place to put them. Spring AI's `VectorStore` interface is intentionally simple:

```java
public interface VectorStore extends DocumentWriter {

    void add(List<Document> documents);

    void delete(List<String> ids);

    void delete(FilterExpression filterExpression);

    List<Document> similaritySearch(SearchRequest request);

    default String getName() {
        return getClass().getSimpleName();
    }
}
```

The important bit for tuning is `SearchRequest`:

```java
SearchRequest request = SearchRequest.builder()
        .query("How does Spring AI handle RAG?")
        .topK(5)
        .similarityThreshold(0.75)
        .filterExpression("category == 'spring-ai' && year >= '2024'")
        .build();

List<Document> docs = vectorStore.similaritySearch(request);
```
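The filter-aware `delete(...)` in the interface above also makes incremental re-indexing straightforward. A sketch, assuming your chunks carry a `source` metadata key and your Spring AI version exposes a string-filter delete overload:

```java
// Refresh one source without wiping the whole store.
public void reindexProductCatalog(VectorStore vectorStore, List<Document> freshChunks) {
    vectorStore.delete("source == 'product-catalog'"); // drop the stale chunks for this source
    vectorStore.add(freshChunks);                       // embed + insert the new ones
}
```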
### Which vector store?

Spring AI ships starters for many backends: in-memory, Redis, Elasticsearch, PGVector, Qdrant, etc. Spring AI Alibaba adds cloud-native options via DashScope (`DashScopeCloudStore`).

For backend-heavy Java shops, **PGVector on PostgreSQL** is incredibly pragmatic:

- You probably already run Postgres
- You keep relational + vector data in one place
- No extra infrastructure team to convince

### PGVector with Spring AI (hand-rolled config)

Instead of relying on auto-config, you can wire `PgVectorStore` yourself and pick exactly which `EmbeddingModel` you want:

```xml
<!-- pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-jdbc</artifactId>
</dependency>
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <scope>runtime</scope>
</dependency>
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-pgvector-store</artifactId>
    <version>1.0.0-M7</version>
</dependency>
```

```yaml
# application.yml
spring:
  datasource:
    url: jdbc:postgresql://YOUR_HOST:5432/rag_demo
    username: rag_user
    password: super_secret
  ai:
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        # dimensions: 1536   # omit to let it derive from the embedding model
```

Now the config class:

```java
@Configuration
public class PgVectorConfig {

    @Bean
    public VectorStore pgVectorStore(JdbcTemplate jdbcTemplate,
                                     @Qualifier("dashscopeEmbeddingModel") EmbeddingModel embeddingModel) {
        return PgVectorStore.builder(jdbcTemplate, embeddingModel)
                .dimensions(1536)                              // match your embedding model
                .distanceType(PgDistanceType.COSINE_DISTANCE)
                .build();
    }
}
```

A common gotcha: if you use **multiple `EmbeddingModel` beans** (e.g., Ollama + DashScope), make sure you qualify the one you actually want for this store.

### BatchingStrategy – don't blow up your embed API

Embedding thousands of chunks in a single call will eventually hit context-window or rate limits.
Spring AI's `BatchingStrategy` lets you split documents into sane batches before embedding:

```java
@Configuration
public class EmbeddingBatchConfig {

    @Bean
    public BatchingStrategy batchingStrategy() {
        return new TokenCountBatchingStrategy(
                EncodingType.CL100K_BASE,
                8192,   // max tokens per batch
                0.15    // leave some safety headroom
        );
    }
}
```

You can also implement your own `BatchingStrategy` if your vector DB has, for example, hard throughput limits and you want to throttle inserts explicitly.

## Step 3: Retrieval – Query Shaping, Filters, and Result Merging

Once your store is populated, you *still* can't just do `similaritySearch(userText)` and call it a day. Spring AI breaks retrieval into pre-retrieval, retrieval, and post-retrieval stages.

### Pre-retrieval: shaping the query

#### RewriteQueryTransformer – clean up messy user queries

Users don't speak like search queries. `RewriteQueryTransformer` uses an LLM to rewrite a noisy query into something more explicit and model-friendly:

```java
@Component
public class QueryRewriter {

    private final QueryTransformer transformer;

    public QueryRewriter(ChatModel chatModel) {
        ChatClient.Builder builder = ChatClient.builder(chatModel);
        this.transformer = RewriteQueryTransformer.builder()
                .chatClientBuilder(builder)
                .build();
    }

    public Query rewrite(String text) {
        return transformer.transform(new Query(text));
    }
}
```

Plug this into your RAG pipeline right before you call the retriever.

#### TranslationQueryTransformer – cross-language users, single-language embeddings

If your embedding model is English-only but your users speak Chinese, Spanish, etc., you can stick a `TranslationQueryTransformer` in front. It's literally "LLM as translation layer" – simple but not cheap. For production, many teams prefer a dedicated translation API plus a custom transformer.

#### CompressionQueryTransformer – distill long chat history

Multi-turn chats tend to accumulate context. `CompressionQueryTransformer` compresses the history plus the latest user message into one standalone query. Perfect when you use conversation history but your vector search should only see the final "intent".
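Both are one-liners to set up. A sketch with the builder options trimmed to the basics (the `targetLanguage` value and the chained call are illustrative assumptions; in a real pipeline the `Query` would also carry the chat history for the compressor to fold in):

```java
ChatClient.Builder builder = ChatClient.builder(chatModel);

// Translate the incoming query into the language your embeddings were trained on.
QueryTransformer translator = TranslationQueryTransformer.builder()
        .chatClientBuilder(builder)
        .targetLanguage("english")
        .build();

// Collapse conversation history + follow-up question into one standalone query.
QueryTransformer compressor = CompressionQueryTransformer.builder()
        .chatClientBuilder(builder)
        .build();

Query standalone = compressor.transform(
        translator.transform(new Query("And is it expensive?")));
```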
### Retrieval: VectorStoreDocumentRetriever

The main entry point for document retrieval is `DocumentRetriever`. For vector-based RAG, you usually use `VectorStoreDocumentRetriever`:

```java
FilterExpressionBuilder b = new FilterExpressionBuilder();

DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
        .vectorStore(vectorStore)
        .similarityThreshold(0.6)
        .topK(4)
        .filterExpression(b.and(
                b.eq("type", "love_advice"),
                b.eq("status", "single")).build())
        .build();

List<Document> docs = retriever.retrieve(new Query("How can I become more attractive?"));
```

Filters here are **metadata filters**, not semantic ones. This is why the earlier enrichment pays off: you can reduce the search space to just "love_advice + single" before even running similarity. You can also pass the filter expression dynamically via `Query.context` if you need per-request logic.

### Document merging: ConcatenationDocumentJoiner

If you use multiple retrievers (multi-query, multi-source, hybrid search), you'll end up with many document lists. `ConcatenationDocumentJoiner` deduplicates and flattens them:

```java
Map<Query, List<List<Document>>> docsPerQuery = ...;

DocumentJoiner joiner = new ConcatenationDocumentJoiner();
List<Document> merged = joiner.join(docsPerQuery);
```

Under the hood it's basically:

1. Flatten `Map<Query, List<List<Document>>>` into a single list
2. Deduplicate by document ID
3. Preserve scores

It's simple, but exactly what you want right before handing everything to the generation stage.

## Step 4: Generation – Advisors, Context, and Error Handling

Spring AI's RAG story really becomes ergonomic when you wire it into `ChatClient` advisors.

### QuestionAnswerAdvisor – good default, minimal ceremony

`QuestionAnswerAdvisor` is the fastest way to get RAG working:

```java
Advisor qaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
        .searchRequest(SearchRequest.builder()
                .similarityThreshold(0.7)
                .topK(5)
                .build())
        .build();

String answer = chatClient.prompt()
        .user("How can I improve my social charm within three months?")
        .advisors(qaAdvisor)
        .call()
        .content();
```

The advisor:

1. Takes the user message
2. Runs a vector search
3. Stitches docs + question into a prompt
4. Calls the model

You can override the prompt template if you want strict instructions ("only answer using the context above, otherwise say you don't know").
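For example, a sketch only: the builder hook and the placeholder names below are assumptions that vary across Spring AI versions, so check the `QuestionAnswerAdvisor` defaults for your release before copying this:

```java
// Hypothetical strict template; placeholder names assumed to match the advisor's default template.
PromptTemplate strictTemplate = new PromptTemplate("""
        {query}

        Answer ONLY from the context below.
        If the context does not contain the answer, say you don't know.

        ---------------------
        {question_answer_context}
        ---------------------
        """);

Advisor strictQaAdvisor = QuestionAnswerAdvisor.builder(vectorStore)
        .promptTemplate(strictTemplate)   // assumed builder method name
        .searchRequest(SearchRequest.builder().topK(5).build())
        .build();
```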
### RetrievalAugmentationAdvisor – full modular RAG graph

For more control, switch to `RetrievalAugmentationAdvisor`. It lets you explicitly plug in:

- **query transformers** (rewrite, compress, translate)
- a **document retriever** (vector, hybrid, custom)
- a **query augmenter** (how to inject context into the final prompt)

Example with a query rewriter + vector retriever:

```java
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .queryTransformers(RewriteQueryTransformer.builder()
                .chatClientBuilder(ChatClient.builder(chatModel))
                .build())
        .documentRetriever(VectorStoreDocumentRetriever.builder()
                .vectorStore(vectorStore)
                .similarityThreshold(0.55)
                .topK(6)
                .build())
        .build();
```

Then:

```java
String reply = chatClient.prompt()
        .user("What career advice do you have for me?")
        .advisors(ragAdvisor)
        .call()
        .content();
```

### ContextualQueryAugmenter – what to do when retrieval finds nothing

By default, `RetrievalAugmentationAdvisor` is conservative: if retrieval returns no docs, it swaps your user query for an "out of knowledge base, please refuse" prompt. You can customize this using `ContextualQueryAugmenter`:

```java
PromptTemplate emptyContextTemplate = new PromptTemplate("""
        You are a relationship advice assistant.
        The current question is outside your knowledge base.
        Please respond briefly and politely in English, telling the user:
        you can only answer relationship-related questions,
        and invite them to describe their situation more specifically.
        """);

ContextualQueryAugmenter augmenter = ContextualQueryAugmenter.builder()
        .allowEmptyContext(true)   // keep the original question even when no docs are found
        .emptyContextPromptTemplate(emptyContextTemplate)
        .build();
```

```java
Advisor ragAdvisor = RetrievalAugmentationAdvisor.builder()
        .documentRetriever(retriever)
        .queryAugmenter(augmenter)
        .build();
```

This gives you graceful degradation instead of raw hallucination.
### A custom advisor factory for a "love coach" bot

Putting it all together, you can hide the complexity behind a small factory:

```java
public final class LoveCoachAdvisorFactory {

    private LoveCoachAdvisorFactory() {}

    public static Advisor forStatus(VectorStore store, String status) {
        FilterExpressionBuilder b = new FilterExpressionBuilder();

        DocumentRetriever retriever = VectorStoreDocumentRetriever.builder()
                .vectorStore(store)
                .similarityThreshold(0.55)
                .topK(4)
                .filterExpression(b.and(
                        b.eq("type", "love_advice"),
                        b.eq("status", status)).build())
                .build();

        ContextualQueryAugmenter augmenter =
                LoveAppContextualQueryAugmenterFactory.createInstance();

        return RetrievalAugmentationAdvisor.builder()
                .documentRetriever(retriever)
                .queryAugmenter(augmenter)
                .build();
    }
}
```

Your chat layer doesn't need to know anything about vector stores anymore; it just picks an advisor based on the user profile.

## Step 5: Tuning Playbook – Making RAG Not Suck

Now to the parts you usually end up rediscovering the hard way.

### 1. Document strategy first, everything else later

If your knowledge base is incomplete or badly structured, no amount of thresholds or LLM trickery will save you. Checklist:

- Do we actually have **all** the content users will ask about?
- Are docs written in the **same language** users will query with?
- Are critical terms consistently named?
- Did we strip junk (watermarks, repeated headers, screenshot-only pages)?

When in doubt, run offline retrieval tests: generate 50–100 realistic questions and see what the retriever actually surfaces.
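That test doesn't need a framework; a throwaway harness over the retriever you already have is enough. A sketch (`getText()` is the accessor name in recent Spring AI versions, older ones used `getContent()`):

```java
// Run a list of realistic questions through the retriever and eyeball what comes back.
public void retrievalSmokeTest(DocumentRetriever retriever, List<String> questions) {
    for (String question : questions) {
        List<Document> hits = retriever.retrieve(new Query(question));
        System.out.println("Q: " + question);
        hits.forEach(doc -> System.out.println(
                "   -> " + doc.getMetadata().get("source") + " | "
                + doc.getText().substring(0, Math.min(80, doc.getText().length()))));
    }
}
```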
### 2. Chunking: avoid both over- and under-splitting

Bad chunking shows up as:

- **"Half-answer"** – the retrieved chunk hints at the right area but cuts off the crucial line
- **"Wall of text"** – the chunk is huge, includes multiple topics, and confuses the model

Practical patterns:

- Start with something like **500–800 tokens** per chunk, plus overlaps only if necessary
- Prefer **semantic or structural boundaries** (headings, sentences) to "every N tokens"
- Let cloud vendors' **"smart split"** do the heavy lifting when available, but still spot-check

If you use Cloud Model Studio, enabling intelligent chunking on the knowledge base applies a similar strategy: first split by sentence markers, then adapt chunk boundaries by semantic coherence instead of length alone. You still need to manually fix any obvious mis-splits in the console.

### 3. Metadata: design it like an index, not an afterthought

Good metadata makes filtering trivial:

- `type`: faq, tutorial, policy, love_advice, interior_design...
- `status`: single, dating, married, divorced...
- `year` / `version`: so you can prefer newer material
- `source`: doc name, system, product

Implement metadata as close to the **source of truth** as possible (e.g., in your CMS or docs repo), then enrich with AI-only fields such as `keywords` or `summary` during ingestion.

### 4. Tuning similarityThreshold and topK

This is where a lot of "RAG feels off" comes from.

- If results feel **incomplete** (missing relevant chunks) → **lower** `similarityThreshold` and/or **raise** `topK`
- If results feel **noisy** (irrelevant chunks) → **raise** `similarityThreshold` and/or **lower** `topK`

Rules of thumb:

- Start around `similarityThreshold = 0.6–0.7`, `topK = 4–6`
- For narrow, highly structured domains, go higher (0.75+)
- For very heterogeneous, noisy docs, go lower but compensate with **stronger metadata filters**

Always test with a **fixed set of labeled queries** so you can see whether tuning helps or hurts.
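It also helps to externalize the two knobs so tuning doesn't mean a redeploy. A sketch (the property prefix and names are made up for illustration):

```java
// application.yml:
//   rag:
//     retrieval:
//       similarity-threshold: 0.65
//       top-k: 5
@ConfigurationProperties(prefix = "rag.retrieval")
public record RagRetrievalProperties(double similarityThreshold, int topK) {}

@Configuration
@EnableConfigurationProperties(RagRetrievalProperties.class)
class RetrieverConfig {

    @Bean
    DocumentRetriever tunableRetriever(VectorStore vectorStore, RagRetrievalProperties props) {
        // The retriever picks up whatever values the current environment defines.
        return VectorStoreDocumentRetriever.builder()
                .vectorStore(vectorStore)
                .similarityThreshold(props.similarityThreshold())
                .topK(props.topK())
                .build();
    }
}
```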
### 5. Hallucination and refusal behavior

Even with perfect retrieval you'll get edge cases. Mitigation options:

- In the RAG prompt, explicitly instruct: *"If the context doesn't contain enough information to answer, say you don't know."*
- Use `ContextualQueryAugmenter` with a custom **empty-context template** that refuses out-of-domain questions gracefully
- For high-risk domains (legal/medical), consider **separating** general LLM chat from knowledge-base-backed Q&A and forcing a refusal when retrieval is empty or very low-score

### 6. Multi-query expansion – use, but don't abuse

`MultiQueryExpander` can boost recall by generating paraphrased queries, but:

- Every extra query costs **more retrieval + more LLM time**
- It's often overkill if your docs and metadata are well-structured

If you use it, limit it to **3–5 variants**, deduplicate aggressively, and monitor cost and latency.

## Step 6: Beyond Basics – Hybrid Retrieval & Higher-Level Architectures

Once the basics are solid, you can start layering in more advanced patterns.

### Hybrid retrieval: vector + lexical + structured

No single retrieval method is perfect:

- **Vector search** understands meaning, but can miss strict keyword or year constraints
- **Full-text / lexical search** (think Elasticsearch) nails **exact matches** and numeric filters, but doesn't get synonyms
- **Structured search** via metadata is cheap and precise, but only as good as your schema

A robust system typically combines them:

1. Filter by metadata (`type`, `status`, `year`)
2. Run vector search on the reduced candidate set
3. Optionally mix in keyword search for exact matches on IDs, names, etc.
4. Merge and re-rank

Spring AI doesn't force a single pattern: `DocumentRetriever` is just an interface. You can write your own "hybrid retriever" that fans out to both the vector store and, say, Elasticsearch, then uses `ConcatenationDocumentJoiner` plus custom scoring.
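Such a retriever can stay tiny. A sketch (the `keywordSearch(...)` call is a placeholder for whatever lexical backend you use):

```java
public class HybridDocumentRetriever implements DocumentRetriever {

    private final DocumentRetriever vectorRetriever;
    private final DocumentJoiner joiner = new ConcatenationDocumentJoiner();

    public HybridDocumentRetriever(DocumentRetriever vectorRetriever) {
        this.vectorRetriever = vectorRetriever;
    }

    @Override
    public List<Document> retrieve(Query query) {
        List<Document> semantic = vectorRetriever.retrieve(query);   // vector store
        List<Document> lexical = keywordSearch(query.text());        // e.g. Elasticsearch
        // Dedupe and flatten both result lists before the generation stage.
        return joiner.join(Map.of(query, List.of(semantic, lexical)));
    }

    private List<Document> keywordSearch(String text) {
        // placeholder: call your full-text index here and map hits to Documents
        return List.of();
    }
}
```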
### Re-ranking and multi-stage pipelines

For large corpora, you may want a **two-stage retrieval**:

1. Fast, approximate search (HNSW index, low-dimensional embeddings)
2. Slow, precise re-ranking with a cross-encoder model (e.g., a reranker that scores each (query, chunk) pair)

The first stage optimizes **recall**, the second optimizes **precision**. Spring AI's modular design makes it straightforward to slot the reranker into the **post-retrieval** step, before the documents reach the LLM.

### Architecture patterns: central knowledge service

At the system level, a neat pattern is to isolate RAG into its own service:

- A **knowledge service** owns ETL, vector stores, filters, and advisors
- Other microservices call it via a simple API: `answerQuestion(userId, query)`
- The knowledge service picks the right advisor (e.g., based on tenant, domain, user status)

This gives you:

- One place to tune thresholds and test recall
- One place to monitor cost and latency
- One place to upgrade embedding models or vector DBs

## Wrap-up

RAG is not just "add embeddings, stir with LLM, ship." It's a pipeline:

- **Document strategy** decides your ceiling
- **ETL** decides whether you actually reach that ceiling
- **Vector store + retrieval** decide how much of the right stuff you bring back
- **Advisors + prompts** decide how safely your model uses that stuff

Spring AI gives Java developers real building blocks instead of a giant black box: `DocumentReader` / `DocumentTransformer` / `DocumentWriter`, `VectorStore`, `DocumentRetriever`, `Advisor`, and a bunch of utilities around them.

If you treat these as serious, tunable components instead of "just configuration," your RAG system will stop feeling like a fragile demo and start feeling like a real product.