When people talk about Retrieval-Augmented Generation (RAG), most articles sound like the same three-paragraph explanation: “LLMs, retrieval fixes things, add vectors, done.” That’s cute, but anyone who has tried to run RAG in production, under real latency budgets, with constantly changing catalogs, PDFs, product images, internal manuals, and messy enterprise data, knows it’s nothing like the demo.

I learned this the hard way. Last year, we worked with an enterprise retailer whose product catalog was closer to a living organism than a dataset: 20+ million SKUs, 3–5 images per product, thousands of updates per second, and plenty more fun besides. Our early RAG prototype survived about two days before falling apart. Embeddings went stale, image–text matches drifted, latency shot past 1.5 seconds, and we got some hilarious mismatches, like a brake pad image being returned next to a lawn mower manual.

That’s when we realized:

- Text-only RAG is not enough.
- Batch RAG is not enough.
- “Clean doc inputs” definitely do not exist.

Scaling multimodal RAG requires rethinking the entire pipeline, from ingestion to retrieval. This article summarizes the lessons we learned from scaling streaming multimodal RAG in real-world production systems: the practical stuff that doesn’t show up in glossy RAG architecture diagrams.

## Why Classic RAG Stops Working at Scale

RAG is simple in theory:

1. Retrieve relevant chunks
2. Add them to the LLM prompt
3. Generate the answer

In reality, three things hit you immediately.

### Enterprise data is rarely text-only

Manuals contain screenshots. Contracts include tables. E-commerce depends on product photos. A “text-only” RAG system leaves half the knowledge base unused.

### Batch indexing becomes useless

For our retailer, a typical product changed 5–20 times per day. Batch updates are simply too slow. By the time your embeddings finish reindexing, the data is already outdated.

### Multimodal alignment is weak

Images belong to specific text segments. If chunking goes wrong even slightly, retrieval quality collapses. We learned this when a model started bundling product images with unrelated marketing descriptions: the answers were technically correct but entirely ungrounded.
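For contrast, here is roughly what the classic, text-only loop from the list above looks like in code. This is a minimal sketch with placeholder `embed`, `vector_db`, and `llm` clients rather than any specific library; everything in the rest of this article exists because this loop is not enough.

```python
# Minimal sketch of the classic, text-only RAG loop (placeholder clients, not a real API).
def answer(question: str, embed, vector_db, llm, k: int = 5) -> str:
    # 1. Retrieve relevant chunks by embedding similarity.
    query_vec = embed(question)
    chunks = vector_db.search(query_vec, top_k=k)

    # 2. Add them to the LLM prompt.
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"

    # 3. Generate the answer.
    return llm.generate(prompt)
```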
## The Real Problem: Multimodal Chunking

After plenty of back-and-forth, we concluded: chunking is the heart of multimodal RAG. If you get chunking wrong, no vector database or LLM will keep you alive.

### Text chunking we ended up using

- Sliding windows (512–1024 tokens)
- Section-aware splits (headings, clauses)
- Aggressive removal of boilerplate (menus, footers)

### Image chunking

We store each image with:

- URI
- Caption (BLIP/SigLIP captions worked far better than CLIP for retail images)
- Tags (OCR, object detection)
- Embeddings

### The “bundle everything or nothing” rule

We eventually moved to bundling each text chunk with the nearest one or two images, not all images from the product page. This alone improved retrieval consistency by 20–25% in hybrid search.

A quick example of a chunk:

```yaml
chunk_id: doc123_sec2_imgB
text: "Section 2. Brake pad installation…"
image_uri: gcs://products/brakepadB.jpg
caption: "close-up photo of front brake pad"
tags: ["brake pad", "front axle"]
emb_text: [...]
emb_image: [...]
timestamp: 2025-09-24T12:01:00Z
```

Small thing, huge move ;)

## Streaming Ingestion: The Part Everyone Underestimates

80–90% of the engineering pain came before retrieval. Our ingestion pipeline eventually looked like this:

### Event sourcing

Before adopting event sourcing, we tried polling the upstream systems (CMS, PIM, etc.) through their REST APIs to detect updates. That quickly became a liability: polling introduced data staleness, API freezes from throttling, unpredictable batching, and even missed updates during traffic spikes. Moving to event-driven ingestion (Kafka, Pub/Sub, or similar) solved the latency and consistency issues and gave us real-time guarantees.

### Preprocessing

This part is never fancy:

- Language detection
- OCR for PDFs
- EXIF stripping
- Image normalization

We lost a good amount of time debugging a single vendor whose PDFs embedded every image twice “for compatibility reasons.”

### Multimodal chunking

The most expensive step computationally. If you do it poorly upstream, your reranker pays for it in latency downstream.

### Embedding generation

We used different encoders for text and images. One interesting lesson: CLIP works, but for e-commerce SigLIP performed slightly better, especially with small parts like auto components.

### Upserts into the vector DB

Our rule: never rebuild the index unless absolutely necessary. Upserts and updates keep the system fresh without downtime.
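To make the chunking step concrete, here is a rough sketch of what a `multimodal_chunk` step can look like under the bundling rule above (each text chunk paired with its one or two nearest images). The `doc` layout and every helper name here are illustrative assumptions, not our exact implementation.

```python
# Illustrative multimodal chunker: sliding windows over text, nearest-image bundling.
# Assumes doc = {"doc_id": str,
#                "sections": [{"id": str, "text": str, "position": int}, ...],
#                "images":   [{"uri": str, "caption": str, "tags": list, "position": int}, ...]}
def multimodal_chunk(doc, window=1024, overlap=128, max_images=2):
    chunks = []
    stride = window - overlap
    for section in doc["sections"]:
        # Sliding window over the section text (character-based here;
        # real token counting is omitted for brevity).
        for i, start in enumerate(range(0, len(section["text"]), stride)):
            text = section["text"][start:start + window]

            # Bundle only the one or two images closest to this section,
            # not every image on the product page.
            nearby = sorted(
                doc["images"],
                key=lambda img: abs(img["position"] - section["position"]),
            )[:max_images]

            chunks.append({
                "chunk_id": f"{doc['doc_id']}_{section['id']}_{i}",
                "text": text,
                "image_uris": [img["uri"] for img in nearby],
                "captions": [img["caption"] for img in nearby],
                "tags": [tag for img in nearby for tag in img["tags"]],
            })
    return chunks
```

The end-to-end ingestion loop below slots a step like this in between preprocessing and embedding.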
A small end-to-end example of the ingestion loop:

```python
# kafka, preprocess, multimodal_chunk, embed and vector_db are the components described above.
for event in kafka.consume("products.raw"):
    doc = preprocess(event)
    chunks = multimodal_chunk(doc)
    vectors = embed(chunks)
    for c in vectors:
        vector_db.upsert(
            id=c["chunk_id"],
            embeddings=[c["emb_text"], c["emb_image"]],
            metadata=c,
        )
```

## Indexing at Multimillion Scale

Once your embeddings exceed 100M vectors, brute force dies. Our final setup (as of now; there is always room for improvement, right? ;)) used:

- **Text → HNSW.** Best for low latency and high recall.
- **Images → a separate CLIP/SigLIP index.** Images benefit from independent tuning.
- **Metadata → a database.** You need fast filters (modality, timestamp, rights, price).
- **Compression (PQ).** Without compression, your memory footprint becomes financially painful. PQ gave us ~5–8× memory reduction with acceptable recall loss.

## Retrieval: Where Latency Battles Happen

Here’s the truth: multimodal RAG retrieval must be hybrid, or it will fail on real users. Our pipeline:

1. **Query understanding.** Detect whether the user is referring to an image, a product, a concept, or a how-to instruction.
2. **Hybrid retrieval.** BM25, text embeddings, and image embeddings (if needed, and hopefully not).
3. **Merge results with RRF (Reciprocal Rank Fusion).** RRF saved us countless times: simple but extremely effective. (A minimal sketch appears after the lessons below.)
4. **Reranking.** Cross-modal rerankers are essential.
5. **Package context.** The biggest trick: prioritize multimodal bundles, not raw text blobs. This improves grounding and reduces hallucinations dramatically.

## Evaluation: Where “It Works???” Turns Into “It Really Works”

Evaluation for multimodal RAG is trickier than it sounds.

### Offline metrics

- Recall@k
- NDCG for mixed-modality queries
- Image grounding accuracy

### Manual review

We asked reviewers a single question: “Did the model actually use the images in its response?” Surprisingly often, RAG retrieved images but the LLM ignored them because the prompt was too text-heavy.

### Online metrics

- Click-through rate
- Time to first token
- Conversation success rate

These metrics helped us catch LLM misses long before users complained.

## Lessons We Learned the Hard Way

- **Chunking matters more than models.** A bad chunk pollutes everything upstream and downstream.
- **Streaming ingestion is non-negotiable.** If your data changes more than once a day, batch updates will betray you.
- **Compression is not optional at 100M+ vectors.** You’ll pay for it in latency, in money, or both ;).
- **Multimodal evaluations must be multimodal.** Text-only metrics give a dangerously false sense of quality.
- **Latency is won or lost before the LLM.** Retriever ≤ 200 ms, reranker ≤ 100 ms; everything else becomes user perception.
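Since RRF does so much of the heavy lifting here, a minimal sketch of hybrid retrieval with Reciprocal Rank Fusion is below. The `bm25`, `text_index`, and `image_index` clients and their `search` signatures are placeholders (each assumed to return an ordered list of chunk IDs), and `k=60` is simply the constant commonly used for RRF.

```python
from collections import defaultdict

def rrf_merge(ranked_lists, k=60, top_n=20):
    """Reciprocal Rank Fusion: score each id by the sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def hybrid_retrieve(query, bm25, text_index, image_index,
                    embed_text, embed_image, use_images=False):
    # Run the lexical and dense retrievers independently...
    ranked_lists = [
        bm25.search(query, top_k=50),
        text_index.search(embed_text(query), top_k=50),
    ]
    if use_images:
        ranked_lists.append(image_index.search(embed_image(query), top_k=50))
    # ...then fuse by rank only, so the scores never need calibrating against each other.
    return rrf_merge(ranked_lists)
```

The fused list is what goes to the cross-modal reranker, which is also where the 200 ms / 100 ms budget above gets spent.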
## Where Multimodal RAG Is Heading

A few directions we see coming fast:

- Dynamic chunking → chunk size adapts to query intent
- Self-learning retrievers → trained from click logs
- Edge RAG → indexes pushed closer to users
- Synthetic multimodal Q&A data → boosts reranker accuracy
- Cross-modal governance → track where each image originated

## Conclusion

Multimodal RAG isn’t just “RAG but with images.” It’s an architectural shift, one that forces you to rethink ingestion, chunking, indexing, retrieval, and evaluation from the ground up, and then iterate until it actually works.

If you’re scaling RAG beyond prototypes, expect to spend more time fixing pipelines, chunkers, and ingestion code than tuning LLMs. But once it works, the experience is transformative: grounded answers, real-time updates, and retrieval that understands both language and vision.

RAG is no longer a toy. It’s becoming enterprise infrastructure, and scaling it is where the real engineering begins.