Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant document snippets to improve responses. With rlama, you can build a fully local, offline RAG system: no cloud services, no external dependencies, and complete data privacy. While rlama supports both large and small LLMs, it is especially optimized for smaller models without sacrificing flexibility for larger ones.

## Introduction to RAG and rlama

In RAG, a knowledge store is queried to retrieve pertinent documents that are added to the LLM prompt. This grounds the model's output in factual, up-to-date data. Traditional RAG setups require multiple components (document loaders, text splitters, vector databases, etc.), but rlama streamlines the entire process with a single CLI tool. It handles:

- Document ingestion and chunking.
- Embedding generation via local models (using Ollama).
- Storage in a hybrid vector store that supports both semantic and textual queries.
- Querying to retrieve context and generate answers.

This local-first approach ensures privacy, speed, and ease of management.

## Step-by-Step Guide to Implementing RAG with rlama

### 1. Installation

Ensure you have Ollama installed. Then, run:

```bash
curl -fsSL https://raw.githubusercontent.com/dontizi/rlama/main/install.sh | sh
```

Verify the installation:

```bash
rlama --version
```

### 2. Creating a RAG System

Index your documents by creating a RAG store (hybrid vector store):

```bash
rlama rag <model> <rag-name> <folder-path>
```

For example, using a model like `deepseek-r1:8b`:

```bash
rlama rag deepseek-r1:8b mydocs ./docs
```

This command:

- Scans your specified folder (recursively) for supported files.
- Converts documents to plain text and splits them into chunks (default: moderate size with overlap).
- Generates embeddings for each chunk using the specified model.
- Stores chunks and metadata in a local hybrid vector store (in `~/.rlama/mydocs`).

### 3. Managing Documents

Keep your index updated:

- **Add documents:** `rlama add-docs mydocs ./new_docs --exclude-ext=.log`
- **List documents:** `rlama list-docs mydocs`
- **Inspect chunks:** `rlama list-chunks mydocs --document=filename`
- **Update model:** `rlama update-model mydocs <new-model>`

### 4. Configuring Chunking and Retrieval

- **Chunk size & overlap:** Chunks are pieces of text (e.g., ~300-500 tokens) that enable precise retrieval. Smaller chunks yield higher precision; larger ones preserve context. Overlapping (about 10-20% of chunk size) ensures continuity. A sketch of this logic follows this list.
- **Context size:** The `--context-size` flag controls how many chunks are retrieved per query (default: 20). For concise queries, 5-10 chunks may be sufficient, while broader questions may require 30 or more. Ensure the total token count (chunks + query) stays within your LLM's limit.
- **Hybrid retrieval:** While rlama primarily uses dense vector search, it stores the original text to support textual queries. This means you get both semantic matching and the ability to reference specific text snippets.
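To make the chunk-size and overlap trade-off concrete, here is a minimal, hypothetical Go sketch of fixed-size chunking with overlap. It approximates tokens as whitespace-separated words for simplicity; rlama's actual chunker and tokenizer may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// chunkText splits text into chunks of roughly chunkSize tokens, with each
// chunk overlapping the previous one by overlap tokens. Tokens are
// approximated as whitespace-separated words; a real implementation would
// use the embedding model's tokenizer.
func chunkText(text string, chunkSize, overlap int) []string {
	words := strings.Fields(text)
	var chunks []string
	step := chunkSize - overlap // advance by chunkSize-overlap each iteration
	if step <= 0 {
		step = chunkSize
	}
	for start := 0; start < len(words); start += step {
		end := start + chunkSize
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break
		}
	}
	return chunks
}

func main() {
	text := strings.Repeat("lorem ipsum dolor sit amet ", 100)
	// 400-token chunks with a 60-token overlap (15%, inside the 10-20% range).
	for i, c := range chunkText(text, 400, 60) {
		fmt.Printf("chunk %d: %d tokens\n", i, len(strings.Fields(c)))
	}
}
```

With these numbers, consecutive chunks share 60 tokens, so a sentence that straddles a chunk boundary still appears intact in at least one chunk.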
### 5. Running Queries

Launch an interactive session:

```bash
rlama run mydocs --context-size=20
```

In the session, type your question:

```
> How do I install the project?
```

rlama then:

- Converts your question into an embedding.
- Retrieves the top matching chunks from the hybrid store.
- Uses the local LLM (via Ollama) to generate an answer from the retrieved context.

You can exit the session by typing `exit`.

### 6. Using the rlama API

Start the API server for programmatic access:

```bash
rlama api --port 11249
```

Send HTTP queries:

```bash
curl -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{
    "rag_name": "mydocs",
    "prompt": "How do I install the project?",
    "context_size": 20
  }'
```

The API returns a JSON response with the generated answer and diagnostic details.

## Recent Enhancements and Tests

### EnhancedHybridStore

- **Improved document management:** Replaces the traditional vector store.
- **Hybrid searches:** Supports both vector embeddings and textual queries.
- **Simplified retrieval:** Quickly finds relevant documents based on user input.

### Document Struct Update

- **Metadata field:** Each document chunk now includes a `Metadata` field for extra context, enhancing retrieval accuracy.
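As an illustration, a chunk record in the hybrid store might look like the hypothetical Go sketch below. Only the `Metadata` field is confirmed by the notes above; the other field names and types are assumptions, not rlama's actual definitions.

```go
package main

import "fmt"

// DocumentChunk is a hypothetical sketch of a chunk record in the hybrid
// store. Only the Metadata field is confirmed by the release notes; the
// remaining fields are illustrative assumptions.
type DocumentChunk struct {
	ID        string            // unique chunk identifier
	Source    string            // originating file path
	Content   string            // raw chunk text, kept for textual queries
	Embedding []float32         // dense vector for semantic search
	Metadata  map[string]string // extra context (e.g., section, page)
}

func main() {
	chunk := DocumentChunk{
		ID:       "mydocs-0001",
		Source:   "docs/install.md",
		Content:  "Run the install script to set up rlama...",
		Metadata: map[string]string{"section": "Installation"},
	}
	fmt.Printf("%s from %s (%d metadata keys)\n",
		chunk.ID, chunk.Source, len(chunk.Metadata))
}
```

Keeping the raw text alongside the embedding is what enables the hybrid behavior described earlier: semantic search over the vector, plus direct reference to the original snippet.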
### RagSystem Upgrade

- **Hybrid store integration:** All documents are now fully indexed and retrievable, resolving previous limitations.

### Router Retrieval Testing

I compared the new version with v0.1.25 using `deepseek-r1:8b` and the prompt "list me all the routers in the code" (as simple and general as possible, to verify accurate retrieval):

- **Published version on GitHub:** *The code contains at least one router, CoursRouter, which is responsible for course-related routes. Additional routers for authentication and other functionalities may also exist.* (Source: src/routes/coursRouter.ts)
- **New version:** *There are four routers: sgaRouter, coursRouter, questionsRouter, and devoirsRouter.* (Source: src/routes/sgaRouter.ts)

## Optimizations and Performance Tuning

**Retrieval speed:**

- Adjust `context_size` to balance speed and accuracy.
- Use smaller models for faster embedding, or a dedicated embedding model if needed.
- Exclude irrelevant files during indexing to keep the index lean.

**Retrieval accuracy:**

- Fine-tune chunk size and overlap. Moderate sizes (300-500 tokens) with 10-20% overlap work well.
- Use the best-suited model for your data; switch models easily with `rlama update-model`.
- Experiment with prompt tweaks if the LLM occasionally produces off-topic answers.

**Local performance:**

- Ensure your hardware (RAM/CPU/GPU) is sufficient for the chosen model.
- Leverage SSDs for faster storage and multithreading for improved inference.
- For batch queries, use the persistent API mode rather than restarting CLI sessions, as in the sketch below.
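For example, here is a minimal Go sketch that sends a batch of prompts to a running `rlama api` server over a single persistent HTTP client. The endpoint and request fields mirror the curl example above; the response is decoded loosely into a generic map, since the exact response schema is an assumption here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// ragRequest mirrors the JSON body from the curl example above.
type ragRequest struct {
	RagName     string `json:"rag_name"`
	Prompt      string `json:"prompt"`
	ContextSize int    `json:"context_size"`
}

func main() {
	prompts := []string{
		"How do I install the project?",
		"How do I add new documents?",
	}
	client := &http.Client{} // one persistent client for the whole batch
	for _, p := range prompts {
		body, err := json.Marshal(ragRequest{
			RagName:     "mydocs",
			Prompt:      p,
			ContextSize: 20,
		})
		if err != nil {
			log.Fatal(err)
		}
		resp, err := client.Post("http://localhost:11249/rag",
			"application/json", bytes.NewReader(body))
		if err != nil {
			log.Fatal(err)
		}
		// Decode loosely, since the exact response schema may vary.
		var answer map[string]any
		if err := json.NewDecoder(resp.Body).Decode(&answer); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		fmt.Printf("%s -> %v\n", p, answer)
	}
}
```

Reusing one process and one HTTP client avoids reloading the model state on every question, which is the point of the persistent API mode for batch workloads.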
## Next Steps

- **Optimize chunking:** Focus on enhancing the chunking process to achieve an optimal RAG, even when using small models.
- **Monitor performance:** Continue testing different models and configurations to find the best balance for your data and hardware.
- **Explore future features:** Stay tuned for upcoming hybrid retrieval enhancements and adaptive chunking features.

## Conclusion

rlama simplifies building local RAG systems with a focus on confidentiality, performance, and ease of use. Whether you're using a small LLM for quick responses or a larger one for in-depth analysis, rlama offers a powerful, flexible solution. With its enhanced hybrid store, improved document metadata, and upgraded RagSystem, it's now even better at retrieving and presenting accurate answers from your data. Happy indexing and querying!

- GitHub repo: https://github.com/DonTizi/rlama
- Website: https://rlama.dev/
- X: https://x.com/LeDonTizi/status/1898233014213136591