Retrieval-Augmented Generation (RAG) enhances LLMs by retrieving relevant document snippets to improve responses. With rlama, you can build a fully local, offline RAG system: no cloud services, no dependencies, and complete data privacy. While rlama supports both large and small LLMs, it is especially optimized for smaller models without sacrificing flexibility for larger ones.
Why rlama?
In RAG, a knowledge store is queried to retrieve pertinent documents that are added to the LLM prompt. This helps ground the model's output in factual, up-to-date data. Traditional RAG setups require multiple components (document loaders, text splitters, vector databases, etc.), but rlama streamlines the entire process with a single CLI tool.
It handles document ingestion, chunking, embedding, vector storage, and retrieval end to end. This local-first approach ensures privacy, speed, and ease of management.
Installing rlama
Ensure you have Ollama installed. Then, run:
curl -fsSL https://raw.githubusercontent.com/dontizi/rlama/main/install.sh | sh
Verify the installation:
rlama --version
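Since rlama relies on Ollama to run models locally, it can also help to make sure the model you plan to index with is available before you start. The commands below use the standard Ollama CLI; deepseek-r1:8b is simply the example model used later in this article:
# Pull the example model and confirm it is available locally
ollama pull deepseek-r1:8b
ollama list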
Index your documents by creating a RAG store (hybrid vector store):
rlama rag <model> <rag-name> <folder-path>
For example, using a model like deepseek-r1:8b:
rlama rag deepseek-r1:8b mydocs ./docs
This command chunks the documents in ./docs, generates embeddings with the chosen model, and stores the resulting index locally (in ~/.rlama/mydocs).
Keep your index updated with the following commands (a combined example follows the list):
Add Documents:
rlama add-docs mydocs ./new_docs --exclude-ext=.log
List Documents:
rlama list-docs mydocs
Inspect Chunks:
rlama list-chunks mydocs --document=filename
Update Model:
rlama update-model mydocs <new-model>
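As a quick sketch of a typical maintenance pass, these commands can be chained; the ./new_docs folder comes from the example above, and README.md is a placeholder standing in for one of your own files:
# Add newly written docs, skipping log files
rlama add-docs mydocs ./new_docs --exclude-ext=.log
# Confirm the new files were picked up
rlama list-docs mydocs
# Spot-check how one of them was chunked
rlama list-chunks mydocs --document=README.md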
The --context-size flag controls how many chunks are retrieved per query (default is 20). For concise queries, 5-10 chunks might be sufficient, while broader questions might require 30 or more. Ensure the total token count (chunks + query) stays within your LLM's limit.
Although rlama primarily uses dense vector search, it also stores the original text to support textual queries. This means you get both semantic matching and the ability to reference specific text snippets.
Launch an interactive session:
rlama run mydocs --context-size=20
In the session, type your question:
> How do I install the project?
rlama retrieves the most relevant chunks from your indexed documents and uses them to generate a grounded answer.
You can exit the session by typing exit.
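For instance, if a broad question comes back incomplete, you can exit and relaunch the session with a larger context window, or shrink it for quick, focused questions. The sizes below are only illustrative, following the --context-size guidance above:
# Broad, exploratory question: retrieve more chunks per query
rlama run mydocs --context-size=30
# Narrow, focused question: a smaller context keeps prompts short and responses fast
rlama run mydocs --context-size=5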
rlama API
Start the API server for programmatic access:
rlama api --port 11249
Send HTTP queries:
curl -X POST http://localhost:11249/rag \
-H "Content-Type: application/json" \
-d '{
"rag_name": "mydocs",
"prompt": "How do I install the project?",
"context_size": 20
}'
The API returns a JSON response with the generated answer and diagnostic details.
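For scripting, one option is to pipe the response through jq. This is only a sketch: the answer field name below is an assumption for illustration, so inspect the raw JSON returned by your rlama version to see how the generated text is actually keyed.
# Start the API server in the background, then query it and extract the generated text
rlama api --port 11249 &
sleep 2
curl -s -X POST http://localhost:11249/rag \
  -H "Content-Type: application/json" \
  -d '{"rag_name": "mydocs", "prompt": "How do I install the project?", "context_size": 20}' \
  | jq -r '.answer'   # assumed field name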
Each indexed document now carries a Metadata field for extra context, enhancing retrieval accuracy.
I compared the new version with v0.1.25 using deepseek-r1:8b with the prompt:
"list me all the routers in the code" (as simple and general as possible to verify accurate retrieval)
With v0.1.25, the answer mentioned only CoursRouter, which is responsible for course-related routes, adding that additional routers for authentication and other functionalities may also exist. (Source: src/routes/coursRouter.ts)
With the new version, the answer listed sgaRouter, coursRouter, questionsRouter, and devoirsRouter. (Source: src/routes/sgaRouter.ts)
Retrieval Speed: Adjust context_size to balance speed and accuracy.
Retrieval Accuracy: If answers seem incomplete or off-target, try switching to a different model with rlama update-model.
Local Performance: Everything runs on your machine, so response times depend on your hardware and the size of the model you choose.
rlama simplifies building local RAG systems with a focus on confidentiality, performance, and ease of use. Whether you're using a small LLM for quick responses or a larger one for in-depth analysis, rlama offers a powerful, flexible solution. With its enhanced hybrid store, improved document metadata, and upgraded RagSystem, it's now even better at retrieving and presenting accurate answers from your data. Happy indexing and querying!