Keeping your sensitive documents safe while leveraging the power of Large Language Models: A complete guide to building a private, offline RAG system

As the adoption of Artificial Intelligence (AI) increases across industries, the critical need for control, safety, and privacy has never been more apparent. In this post, I explore how to build a complete Retrieval-Augmented Generation (RAG) system that runs entirely offline with Ollama, keeping the full power of Large Language Models (LLMs) while keeping sensitive information exactly where it belongs: on your machine.

Cloud-based AI services come with convenience, but there is a fundamental trade-off: your data has to leave your control. Why does this matter? Consider the reality many professionals face: legal documents requiring confidentiality, medical records protected by privacy laws, proprietary research that cannot be shared, or internal company communications that must stay internal. You put them at risk whenever you upload these documents to online RAG systems. Every API call, every embedding request, and every query sends your sensitive content to servers you can't verify for storage, access rights, or actual deletion.

A RAG system answers questions by retrieving relevant information from your documents and using that context to generate accurate responses. The traditional approach relies on cloud services for three critical operations: converting documents to embeddings, storing those embeddings in vector databases, and generating answers through API calls. Each step exposes your data. This hands-on tutorial demonstrates that none of these compromises is necessary.

What is RAG?

Retrieval-Augmented Generation combines three essential processes. Retrieval finds relevant information from your documents through semantic search. Augmented means using that retrieved information to enhance the AI's knowledge beyond its training data. Generation creates accurate, cited answers based specifically on your documents rather than the model's general knowledge.

RAG represents a breakthrough in making large language models more accurate and trustworthy. Unlike standard chatbots that rely solely on training data, RAG systems retrieve relevant information from your specific documents before generating answers. This ensures responses are grounded in your actual content rather than the model's general knowledge. Most implementations, however, depend on online services for document embedding, model inference, and vector storage, which introduces privacy risks, dependency on external APIs, and higher costs.

This post eliminates all external dependencies. Using Ollama for both embeddings and language generation, combined with FAISS for vector storage, this system ensures your documents never leave your computer. After the initial setup, which requires internet access to download the models, the entire system operates offline. No API keys, no usage costs, no privacy concerns, and no internet required. You maintain complete control while preserving the full capability of modern LLMs.

1.1 How does it work?

Traditional LLMs answer from memory based on their training data, while RAG systems retrieve information from your documents first, then generate answers based on that retrieved content. The process has two phases: first, documents are loaded, split into chunks, converted into numerical vectors (embeddings), and stored in a database; second, when you ask a question, it gets converted into a vector, the system finds chunks with similar vectors, and the LLM uses those chunks to generate an answer with citations.
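To make the retrieval phase concrete, here is a tiny, self-contained sketch that uses made-up 3-dimensional vectors in place of real embeddings (the actual system built below uses 768-dimensional vectors from nomic-embed-text). The texts and numbers are illustrative only; the point is that "finding similar chunks" is just ranking vectors by cosine similarity.

```python
# Toy illustration of the retrieval idea with made-up 3-dimensional vectors.
# The real system uses 768-dimensional embeddings from nomic-embed-text.
import numpy as np

chunks = {
    "FLoRA proposes fused forward-backward adapters.": np.array([0.9, 0.1, 0.2]),
    "The experiments were run on a single GPU.":        np.array([0.1, 0.8, 0.3]),
    "Cooking pasta takes about ten minutes.":           np.array([0.0, 0.2, 0.9]),
}
query = np.array([0.85, 0.15, 0.25])  # pretend embedding of "What does FLoRA propose?"

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# rank chunks by similarity to the query vector (higher = more relevant)
ranked = sorted(chunks.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
for text, vec in ranked:
    print(f"{cosine(query, vec):.3f}  {text}")
```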
Offline Approach

The system uses five components that run entirely on your local machine. The Document Loader extracts text from files while preserving page numbers for citations. The Text Chunker splits documents into segments to maintain context. The Embedder uses Ollama with nomic-embed-text to convert text into vectors that capture semantic meaning. The Vector Database uses FAISS with cosine similarity to store vectors and find semantically similar chunks in milliseconds. Finally, the LLM uses Llama 3.2 to read the retrieved chunks and generate factual answers.

1.2 Prerequisites

System Requirements:
- Python 3.8 or higher
- 8GB RAM minimum (16GB recommended)
- 10GB free disk space (for models)
- Windows, macOS, or Linux

Ollama Requirements:
- Download package size: 1GB
- Windows: Windows 10 or later
- macOS: macOS 14 Sonoma or later

Installation

Step 1: Install Ollama

Download from https://ollama.com/download, or for Linux, run:

```
curl -fsSL https://ollama.com/install.sh | sh
```

Verify the installation:

```
ollama --version
```

Step 2: Download Models

I will be using Llama 3.2 for this tutorial because the model size is 2GB, making it suitable for basic local machine use. Other models can also be used.

Download the LLM (takes approximately 4 minutes, 2GB):

```
ollama pull llama3.2
```

Download the embedding model (takes about 2 minutes, 274MB):

```
ollama pull nomic-embed-text
```

Step 3: Prepare Documents

Create a `documents` folder and add your PDF, Markdown, or HTML files. For this tutorial, I will be using "FLoRA: Fused forward-backward adapters for parameter-efficient fine-tuning and reducing inference-time latencies of LLMs", a 10-page paper that addresses parameter-efficient fine-tuning for large language models (LLMs) by proposing a family of fused forward-backward adapters (FFBA). The PDF file can be downloaded from https://arxiv.org/pdf/2511.00050.

Step 4: Install Python Packages

Create a requirements.txt file with the following packages:

```
faiss-cpu
numpy
PyPDF2
beautifulsoup4
markdown
```

Why these packages? `faiss-cpu` is a fast vector search library that works offline, `numpy` is the industry standard for arrays, `PyPDF2` handles PDF text extraction, `beautifulsoup4` parses HTML, and `markdown` converts Markdown to text.

Install all packages:

```
pip install -r requirements.txt
```

Once the prerequisites have been completed, internet connectivity is no longer required. The system operates entirely offline and locally on your machine. All components run without any external network calls or cloud services.
2.0 Implementation

This section provides the complete step-by-step implementation guide to build the offline RAG system from scratch. The implementation covers all essential components, including configuring Ollama with the necessary models and a complete code walkthrough for each component. Each code section includes detailed explanations of design decisions, parameter choices, and the reasoning behind specific implementations that emerged from testing with the FLoRA research paper. The guide is structured to be followed sequentially, ensuring that by the end, you will have a fully functional RAG system capable of answering questions from your documents with accurate citations, all running completely offline on your local machine.

2.1 Import Libraries

```python
import os
import json
import subprocess
import re
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass

# for document processing
import PyPDF2
from bs4 import BeautifulSoup
import markdown

# for vector operations
import numpy as np
import faiss

print("all libraries successfully imported")
```

2.2 Dataclass setup

The Chunk dataclass is a structured container that holds everything needed to represent a piece of a document in the RAG system. Each chunk consists of four essential components: an id for unique identification (like "flora.pdf_0" for the first chunk), the actual text content extracted from the document, a vector that stores the 768-dimensional numerical embedding (initially None until processed by the embedder), and metadata containing source information such as filename, page number, and chunk index for accurate citations.

```python
@dataclass
class Chunk:
    """Text chunk with metadata and embedding."""
    id: str
    text: str
    vector: Optional[np.ndarray]
    metadata: Dict
```
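To see what the rest of the pipeline passes around, here is an illustrative chunk built by hand. In practice the chunker creates these objects and the embedder fills in the vector; the values below are made up for demonstration.

```python
# Hand-built example of a Chunk; in practice these are produced by the chunker
# and the vector is filled in later by the embedder (values here are made up).
example = Chunk(
    id="flora.pdf_0",                      # source filename + chunk index
    text="FLoRA proposes fused forward-backward adapters...",
    vector=None,                           # becomes a 768-dim np.ndarray after embedding
    metadata={"source": "flora.pdf", "page": 1, "type": "pdf", "chunk_index": 0},
)
print(example.id, example.metadata["page"])
```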
2.3 Document loading

The DocumentLoader class handles reading multiple file formats and extracting text while preserving source information, with page-level tracking to enable precise citations. It uses static methods because document loading does not require instance state, making the class a clean organizational tool for related functions. The class supports three formats: PDF files are processed page by page using PyPDF2 to enable page-level citations; Markdown files are converted to HTML and then to plain text using the markdown and BeautifulSoup libraries to cleanly remove formatting syntax; HTML files have scripts and styles stripped using BeautifulSoup before text extraction. Each loader method returns a list of dictionaries containing the extracted text and metadata (source filename, page number, and document type), with the main load_documents method automatically detecting file types by extension and routing them to the appropriate loader.

```python
class DocumentLoader:
    """load PDF, Markdown, and HTML documents."""

    @staticmethod
    def load_pdf(file_path: str) -> List[Dict]:
        """extract text from PDF, page by page for citations."""
        chunks = []
        try:
            with open(file_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                for page_num, page in enumerate(pdf_reader.pages):
                    text = page.extract_text()
                    if text.strip():
                        chunks.append({
                            'text': text,
                            'metadata': {
                                'source': os.path.basename(file_path),
                                'page': page_num + 1,
                                'type': 'pdf'
                            }
                        })
        except Exception as e:
            print(f"error loading PDF {file_path}: {e}")
        return chunks

    @staticmethod
    def load_markdown(file_path: str) -> List[Dict]:
        """convert markdown to text via HTML."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                md_content = file.read()
            html = markdown.markdown(md_content)
            soup = BeautifulSoup(html, 'html.parser')
            text = soup.get_text()
            return [{
                'text': text,
                'metadata': {
                    'source': os.path.basename(file_path),
                    'page': 1,
                    'type': 'markdown'
                }
            }]
        except Exception as e:
            print(f"error loading markdown {file_path}: {e}")
            return []

    @staticmethod
    def load_html(file_path: str) -> List[Dict]:
        """extract text from HTML, removing scripts and styles."""
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                soup = BeautifulSoup(file.read(), 'html.parser')
            for script in soup(["script", "style"]):
                script.decompose()
            text = soup.get_text()
            return [{
                'text': text,
                'metadata': {
                    'source': os.path.basename(file_path),
                    'page': 1,
                    'type': 'html'
                }
            }]
        except Exception as e:
            print(f"error loading HTML {file_path}: {e}")
            return []

    @staticmethod
    def load_documents(directory: str) -> List[Dict]:
        """load all supported documents from a directory."""
        documents = []
        doc_dir = Path(directory)

        if not doc_dir.exists():
            print(f"Creating {directory}...")
            doc_dir.mkdir(parents=True)
            print(f"add documents to {directory} and run again.")
            return documents

        for file_path in doc_dir.rglob('*'):
            if file_path.is_file():
                ext = file_path.suffix.lower()
                if ext == '.pdf':
                    documents.extend(DocumentLoader.load_pdf(str(file_path)))
                elif ext in ['.md', '.markdown']:
                    documents.extend(DocumentLoader.load_markdown(str(file_path)))
                elif ext in ['.html', '.htm']:
                    documents.extend(DocumentLoader.load_html(str(file_path)))

        print(f"loaded {len(documents)} document sections")
        return documents

print("document loader ready!")
```
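A quick way to sanity-check the loader on its own, assuming the `documents` folder contains at least one supported file:

```python
# Load everything in the documents folder and inspect the first section.
docs = DocumentLoader.load_documents("documents")
if docs:
    first = docs[0]
    print(first['metadata'])    # e.g. {'source': 'flora.pdf', 'page': 1, 'type': 'pdf'}
    print(first['text'][:200])  # first 200 characters of the extracted text
```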
2.4 Text Chunking

Chunking is the process of splitting text data into smaller segments. It is required because embedding models have token limits and cannot process an entire document at once; embedding overly long passages also loses context and fails to capture complete concepts. Overlapping prevents information loss at boundaries. For example, consider "…safety protocol. First, wear PPE…". Without overlap, "First, wear PPE" loses its context; with overlap, the previous chunk's ending ("safety protocol") is carried into the next chunk. Sentence boundary detection breaks text at periods rather than mid-sentence, which helps the LLM comprehend each chunk.

For this solution, given the length of the test document, the chunk size was set to 750 characters and the overlap to 100. One hundred characters represent approximately 15-20 words, enough to capture sentence endings and beginnings; this prevents information loss at boundaries while limiting redundancy. Note that chunk size and overlap can be modified depending on the task at hand: smaller chunks work well for fact-finding, while larger chunks are better for broader comprehension.

The `TextChunker` class uses static methods for stateless text processing operations. The `clean_text` method normalizes the input by converting multiple spaces to single spaces and removing special characters while preserving the punctuation needed for sentence detection. The `chunk_text` method implements the core algorithm: it starts at position 0 and calculates a tentative end position by adding `chunk_size`, then searches the last 20 percent of the chunk for sentence endings (periods, exclamation points, question marks) using `rfind` to avoid mid-sentence breaks, adjusts the end position to the sentence boundary if one is found, extracts the chunk text and creates a Chunk object with a unique ID combining the source filename and index, then moves the start position forward by `chunk_size` minus the overlap to create the next chunk. The algorithm continues until reaching the end of the text, with each iteration adding chunk metadata, including the index for tracking and the source information for citations.
For the FLoRA paper, this produced 61 chunks, averaging 6.1 per page, with each chunk maintaining its connection to the original page number for precise citations, like "Source 5, Page 1" in query responses.

```python
class TextChunker:
    """text chunking with overlap and sentence boundaries."""

    @staticmethod
    def clean_text(text: str) -> str:
        """normalize whitespace and remove special characters."""
        text = re.sub(r'\s+', ' ', text)  # multiple spaces to single space
        text = re.sub(r'[^\w\s\.\,\!\?\-\:\;]', '', text)  # keep punctuation
        return text.strip()

    @staticmethod
    def chunk_text(
        text: str,
        chunk_size: int = 750,
        overlap: int = 100,
        metadata: Dict = None
    ) -> List[Chunk]:
        """split text into overlapping chunks at sentence boundaries.

        Args:
            chunk_size: target size (≈150 tokens for embeddings)
            overlap: overlap size to preserve context
            metadata: source info for citations
        """
        text = TextChunker.clean_text(text)
        chunks = []

        if not text:
            return chunks

        start = 0
        chunk_index = 0

        while start < len(text):
            end = start + chunk_size

            # break at sentence boundary (last 20% of chunk)
            if end < len(text):
                search_start = end - int(chunk_size * 0.2)
                sentence_end = max(
                    text.rfind('.', search_start, end),
                    text.rfind('!', search_start, end),
                    text.rfind('?', search_start, end)
                )
                if sentence_end != -1 and sentence_end > start:
                    end = sentence_end + 1

            chunk_text = text[start:end].strip()

            if chunk_text:
                chunk_metadata = metadata.copy() if metadata else {}
                chunk_metadata['chunk_index'] = chunk_index

                chunk_id = f"{chunk_metadata.get('source', 'unknown')}_{chunk_index}"

                chunks.append(Chunk(
                    id=chunk_id,
                    text=chunk_text,
                    vector=None,
                    metadata=chunk_metadata
                ))
                chunk_index += 1

            start = end - overlap  # move with overlap

            if start >= len(text) - overlap:
                break

        return chunks

print("text chunker ready!")
```
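A small experiment makes the overlap visible. The snippet below chunks a short synthetic string with a deliberately reduced chunk size so the boundaries are easy to read; the parameters are chosen only for this demo.

```python
# Demo with a deliberately small chunk size so the overlap is easy to see.
sample = ("This is the safety protocol. First, wear PPE before entering the lab. "
          "Second, check the ventilation. Third, log every experiment in the register.")
demo_chunks = TextChunker.chunk_text(sample, chunk_size=80, overlap=20,
                                     metadata={'source': 'demo.txt', 'page': 1})
for c in demo_chunks:
    print(c.id, "->", repr(c.text))
```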
2.5 Embedding

Embeddings convert text into a list of 768 numbers that capture the meaning of the text, allowing the computer to mathematically compare how similar different pieces of text are. For example, "parameter efficient fine-tuning" and "PEFT methods for LLMs" would produce similar number patterns even though the words are different, because they mean similar things. The system uses Ollama with the `nomic-embed-text` model, which runs completely on your computer without needing internet or cloud services, ensuring your documents stay private. The model is lightweight at 274MB and provides good accuracy for general text, taking about 2-3 minutes to process 61 chunks on a regular CPU. The embedder shows progress every 10 chunks so you can see it is working during the initial setup.

Code:
```python
class OllamaEmbedder:
    """Generate embeddings using Ollama's embedding model."""

    def __init__(self, model_name: str = "nomic-embed-text"):
        self.model_name = model_name
        self._verify_model()

    def _verify_model(self):
        """Check if model is available locally."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found embedding model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )

    def embed_text(self, text: str) -> np.ndarray:
        """generate embedding vector for text using HTTP API."""
        try:
            import http.client

            conn = http.client.HTTPConnection("localhost", 11434, timeout=30)
            headers = {'Content-Type': 'application/json'}
            payload = json.dumps({
                "model": self.model_name,
                "prompt": text
            })

            conn.request("POST", "/api/embeddings", payload, headers)
            response = conn.getresponse()
            data = json.loads(response.read().decode())

            return np.array(data['embedding'], dtype=np.float32)
        except Exception as e:
            print(f"Embedding error: {e}")
            return np.zeros(768, dtype=np.float32)  # Fallback

    def embed_chunks(self, chunks: List[Chunk]) -> List[Chunk]:
        """generate embeddings for all chunks with progress."""
        print(f"Generating embeddings for {len(chunks)} chunks...")

        for i, chunk in enumerate(chunks):
            if i % 10 == 0 and i > 0:
                print(f"  progress: {i}/{len(chunks)}")
            chunk.vector = self.embed_text(chunk.text)

        print("embeddings complete!")
        return chunks
```

The `OllamaEmbedder` class manages the embedding process through three main methods. When initialized, the `_verify_model` method checks whether `nomic-embed-text` is installed by running the `ollama list` command and notifies the user if it is missing. The `embed_text` method is the core function that converts text to numbers by sending an HTTP request to localhost port 11434, where Ollama runs as a background service, and receiving back a list of 768 numbers (the embedding) that represent the text's meaning, with a fallback that returns zeros if something goes wrong. The `embed_chunks` method processes all document chunks by calling `embed_text` for each one, displaying progress every 10 chunks, and storing the resulting vector in each chunk's vector field.
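To verify the claim that semantically related phrases land close together, you can embed the two example phrases from above and compare them directly. This requires the Ollama service running locally with nomic-embed-text already pulled, and the exact scores will vary from run to run.

```python
# Compare two related phrases and one unrelated phrase (requires Ollama running
# locally with nomic-embed-text already pulled; exact scores will vary).
embedder = OllamaEmbedder()

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = embedder.embed_text("parameter efficient fine-tuning")
b = embedder.embed_text("PEFT methods for LLMs")
c = embedder.embed_text("recipe for tomato soup")

print("related:  ", round(cosine(a, b), 3))   # expected to be noticeably higher
print("unrelated:", round(cosine(a, c), 3))
```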
2.6 Vector Database setup

A vector database stores the numerical embeddings and enables fast searching to find similar chunks when you ask a question. The system uses FAISS (Facebook AI Similarity Search), a specialized library designed to quickly search through millions of vectors. The critical design choice was using cosine similarity instead of Euclidean distance. The database stores vectors in a binary index file for speed and chunk metadata in a JSON file for readability, and it can be saved to disk after the initial setup so subsequent runs load the existing index instead of rebuilding it from scratch. For the FLoRA paper, the database stored 61 chunk vectors and successfully retrieved the 5 most relevant chunks when queried, with a distance threshold of 0.6 acting as a quality filter to exclude irrelevant content.

```python
class VectorDatabase:
    """FAISS-based vector storage and retrieval with cosine similarity."""

    def __init__(self, dimension: int = 768):
        self.dimension = dimension
        # use IndexFlatIP for cosine similarity (inner product on normalized vectors)
        self.index = faiss.IndexFlatIP(dimension)
        self.chunks: List[Chunk] = []

    def add_chunks(self, chunks: List[Chunk]):
        """add chunk embeddings to the index."""
        vectors = np.array([chunk.vector for chunk in chunks], dtype=np.float32)
        # normalize vectors for cosine similarity
        faiss.normalize_L2(vectors)
        self.index.add(vectors)
        self.chunks.extend(chunks)
        print(f"added {len(chunks)} chunks (total: {len(self.chunks)})")

    def search(self, query_vector: np.ndarray, top_k: int = 5) -> List[Tuple[Chunk, float]]:
        """find top-k most similar chunks using cosine similarity.

        Returns:
            List of (chunk, distance) tuples
            Distance is (1 - cosine_similarity), so lower = more similar
        """
        query_vector = query_vector.reshape(1, -1).astype(np.float32)
        # normalize query vector for cosine similarity
        faiss.normalize_L2(query_vector)

        # search (returns similarity scores, not distances)
        similarities, indices = self.index.search(query_vector, top_k)

        results = []
        for idx, similarity in zip(indices[0], similarities[0]):
            # FAISS returns -1 when fewer than top_k vectors exist, so guard both ends
            if 0 <= idx < len(self.chunks):
                # convert similarity to distance: distance = 1 - similarity
                distance = 1 - similarity
                results.append((self.chunks[idx], float(distance)))
        return results

    def save(self, directory: str):
        """persist database to disk."""
        os.makedirs(directory, exist_ok=True)

        # save FAISS index
        faiss.write_index(self.index, os.path.join(directory, 'faiss.index'))

        # save chunks metadata (JSON)
        chunks_data = [{
            'id': chunk.id,
            'text': chunk.text,
            'metadata': chunk.metadata
        } for chunk in self.chunks]

        with open(os.path.join(directory, 'chunks.json'), 'w', encoding='utf-8') as f:
            json.dump(chunks_data, f, indent=2)

        print(f"database saved to {directory}")

    def load(self, directory: str, embedder) -> bool:
        """load database from disk."""
        index_path = os.path.join(directory, 'faiss.index')
        chunks_path = os.path.join(directory, 'chunks.json')

        if not os.path.exists(index_path) or not os.path.exists(chunks_path):
            print(f"no database found in {directory}")
            return False

        # load FAISS index
        self.index = faiss.read_index(index_path)

        # load chunks
        with open(chunks_path, 'r', encoding='utf-8') as f:
            chunks_data = json.load(f)

        # reconstruct chunks (re-embed for consistency)
        print("reconstructing chunk vectors...")
        self.chunks = []
        for data in chunks_data:
            chunk = Chunk(
                id=data['id'],
                text=data['text'],
                vector=embedder.embed_text(data['text']),
                metadata=data['metadata']
            )
            self.chunks.append(chunk)

        print(f"database loaded: {len(self.chunks)} chunks")
        return True
```

The `VectorDatabase` class initializes with the dimension set to 768 to match the embedding size and creates a FAISS IndexFlatIP index for inner-product similarity after normalization. The `add_chunks` method takes a list of chunks, extracts their vectors into a numpy array, normalizes them using `faiss.normalize_L2` (which scales each vector to unit length so the inner product equals cosine similarity), then adds them to the index and stores the chunks for later retrieval. The `search` method takes a query vector, reshapes and normalizes it the same way, calls `index.search` to find the top-k most similar vectors (which returns similarity scores), then converts those scores to distances by subtracting them from 1 so that lower numbers mean better matches, pairing each chunk object with its distance score. The `save` method writes the FAISS index to a binary file and the chunk text and metadata to a JSON file for persistence, while the `load` method reads these files back and reconstructs the chunks by re-embedding their text to ensure vector consistency.
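The reason for pairing IndexFlatIP with normalize_L2 can be checked in isolation: after normalization, the inner product FAISS returns is exactly the cosine similarity. Here is a minimal standalone check with random vectors (values are arbitrary; only the agreement between the two computations matters).

```python
# Standalone check: inner product of L2-normalized vectors == cosine similarity.
import numpy as np
import faiss

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 768)).astype(np.float32)
query = rng.normal(size=(1, 768)).astype(np.float32)

# cosine similarity computed directly with numpy
cos = (query @ vecs.T) / (np.linalg.norm(query) * np.linalg.norm(vecs, axis=1))

# same quantity via FAISS: normalize, then inner-product search
faiss.normalize_L2(vecs)
faiss.normalize_L2(query)
index = faiss.IndexFlatIP(768)
index.add(vecs)
sims, ids = index.search(query, 4)

print(np.round(cos[0], 4))           # cosine similarities in original order
print(np.round(sims[0], 4), ids[0])  # same values, sorted by similarity
```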
2.7 The Large Language Model

The LLM is the component that reads the retrieved chunks and generates natural-language answers to questions. The system uses Llama 3.2, a 2GB model that works well on regular computers without requiring a GPU. The implementation uses command-line execution through subprocess instead of the HTTP API because this proved more reliable on CPU-only systems, where HTTP requests would time out.

The temperature parameter of 0.3 is intended to control how creative or focused the answers are, with lower values producing more factual, deterministic responses that stay close to the retrieved context, which is ideal for question answering (note that the CLI invocation below does not forward this value, so it mainly documents intent). The 5-minute timeout accommodates slow CPU processing, where the first query takes 30-90 seconds to load the model into memory, but subsequent queries complete in 3-10 seconds. For the FLoRA paper, the model generated 416 characters when asked about the main topic and 283 characters for the specific problem question, demonstrating that it provides substantive, well-explained answers while remaining concise.

Code:

```python
class OllamaLLM:
    """LLM interface using Ollama CLI (more reliable on CPU)."""

    def __init__(self, model_name: str = "llama3.2"):
        self.model_name = model_name
        self._verify_model()

    def _verify_model(self):
        """Check if model is available locally."""
        try:
            result = subprocess.run(
                ['ollama', 'list'],
                capture_output=True,
                text=True,
                check=True
            )
            if self.model_name not in result.stdout:
                raise RuntimeError(
                    f"Model '{self.model_name}' not found locally.\n"
                    f"Please download it first using:\n"
                    f"  ollama pull {self.model_name}\n"
                    f"This is a one-time setup step that requires internet connection."
                )
            print(f"Found LLM model: {self.model_name}")
        except subprocess.CalledProcessError as e:
            raise RuntimeError(
                f"Cannot connect to Ollama service.\n"
                f"Please ensure Ollama is installed and running.\n"
                f"Error: {e}"
            )
        except FileNotFoundError:
            raise RuntimeError(
                "Ollama not found on your system.\n"
                "Please install Ollama from: https://ollama.com/download\n"
                "This is a one-time setup step."
            )

    def generate(self, prompt: str, temperature: float = 0.3) -> str:
        """generate response using Ollama CLI (more reliable on CPU).

        Args:
            prompt: Complete prompt with context and question
            temperature: Creativity (0.0=deterministic, 1.0=creative).
                Note: kept for interface symmetry; the CLI call below does not
                forward it, so the model runs with its default temperature.
        """
        try:
            print(f"  Generating with {self.model_name} ...")

            # Use subprocess with CLI - more reliable than HTTP on CPU
            result = subprocess.run(
                ['ollama', 'run', self.model_name],
                input=prompt,
                capture_output=True,
                text=True,
                timeout=300,  # 5 minutes timeout
                encoding='utf-8'
            )

            if result.returncode != 0:
                error_msg = result.stderr or "Unknown error"
                print(f"  Ollama error: {error_msg}")
                return f"Error: {error_msg}"

            answer = result.stdout.strip()

            if not answer:
                print("  Empty response")
                return "Error: Empty response from LLM"

            print(f"  Generated {len(answer)} characters")
            return answer

        except subprocess.TimeoutExpired:
            print("  Timeout after 5 minutes")
            return "Error: Generation timed out. Try a simpler question or smaller context."
        except Exception as e:
            error_msg = f"Error: {str(e)}"
            print(f"  {error_msg}")
            return error_msg
```

The OllamaLLM class initializes with the model name (defaulting to llama3.2) and immediately verifies the model is installed. The `_verify_model` method runs `ollama list` to check whether the model exists locally and notifies the user if it is missing. The `generate` method is the core function that takes a prompt containing the context and question, then uses `subprocess.run` to execute `ollama run llama3.2` with the prompt sent through standard input, capturing both output and errors with a 5-minute timeout to handle slow CPU inference. The method checks whether the command succeeded (returncode equals 0), extracts the answer from standard output and strips whitespace, handles errors by returning descriptive messages, and prints the character count of the generated text for monitoring. For the FLoRA paper queries, when given a prompt like "Answer based on context: [5 chunks about PEFT] Question: What problem does FLoRA address?", the model read through the provided chunks and generated a focused 283-character answer citing Source 5 from page 1, demonstrating that it grounded its response in the retrieved context rather than guessing from general knowledge.
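Before wiring it into the full pipeline, the generator can be smoke-tested on its own. This assumes Ollama is running with llama3.2 already pulled; the first call is slow while the model loads into memory.

```python
# Quick smoke test of the generator on its own (requires llama3.2 pulled;
# the first call is slow while the model loads into memory).
llm = OllamaLLM()
reply = llm.generate("Answer in one sentence: what is retrieval-augmented generation?")
print(reply)
```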
) print(f"Found LLM model: {self.model_name}") except subprocess.CalledProcessError as e: raise RuntimeError( f"Cannot connect to Ollama service.\n" f"Please ensure Ollama is installed and running.\n" f"Error: {e}" ) except FileNotFoundError: raise RuntimeError( "Ollama not found on your system.\n" "Please install Ollama from: https://ollama.com/download\n" "This is a one-time setup step." ) def generate(self, prompt: str, temperature: float = 0.3) -> str: """generate response using Ollama CLI (more reliable on CPU). Args: prompt: Complete prompt with context and question temperature: Creativity (0.0=deterministic, 1.0=creative) """ try: print(f" Generating with {self.model_name} ...") # Use subprocess with CLI - more reliable than HTTP on CPU result = subprocess.run( ['ollama', 'run', self.model_name], input=prompt, capture_output=True, text=True, timeout=300, # 5 minutes timeout encoding='utf-8' ) if result.returncode != 0: error_msg = result.stderr or "Unknown error" print(f" Ollama error: {error_msg}") return f"Error: {error_msg}" answer = result.stdout.strip() if not answer: print(f" Empty response") return "Error: Empty response from LLM" print(f" Generated {len(answer)} characters") return answer except subprocess.TimeoutExpired: print(f" Timeout after 5 minutes") return "Error: Generation timed out. Try a simpler question or smaller context." except Exception as e: error_msg = f"Error: {str(e)}" print(f" {error_msg}") return error_msg The OllamaLLM class initializes with the model name (defaulting to llama3.2) and immediately verifies the model is installed. The _verify_model method runs "ollama list" to check if the model exists locally and notifies the user if missing. The generate method is the core function that takes a prompt containing the context and question, then uses subprocess.run to execute the command "ollama run llama3.2" with the prompt sent through standard input, capturing both output and errors with a 5-minute timeout to handle slow CPU inference. The method checks if the command succeeded (returncode equals 0), extracts the answer from standard output and strips whitespace, handles errors by returning descriptive messages, and prints the character count of generated text for monitoring. For the FLoRA paper queries, when given a prompt like "Answer based on context: [5 chunks about PEFT] Question: What problem does FLoRA address?", the model read through the provided chunks and generated a focused 283-character answer citing Source 5 from page 1, demonstrating it successfully grounded its response in the retrieved context rather than guessing from general knowledge. 2.7 Main RAG System Architecture 2.7 Main RAG System Architecture The complete RAG system orchestrates all components through three main phases. The Ingest phase loads documents, splits them into chunks, converts chunks to embeddings, and stores them in the vector database (done once during setup). The Query phase takes question, converts it to an embedding, searches the vector database for similar chunks, and retrieves the most relevant ones. The Generate phase builds a structured prompt containing the retrieved chunks as context along with your question and strict instructions, sends it to the LLM, and returns the answer with source citations. The system uses a distance threshold of 0.6 to filter out irrelevant chunks, where lower values are stricter and may miss some relevant information, while higher values are more lenient but may include noise. 
The prompt engineering includes clear instructions like "Answer only from context" with a structured format of CONTEXT, then QUESTION, then INSTRUCTIONS, forcing the model to cite sources and refuse to guess when the context is insufficient, ensuring answers are grounded in the actual documents.

Code:

```python
class RAGSystem:
    """complete RAG orchestration."""

    def __init__(
        self,
        documents_dir: str = "documents",
        db_dir: str = "vector_db",
        llm_model: str = "llama3.2",
        embedding_model: str = "nomic-embed-text"
    ):
        self.documents_dir = documents_dir
        self.db_dir = db_dir

        print("initializing RAG System...")
        self.embedder = OllamaEmbedder(embedding_model)
        self.llm = OllamaLLM(llm_model)
        self.vector_db = VectorDatabase()
        print("RAG System initialized!")

    def ingest_documents(
        self,
        chunk_size: int = 750,
        overlap: int = 100,
        force_rebuild: bool = False
    ):
        """Build or load vector database."""
        # try loading existing database
        if not force_rebuild and os.path.exists(self.db_dir):
            print("loading existing database...")
            if self.vector_db.load(self.db_dir, self.embedder):
                return

        print("Building new database...")

        # load documents
        documents = DocumentLoader.load_documents(self.documents_dir)
        if not documents:
            print("no documents found!")
            return

        # chunk documents
        all_chunks = []
        for doc in documents:
            chunks = TextChunker.chunk_text(
                doc['text'],
                chunk_size=chunk_size,
                overlap=overlap,
                metadata=doc['metadata']
            )
            all_chunks.extend(chunks)
        print(f"created {len(all_chunks)} chunks")

        # generate embeddings
        all_chunks = self.embedder.embed_chunks(all_chunks)

        # store in vector DB
        self.vector_db.add_chunks(all_chunks)

        # save for future use
        self.vector_db.save(self.db_dir)

    def query(
        self,
        question: str,
        top_k: int = 5,
        distance_threshold: float = 1.5
    ) -> Dict:
        """Answer question using RAG.

        Returns:
            {
                'answer': Generated answer,
                'sources': List of source chunks,
                'confidence': 'high'|'medium'|'low'
            }
        """
        print(f"\nQuestion: {question}")

        # embed query
        query_vector = self.embedder.embed_text(question)

        # search vector DB
        results = self.vector_db.search(query_vector, top_k=top_k)

        # filter by threshold
        filtered_results = [
            (chunk, dist) for chunk, dist in results
            if dist < distance_threshold
        ]

        if not filtered_results:
            return {
                'answer': "Insufficient context to answer this question.",
                'sources': [],
                'confidence': 'low'
            }

        # build context from chunks
        context_parts = []
        sources = []
        for i, (chunk, distance) in enumerate(filtered_results):
            context_parts.append(
                f"[Source {i+1}: {chunk.metadata['source']}, "
                f"Page {chunk.metadata.get('page', 'N/A')}]\n{chunk.text}\n"
            )
            sources.append({
                'id': chunk.id,
                'source': chunk.metadata['source'],
                'page': chunk.metadata.get('page', 'N/A'),
                'distance': distance
            })

        context = "\n".join(context_parts)

        # build prompt
        prompt = f"""You are a helpful AI assistant. Answer the question based ONLY on the provided context.

CONTEXT:
{context}

QUESTION: {question}

INSTRUCTIONS:
1. Answer based only on the context above
2. Cite source numbers (e.g., "According to Source 1...")
3. If context is insufficient, state that clearly
4. Be concise but thorough

ANSWER:"""

        # generate answer
        print("Generating answer...")
        answer = self.llm.generate(prompt, temperature=0.3)

        return {
            'answer': answer,
            'sources': sources,
            'confidence': 'high' if len(filtered_results) >= 3 else 'medium'
        }

print("RAG System class ready!")
```
The RAGSystem class brings everything together by initializing the embedder, LLM, and vector database when created. The ingest_documents method first checks whether a saved database exists and loads it; otherwise it builds a new one by calling DocumentLoader to read the files, TextChunker to split them (creating 61 chunks for the FLoRA paper), and the embedder to convert the chunks to vectors (taking 2-3 minutes), then adds them to the vector database and saves everything to disk for next time.

The query method implements the complete retrieval and generation flow. It converts the question to a vector using the embedder and searches the vector database for the top-k most similar chunks (5 by default). It then filters the results to keep only chunks below the distance threshold (0.6 in this tutorial, keeping chunks with distances like 0.21-0.56 while rejecting higher values) and returns "insufficient context" if no chunks pass. For the chunks that remain, it builds a formatted context string with source labels like "[Source 1: flora.pdf, Page 1]" followed by the chunk text, constructs a detailed prompt with the context, question, and instructions to cite sources, and sends the prompt to the LLM with temperature 0.3 for focused answers. Finally, it returns a dictionary containing the answer, the source details with distances, and a confidence level (high if 3 or more sources, medium otherwise).
2.9 Initialization and testing

The RAG system requires configuring several parameters before use. The documents_dir specifies where your PDF, Markdown, or HTML files are stored (defaults to the "documents" folder). The db_dir sets where the vector database will be saved for reuse in future sessions (defaults to the "vector_db" folder). The chunk_size determines how many characters each text segment contains; 750 provides balanced context for this document. The overlap parameter should be 10-20% of chunk_size to prevent information loss at boundaries, so 100 characters works well with 750-character chunks. The force_rebuild flag controls whether to rebuild the database from scratch (True) or load the existing saved database if available (False, recommended after the first run).

- `documents_dir`: Where your documents are
- `db_dir`: Where the vector database is saved
- `chunk_size`: 500-1000 (750 is balanced)
- `overlap`: 10-20% of chunk_size
- `force_rebuild`: Set True to rebuild from scratch

Code:

```python
# Initialize RAG system
rag = RAGSystem(
    documents_dir="documents",
    db_dir="vector_db",
    llm_model="llama3.2",
    embedding_model="nomic-embed-text"
)

# Build/load database
rag.ingest_documents(
    chunk_size=750,
    overlap=100,
    force_rebuild=True
)
```

If you have followed along to this point, the output should report the loaded document sections, the number of chunks created, the embedding progress, and finally a "database saved to vector_db" message.
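Once that first build has been saved, later sessions can skip the rebuild and reuse the stored database by leaving force_rebuild at its default:

```python
# On later runs, reuse the saved vector database instead of rebuilding it.
rag = RAGSystem(documents_dir="documents", db_dir="vector_db")
rag.ingest_documents(force_rebuild=False)   # loads vector_db/ if it exists
```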
2.9.1 Testing with an example

Parameters used:

`top_k`: A higher value means more context, which makes answer generation slower. Five chunks provide diverse perspectives without overwhelming the context window. For the FLoRA paper, this retrieved approximately 3,750 characters of context (750 per chunk). Testing with 3 chunks sometimes missed nuanced information, while 7 or more introduced redundancy and slowed generation. The system assigning high confidence when retrieving 5 sources validates this choice, and multiple sources strengthen answers through cross-referencing.

`distance_threshold`: A lower value means stricter matching. With cosine-similarity distances (1 minus similarity), 0 means identical and values near 1 mean unrelated. The 0.6 threshold emerged from observing query results: relevant chunks scored 0.2-0.56, while irrelevant content exceeded 0.7, so the 0.6 cutoff effectively separates signal from noise. For unanswerable questions, all chunks exceeded this threshold, correctly triggering the insufficient-context response rather than forcing an answer from marginal matches.

```python
# example: Ask a question
question = "What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?"

result = rag.query(
    question=question,
    top_k=5,
    distance_threshold=0.6
)

# Display results
print("\n" + "="*60)
print("ANSWER:")
print("="*60)
print(result['answer'])

print("\n" + "="*60)
print(f"CONFIDENCE: {result['confidence'].upper()}")
print("="*60)

print("\nSOURCES:")
for i, source in enumerate(result['sources'], 1):
    print(f"  {i}. {source['source']} (Page {source['page']}) - Distance: {source['distance']:.4f}")
print("="*60)
```

Output:

```
Question: What problem does FLoRA aim to address in the context of parameter-efficient fine-tuning (PEFT) for large language models?
Generating answer...
  Generating with llama3.2 ...
  Generated 649 characters

============================================================
ANSWER:
============================================================
According to Source 3 [Source 3: flora.pdf, Page 1], FLoRA aims to address the problem of parameter-efficient fine-tuning for large language models (LLMs), with a focus on reducing inference-time latencies. The authors highlight that despite the emergence of various parameter-efficient fine-tuning methods (PEFT) such as LoRA and parallel adapters, there is still a significant degree of unexplored subject matter.

Furthermore, according to Source 4 [Source 4: flora.pdf, Page 1], FLoRA proposes a family of fused forward-backward adapters (FFBA) that combine ideas from popular PEFT methods to improve fine-tuning accuracies and minimize latency.

============================================================
CONFIDENCE: HIGH
============================================================
SOURCES:
  1. flora.pdf (Page 9) - Distance: 0.2233
  2. flora.pdf (Page 9) - Distance: 0.2614
  3. flora.pdf (Page 1) - Distance: 0.2660
  4. flora.pdf (Page 1) - Distance: 0.2805
  5. flora.pdf (Page 10) - Distance: 0.2863
============================================================
```
In the example and output above, the system processed the question as follows:

- It converted the question to a vector.
- It found the 5 most similar chunks (distances 0.22-0.28), which came from pages 1, 9, and 10.
- The LLM read those chunks and answered: "FLoRA proposes a family of fused forward-backward adapters (FFBA) that combine ideas from popular PEFT methods to improve fine-tuning accuracies and minimize latency."
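To keep asking questions against the same database without editing the script, a small interactive loop on top of the RAGSystem built above works well. This is a simple sketch; adjust top_k and the threshold to taste.

```python
# Simple interactive loop on top of the RAGSystem built above.
while True:
    q = input("\nAsk a question (or press Enter to quit): ").strip()
    if not q or q.lower() in {"quit", "exit"}:
        break
    res = rag.query(q, top_k=5, distance_threshold=0.6)
    print("\n" + res['answer'])
    for s in res['sources']:
        print(f"  - {s['source']} (Page {s['page']}), distance {s['distance']:.2f}")
```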
3.0 Conclusion

This post shows how to build a complete RAG system that runs entirely on a local machine, with no internet connectivity required for operation after the initial setup. Using Ollama for embeddings and language generation, combined with FAISS for vector storage, I explored how to build a system where documents never have to leave your computer, eliminating the privacy risk while still delivering the full intelligence of modern large language models. As shown in the FLoRA paper case study, this is effective, returning accurate retrieval with distance scores ranging from 0.22 to 0.28, properly cited sources, and substantive answers that directly address technical questions.

Medical professionals can ask questions about patient records without HIPAA violations, legal teams can work with confidential case files with complete privilege protection, researchers can use proprietary datasets without risking intellectual property, and users in bandwidth-constrained environments can access sophisticated AI capabilities without stable internet access. This system proves that privacy, performance, and intelligence are not competing priorities but complementary aspects of well-designed AI solutions, providing a blueprint for those looking to leverage advanced AI capabilities while maintaining complete data sovereignty and control.

4.0 References

- FLoRA Paper: Gowda, D., Song, S., Lee, J., & Goka, H. (2025). FLoRA: Fused forward-backward adapters for parameter-efficient fine-tuning and reducing inference-time latencies of LLMs. arXiv:2511.00050. https://arxiv.org/pdf/2511.00050
- Ollama Documentation: https://docs.ollama.com
- FAISS: Facebook AI Similarity Search. https://github.com/facebookresearch/faiss
- Meta AI: Llama 3.2 Model Card. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/
- Nomic AI: nomic-embed-text Embedding Model. https://www.nomic.ai/blog/nomic-embed-text-v1

Full notebook file can be found here: https://github.com/teedonk/Offline-RAG-system