One recurring problem in software teams is onboarding. You hire a new developer, and suddenly you realize how much knowledge is scattered across:

- Code
- Documentation
- Recorded team meetings

Even when everything is documented, new developers still ask the same questions:

- What is the architecture of this project?
- How do I add a module?
- Where is the driver layer implemented?
- How do I run this in Docker?

I wanted to solve this problem for my project OpenSCADA Lite, so I decided to build something interesting: a local AI assistant trained on the entire project.

Not using external APIs. Not sending code outside the company. Just a local Retrieval-Augmented Generation (RAG) pipeline.

After some tweaks, it worked even on very modest hardware.

## My Main Goals

Instead of telling new developers, "Read these 30 documents and ask me if you have questions," they can simply ask:

"How do I create a new module in this system?"

And the AI answers using our own codebase and documentation.

## The Data I Used to Train the Assistant

The system indexes three main sources:

### 1. The Entire Codebase

Modules, classes, and architecture from the project.

### 2. Documentation

README, notes, and configuration explanations.

### 3. Development Conversations

All the ChatGPT conversations I had while building the project. This is actually extremely valuable because it contains:

- Design decisions
- Alternatives explored
- Architectural reasoning

So instead of losing that knowledge, the AI can use it.
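As a rough sketch of what gathering those three sources can look like before indexing (the directory layout and the `collect_documents` helper are my own assumptions for illustration, not the project's actual code):

```python
from pathlib import Path


def collect_documents(root):
    """Gather the three knowledge sources as (source_type, text) pairs.

    Assumed layout: Python modules anywhere in the tree, Markdown docs
    anywhere in the tree, and exported ChatGPT conversations as .txt
    files under a chats/ directory.
    """
    root = Path(root)
    docs = []
    for py_file in root.rglob("*.py"):
        docs.append(("code", py_file.read_text(encoding="utf-8")))
    for md_file in root.rglob("*.md"):
        docs.append(("docs", md_file.read_text(encoding="utf-8")))
    chats_dir = root / "chats"
    if chats_dir.is_dir():
        for chat_file in chats_dir.glob("*.txt"):
            docs.append(("chat", chat_file.read_text(encoding="utf-8")))
    return docs
```

Each `(source_type, text)` pair then goes through the chunking and embedding steps described below.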
## Architecture

The system is a classic RAG pipeline.

### Step 1 — Chunking the Information (The Most Important Part)

The biggest mistake people make with RAG systems is bad chunking. Good chunks = good answers.

I split the project into around ~148 chunks:

- Code modules
- README sections
- Chat discussions
- Documentation blocks

Example of how ChatGPT conversations were stored:

```
## Prompt:
My question is: what do we use as rule engine?

## Response:
You're asking which technology or library to use for a rule engine
in Python for SCADA systems.

Option A: Custom Lightweight Rule Engine

Why:
- Full control
- Async friendly
- Easy integration with DTOs

How: Store rules in JSON/YAML and evaluate conditions safely.
```

This formatting preserves question → reasoning → decision, which is gold for an AI assistant.

### Step 2 — Generating Embeddings

Each chunk is converted into a vector using `multi-qa-MiniLM-L6-cos-v1`.

This produces:

- 384-dimension embeddings
- Fast generation
- Very good semantic search performance

Even on CPU. This step transforms the project knowledge into something the AI can search.

### Step 3 — Building the FAISS Index

All embeddings are stored in a FAISS index.
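The retrieval math behind these two steps can be sketched without the heavy dependencies. In the real pipeline the vectors come from `SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")` (384 dimensions) and are searched with FAISS; here, tiny hand-made vectors stand in, and plain NumPy shows what FAISS's `IndexFlatIP` computes on L2-normalized vectors, namely cosine similarity:

```python
import numpy as np


def normalize(vectors):
    # L2-normalize each row so that inner product == cosine similarity.
    v = np.asarray(vectors, dtype=np.float32)
    return v / np.linalg.norm(v, axis=1, keepdims=True)


def search(index_vectors, query_vector, k=2):
    """Return indices of the k most similar chunks, best first."""
    q = np.asarray(query_vector, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index_vectors @ q          # cosine similarity per chunk
    return np.argsort(-scores)[:k]


# Three toy "chunk" embeddings and a query closest to the first one.
index = normalize([[1.0, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.1, 1.0]])
top = search(index, [0.9, 0.2, 0.0], k=2)  # top[0] is chunk 0
```

With FAISS the same thing is `faiss.IndexFlatIP(384)` plus `index.add(vectors)` and `index.search(query, k)` on normalized vectors; the brute-force version above is equivalent at this scale.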
In my case, after several tests:

- ~148 vectors
- Index size: about 60 KB
- Extremely fast similarity search

When someone asks a question, the system retrieves the most relevant chunks from this index.

### Step 4 — Choosing an LLM That Actually Runs on My Hardware

Here is where things got interesting. My setup is not exactly cutting edge:

- CPU: i7-2600
- RAM: 32 GB
- GPU: GTX 1050 Ti (CUDA 6.1)

Modern AI stacks don't like this GPU anymore. PyTorch dropped support for this architecture in newer CUDA builds. So I had two problems:

1. Find a model good with code
2. Make it run on old hardware

#### First Attempt: Code Llama

I started with Code Llama GGUF models. They were promising, but:

- GPU support was problematic
- CPU inference was slow
- Some models were not well optimized for my setup

So I kept experimenting.

#### The Model That Finally Worked

The one that ended up working best was DeepSeek Coder 6.7B Instruct (Q5_K_M quantization).

Model file: `deepseek-coder-6.7b-instruct-q5_k_m.gguf`

Loaded with llama.cpp. This was the key.

Why this worked:

- GGUF format optimized for local inference
- Quantized model (fits in RAM)
- Works with CPU and older GPUs
- Good performance for code understanding

This combination finally made the system stable.

### Performance Reality

Is it fast? No. But it works.

Query time: 5–10 minutes per question on this machine.
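To give a feel for how the pieces connect, here is a minimal sketch of assembling retrieved chunks into a prompt and feeding it to the quantized model. The `build_rag_prompt` helper and its template wording are my own assumptions, not the project's actual code; the commented-out load uses llama-cpp-python, which requires the package and the GGUF file on disk:

```python
def build_rag_prompt(chunks, question):
    """Assemble retrieved chunks and the user question into one prompt.

    Hypothetical helper: the exact template is an assumption for
    illustration.
    """
    parts = [
        "You are an assistant for the OpenSCADA Lite project.",
        "Answer the question using only the context below.",
        "",
    ]
    for i, chunk in enumerate(chunks, start=1):
        parts += [f"### Context {i}", chunk, ""]
    parts += ["### Question", question, "", "### Answer"]
    return "\n".join(parts)


# Loading the model (commented out: needs llama-cpp-python + the file):
#
# from llama_cpp import Llama
# llm = Llama(
#     model_path="deepseek-coder-6.7b-instruct-q5_k_m.gguf",
#     n_ctx=4096,      # context window for prompt + answer
#     n_threads=8,     # CPU threads; tune to your machine
# )
# output = llm(build_rag_prompt(chunks, question), max_tokens=512)
```

On old hardware, most of the 5–10 minutes per question is spent in this final generation step, not in retrieval.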
But the answers are:

- Accurate
- Grounded in the project
- Often surprisingly detailed

## Examples of Questions the Model Can Answer

### Basic

Question: What is the name of the project?

Answer: OpenSCADA-Lite

Simple but correct.

### Installation

Question: Can I use Docker?

Answer: Yes, Docker can be used to containerize the project and run it consistently across systems. (The model then explains how Docker works and how to run it.)

### Real Developer Question

This is where it becomes powerful.

Question: How do I create a new OPC UA driver?

Answer: The model explains:

- Which class to extend
- Where drivers are registered
- How to connect using asyncua
- Example code

And it produces something like this:

```python
from asyncua import Client

from openscada_lite.modules.communication.drivers.server_protocol import ServerProtocol


class OPCUAClientDriver(ServerProtocol):
    def __init__(self, server_url, **kwargs):
        self.server_url = server_url
        self.client = None

    async def start(self):
        self.client = Client(self.server_url)
        await self.client.connect()
```

This is knowledge extracted directly from the project structure.

## What This Means for Engineering Teams

This approach changes onboarding. Instead of weeks of KT sessions, you get an AI that knows your architecture.

Developers can ask:

- Where is the event bus implemented?
- How do modules communicate?
- How do I add a datapoint?

And the system answers using your code.

## Lessons Learned Building This

Several things surprised me.

1. **Chunking matters more than the model.** Bad chunks = bad AI.
2. **Hardware still matters.** Modern AI tooling assumes newer GPUs; older GPUs require alternative stacks like llama.cpp.
3. **Code-focused models make a huge difference.** General LLMs perform worse than models trained for code.
4. **You don't need a data center to build useful AI.** This entire system runs locally.

If you want to try it, I published the full code here: https://github.com/boadadf/rag_scripts
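Since chunking turned out to be the biggest lever, here is a minimal sliding-window chunker of the kind that produces such chunks. The window and overlap sizes are illustrative, not the values used in the repo, which also splits on natural boundaries (modules, README sections, chat turns):

```python
def chunk_text(text, max_chars=1200, overlap=200):
    """Split text into overlapping windows so context isn't cut mid-thought.

    Sizes are illustrative; overlap keeps a sentence that straddles a
    boundary retrievable from at least one chunk.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

In practice, splitting on semantic boundaries beats a fixed window, but an overlapping window like this is a reasonable fallback for long unstructured text.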