One recurring problem in software teams is onboarding.
You hire a new developer, and suddenly you realize how much knowledge is scattered across:
- Code
- Documentation
- Recorded team meetings
Even when everything is documented, new developers still ask the same questions:
- What is the architecture of this project?
- How do I add a module?
- Where is the driver layer implemented?
- How do I run this in Docker?
I wanted to solve this problem for my project OpenSCADA Lite, so I decided to build something interesting: a local AI assistant trained on the entire project.
Not using external APIs.
Not sending code outside the company.
Just a local Retrieval-Augmented Generation (RAG) pipeline.
After some tweaks, it worked even on very modest hardware.
My Main Goals
Instead of telling new developers, "Read these 30 documents and ask me if you have questions," they can simply ask:
"How do I create a new module in this system?"
And the AI answers using our own codebase and documentation.
The Data I Used to Train the Assistant
The system indexes three main sources:
1. The Entire Codebase
Modules, classes, and architecture from the project.
2. Documentation
README, notes, and configuration explanations.
3. Development Conversations
All the ChatGPT conversations I had while building the project.
This is actually extremely valuable because it contains:
- Design decisions
- Alternatives explored
- Architectural reasoning
So instead of losing that knowledge, the AI can use it.
Architecture
The system is a classic RAG pipeline:
Step 1 — Chunking the Information (The Most Important Part)
The biggest mistake people make with RAG systems is bad chunking.
Good chunks = good answers.
I split the project into ~148 chunks, drawn from:
- Code modules
- README sections
- Chat discussions
- Documentation blocks
Example of how ChatGPT conversations were stored:
## Prompt:
My question is: what do we use as rule engine?
## Response:
You're asking which technology or library to use for a rule engine in Python for SCADA systems.
Option A: Custom Lightweight Rule Engine
Why:
- Full control
- Async friendly
- Easy integration with DTOs
How:
Store rules in JSON/YAML and evaluate conditions safely.
This formatting preserves question → reasoning → decision, which is gold for an AI assistant.
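To make that concrete, here is a minimal sketch of how an exported chat log in that format could be split so each question stays together with its answer. The `chunk_chat_log` helper and the sample log are illustrative, not the project's actual code:

```python
import re

def chunk_chat_log(text: str) -> list[str]:
    """Split an exported chat log into prompt/response chunks.

    Splits before every '## Prompt:' heading so that each question
    stays together with its reasoning and final decision.
    """
    parts = re.split(r"(?m)^(?=## Prompt:)", text)
    return [p.strip() for p in parts if p.strip()]

log = """## Prompt:
My question is: what do we use as rule engine?
## Response:
Store rules in JSON/YAML and evaluate conditions safely.
## Prompt:
How do modules talk to each other?
## Response:
Through the event bus.
"""

chunks = chunk_chat_log(log)
print(len(chunks))  # 2
```

Splitting on the prompt heading (rather than a fixed character count) is what keeps each chunk semantically whole.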
Step 2 — Generating Embeddings
Each chunk is converted into a vector using:
multi-qa-MiniLM-L6-cos-v1
This produces:
- 384-dimensional embeddings
- Fast generation
- Very good semantic search performance, even on CPU
This step transforms the project knowledge into something the AI can search.
Step 3 — Building the FAISS Index
All embeddings are stored in a FAISS index.
After several tests, my index ended up at:
- ~148 vectors
- Index size: about 60 KB
- Extremely fast similarity search
When someone asks a question, the system retrieves the most relevant chunks from this index.
Step 4 — Choosing an LLM That Actually Runs on My Hardware
Here is where things got interesting.
My setup is not exactly cutting edge:
CPU: i7-2600
RAM: 32 GB
GPU: GTX 1050 Ti (CUDA 6.1)
Modern AI stacks don’t like this GPU anymore.
PyTorch dropped support for this architecture in newer CUDA builds.
So I had two problems:
- Find a model good with code
- Make it run on old hardware
First Attempt: Code Llama
I started with Code Llama GGUF models.
They were promising, but:
- GPU support was problematic
- CPU inference was slow
- Some models were not well optimized for my setup
So I kept experimenting.
The Model That Finally Worked
The one that ended up working best was:
DeepSeek Coder 6.7B Instruct (Q5_K_M quantization)
Model file:
deepseek-coder-6.7b-instruct-q5_k_m.gguf
Loaded with:
llama.cpp
This was the key.
Why this worked:
- GGUF format optimized for local inference
- Quantized model (fits in RAM)
- Works with CPU and older GPUs
- Good performance for code understanding
This combination finally made the system stable.
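For reference, loading such a GGUF model with the llama-cpp-python bindings looks roughly like this; the file path and parameter values are assumptions to tune for your machine, not the project's actual configuration:

```python
from llama_cpp import Llama

# Path is an assumption; point it at wherever the GGUF file lives.
llm = Llama(
    model_path="models/deepseek-coder-6.7b-instruct-q5_k_m.gguf",
    n_ctx=4096,      # room for the prompt plus retrieved chunks
    n_threads=8,     # i7-2600: 4 cores / 8 threads
    n_gpu_layers=0,  # pure CPU; raise if llama.cpp was built with CUDA
)

out = llm("Question: How do modules communicate?\nAnswer:", max_tokens=256)
print(out["choices"][0]["text"])
```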
Performance Reality
Is it fast?
No.
But it works.
Query time: 5–10 minutes per question on this machine.
But the answers are:
- Accurate
- Grounded in the project
- Often surprisingly detailed
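Under the hood, each query ties the steps together: embed the question, fetch the nearest chunks from FAISS, and paste them into one grounded prompt. A sketch of the prompt-assembly step (the helper name, wording, and sample chunks are illustrative):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Paste the retrieved chunks into a grounded prompt for the LLM."""
    context = "\n\n".join(f"[chunk {i}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the project context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt(
    "How do I create a new module?",
    [
        "Modules register themselves with the event bus.",
        "Each module exposes start() and stop() hooks.",
    ],
)
print(prompt)
```

Instructing the model to answer only from the supplied context is what keeps the answers grounded in the project instead of the model's general training data.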
Examples of Questions the Model Can Answer
Basic Question
Question
What is the name of the project?
Answer
OpenSCADA-Lite
Simple but correct.
Installation Question
Question
Can I use Docker?
Answer
Yes, Docker can be used to containerize the project and run it consistently across systems.
(The model then explains how Docker works and how to run it.)
Real Developer Question
This is where it becomes powerful.
Question
How do I create a new OPC UA driver?
Answer
The model explains:
- Which class to extend
- Where drivers are registered
- How to connect using asyncua
- Example code
And it produces something like this:
from asyncua import Client
from openscada_lite.modules.communication.drivers.server_protocol import ServerProtocol

class OPCUAClientDriver(ServerProtocol):
    def __init__(self, server_url, **kwargs):
        self.server_url = server_url
        self.client = None

    async def start(self):
        self.client = Client(self.server_url)
        await self.client.connect()
This is knowledge extracted directly from the project structure.
What This Means for Engineering Teams
This approach changes onboarding.
Instead of:
Weeks of knowledge-transfer (KT) sessions.
You get:
An AI that knows your architecture.
Developers can ask:
- Where is the event bus implemented?
- How do modules communicate?
- How do I add a datapoint?
And the system answers using your code.
Lessons Learned Building This
Several things surprised me.
Chunking matters more than the model
Bad chunks = bad AI.
Hardware still matters
Modern AI tooling assumes newer GPUs.
Older GPUs require alternative stacks like llama.cpp.
Code-focused models make a huge difference
General LLMs perform worse than models trained for code.
You don’t need a data center to build useful AI
This entire system runs locally.
If you want to try it, I published the full code here: https://github.com/boadadf/rag_scripts
