I Built a Project-Specific LLM From My Own Codebase

Written by zednan | Published 2026/03/13
Tech Story Tags: rag | local-ai-assistant | codebase-ai-search | faiss-vector-database | deepseek-coder-model | llama.cpp-local-llm | ai-for-software-teams | ai-for-developer-onboarding

TL;DR: A developer built a local AI assistant to help new engineers understand a complex codebase. Using a Retrieval-Augmented Generation (RAG) pipeline with FAISS, DeepSeek Coder, and llama.cpp, the system indexes project code, documentation, and design conversations so developers can ask questions about architecture, modules, or setup and receive answers grounded in the project itself. The setup runs entirely on modest hardware, demonstrating that teams can build practical AI tooling for onboarding and knowledge retention without cloud APIs or expensive infrastructure.

One recurring problem in software teams is onboarding.

You hire a new developer, and suddenly you realize how much knowledge is scattered across:

  • Code
  • Documentation
  • Recorded team meetings

Even when everything is documented, new developers still ask the same questions:

  • What is the architecture of this project?
  • How do I add a module?
  • Where is the driver layer implemented?
  • How do I run this in Docker?

I wanted to solve this problem for my project OpenSCADA Lite, so I decided to build something interesting: A local AI assistant trained on the entire project.

Not using external APIs.
Not sending code outside the company.
Just a local Retrieval-Augmented Generation (RAG) pipeline.

After some tweaks, it worked even on very modest hardware.


My Main Goals

Instead of telling new developers, "Read these 30 documents and ask me if you have questions,"

they can simply ask: "How do I create a new module in this system?"

And the AI answers using our own codebase and documentation.

The Data I Used to Train the Assistant

The system indexes three main sources:

1. The Entire Codebase

Modules, classes, and architecture from the project.

2. Documentation

README, notes, and configuration explanations.

3. Development Conversations

All the ChatGPT conversations I had while building the project.

This is actually extremely valuable because it contains:

  • Design decisions
  • Alternatives explored
  • Architectural reasoning

So instead of losing that knowledge, the AI can use it.

Architecture

The system is a classic RAG pipeline:

Step 1 — Chunking the Information (The Most Important Part)

The biggest mistake people make with RAG systems is bad chunking.

Good chunks = good answers.

I split the project into roughly 148 chunks, covering:

  • Code modules
  • README sections
  • Chat discussions
  • Documentation blocks

Example of how ChatGPT conversations were stored:

## Prompt:
My question is: what do we use as rule engine?

## Response:
You're asking which technology or library to use for a rule engine in Python for SCADA systems.

Option A: Custom Lightweight Rule Engine
Why:
- Full control
- Async friendly
- Easy integration with DTOs

How:
Store rules in JSON/YAML and evaluate conditions safely.

This formatting preserves the question → reasoning → decision chain, which is gold for an AI assistant.
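To make the chunking concrete, here is a minimal sketch of a heading-based splitter in plain Python. It is an illustration, not the repo's actual implementation: it splits on `## ` headings (the format used for the stored conversations above) and breaks oversized sections on paragraph boundaries.

```python
def chunk_by_headings(text, max_chars=1200):
    """Split markdown-ish text into chunks at '## ' headings,
    then break any section that exceeds max_chars."""
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("## ") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        while len(section) > max_chars:
            # Prefer breaking on a blank line so paragraphs stay whole
            cut = section.rfind("\n\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:]
        if section.strip():
            chunks.append(section.strip())
    return chunks
```

The key design choice is that each `## Prompt:` / `## Response:` pair stays together in one chunk whenever it fits, so the question and its reasoning are retrieved as a unit.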

Step 2 — Generating Embeddings

Each chunk is converted into a vector using:

multi-qa-MiniLM-L6-cos-v1

This produces:

  • 384-dimension embeddings
  • Fast generation
  • Very good semantic search performance

Even on CPU.

This step transforms the project knowledge into something the AI can search.

Step 3 — Building the FAISS Index

All embeddings are stored in a FAISS index.

After several tests, my index ended up with:

  • ~148 vectors
  • Index size: about 60 KB
  • Extremely fast similarity search

When someone asks a question, the system retrieves the most relevant chunks from this index.

Step 4 — Choosing an LLM That Actually Runs on My Hardware

Here is where things got interesting.

My setup is not exactly cutting edge:

CPU: i7-2600
RAM: 32 GB
GPU: GTX 1050 Ti (CUDA 6.1)

Modern AI stacks don’t like this GPU anymore.

PyTorch dropped support for this architecture in newer CUDA builds.

So I had two problems:

  1. Find a model good with code
  2. Make it run on old hardware

First Attempt: CodeLlama

I started with Code Llama GGUF models.

They were promising, but:

  • GPU support was problematic
  • CPU inference was slow
  • Some models were not well optimized for my setup

So I kept experimenting.

The Model That Finally Worked

The one that ended up working best was:

DeepSeek Coder 6.7B Instruct (Q5_K_M quantization)

Model file:

deepseek-coder-6.7b-instruct-q5_k_m.gguf

Loaded with:

llama.cpp

This was the key.

Why this worked:

  • GGUF format optimized for local inference
  • Quantized model (fits in RAM)
  • Works with CPU and older GPUs
  • Good performance for code understanding

This combination finally made the system stable.

Performance Reality

Is it fast?

No.

But it works.

Query time:

5–10 minutes per question

On this machine.

But the answers are:

  • Accurate
  • Grounded in the project
  • Often surprisingly detailed

Examples of Questions the Model Can Answer

Basic Question

Question

What is the name of the project?

Answer

OpenSCADA-Lite

Simple but correct.

Installation Question

Question

Can I use Docker?

Answer

Yes, Docker can be used to containerize the project and run it consistently across systems.

(The model then explains how Docker works and how to run it.)

Real Developer Question

This is where it becomes powerful.

Question

How do I create a new OPC UA driver?

Answer

The model explains:

  • Which class to extend
  • Where drivers are registered
  • How to connect using asyncua
  • Example code

And it produces something like this:

from asyncua import Client
from openscada_lite.modules.communication.drivers.server_protocol import ServerProtocol

class OPCUAClientDriver(ServerProtocol):
    def __init__(self, server_url, **kwargs):
        self.server_url = server_url
        self.client = None

    async def start(self):
        # Connect to the OPC UA server using asyncua
        self.client = Client(self.server_url)
        await self.client.connect()

This is knowledge extracted directly from the project structure.

What This Means for Engineering Teams

This approach changes onboarding.

Instead of:

Weeks of knowledge-transfer (KT) sessions.

You get:

An AI that knows your architecture.

Developers can ask:

  • Where is the event bus implemented?
  • How do modules communicate?
  • How do I add a datapoint?

And the system answers using your code.
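Gluing the steps together, the whole query path is one small function. This is an illustrative sketch, not the repo's actual API: `embed_fn` and `generate_fn` stand in for the embedding model and llama.cpp calls shown earlier, and are passed in as parameters so the glue stays testable.

```python
import numpy as np

def answer_question(question, chunks, index, embed_fn, generate_fn, k=3):
    """Retrieve the k most relevant chunks from the FAISS index and
    ask the LLM to answer using only those chunks as context."""
    query = np.asarray(embed_fn([question]), dtype="float32")
    _, ids = index.search(query, k)
    # FAISS returns -1 for empty slots when k exceeds the index size
    context = "\n\n".join(chunks[i] for i in ids[0] if i != -1)
    prompt = (
        "Answer from this project context only.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate_fn(prompt)
```

Everything in the loop is local: embedding the question, searching the index, and generating the answer, so no code ever leaves the machine.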

Lessons Learned Building This

Several things surprised me.

Chunking matters more than the model

Bad chunks = bad AI.

Hardware still matters

Modern AI tooling assumes newer GPUs.

Older GPUs require alternative stacks like llama.cpp.

Code-focused models make a huge difference

General LLMs perform worse than models trained for code.

You don’t need a data center to build useful AI

This entire system runs locally.

If you want to try it, I published the full code here: https://github.com/boadadf/rag_scripts


Published by HackerNoon on 2026/03/13