If you’ve ever had to dive into a complex codebase and felt overwhelmed by the sheer volume of files, classes, and configuration, you’re not alone. Many developers spend countless hours trying to understand codebases they didn't write. This is where a local Codebase Q&A Agent comes in handy.

In this post, we'll explore:

- What a Codebase Q&A Agent is
- How to build one using embeddings
- How to make it work even if embeddings are not available

## What is a Codebase Q&A Agent?

A Codebase Q&A Agent is essentially an AI assistant that can answer questions about your own source code. Think of it as an AI pair programmer that understands the structure and content of your project. Instead of manually searching through files, you can ask questions like:

- "Where is the database connection established?"
- "What handles user authentication?"
- "Where are API endpoints defined?"

## How Does It Work?

The core idea involves:

1. Loading code files from local directories
2. Splitting them into smaller, manageable text chunks
3. Generating embeddings for these chunks
4. Storing them in a vector database
5. Retrieving relevant chunks based on user queries

## Building the Agent with Embeddings

### Requirements

```
pip install langchain langchain-openai langchain-community faiss-cpu tiktoken
```

Set the following environment variables:

```
export AZURE_OPENAI_KEY=your-key
export AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
export AZURE_OPENAI_DEPLOYMENT=your-embedding-deployment
```

### Code to Prepare and Save the Index

```python
import os
import glob
import time
from typing import List

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

REPOS = ["C:/YourCodeRepo"]
INDEX_SAVE_PATH = "codebase_index"


def prepare_and_save_index(repos: List[str], save_path: str = INDEX_SAVE_PATH):
    # Collect source files from each repository.
    all_docs = []
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                    all_docs.append({"page_content": content, "metadata": {"source": file_path}})
                except Exception as e:
                    print(f"Skipping {file_path}: {e}")

    # Split files into overlapping chunks, passing the source path as metadata
    # so answers can point back to the originating file.
    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    chunks = splitter.create_documents(
        [doc["page_content"] for doc in all_docs],
        metadatas=[doc["metadata"] for doc in all_docs],
    )

    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    )

    # Build the index one chunk at a time so a rate-limit error can be
    # retried (up to 5 times, with a 60-second backoff) without
    # re-embedding everything.
    vectorstore = None
    for i, chunk in enumerate(chunks):
        for attempt in range(5):
            try:
                if vectorstore is None:
                    vectorstore = FAISS.from_documents([chunk], embeddings)
                else:
                    vectorstore.add_documents([chunk])
                break
            except Exception as e:
                print(f"Error on chunk {i}, attempt {attempt + 1}: {e}")
                time.sleep(60)

    vectorstore.save_local(save_path)
    print(f"Vectorstore saved to {save_path}")


if __name__ == "__main__":
    prepare_and_save_index(REPOS)
```
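Once the index is on disk, retrieval (step 5 from the list above) is the cheap part. Here's a minimal sketch of that step: it loads the saved FAISS index and prints the chunks most similar to a question. Note that recent `langchain-community` versions require the `allow_dangerous_deserialization` flag when loading a locally pickled index, and the embedding deployment must be the same one used to build it.

```python
import os

from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = AzureOpenAIEmbeddings(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
    openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
)

# The flag is required because FAISS indexes are stored with pickle;
# only load indexes you created yourself.
vectorstore = FAISS.load_local(
    "codebase_index", embeddings, allow_dangerous_deserialization=True
)

# Fetch the 4 chunks most similar to the question.
for doc in vectorstore.similarity_search("Where is the database connection established?", k=4):
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:200])
```

From here you can hand the retrieved chunks to a chat model along with the question — the same pattern the fallback approach below uses, just with keyword matches instead of vector hits.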
## What If Embeddings Are Not Available?

If you're on Azure OpenAI and only have access to chat models like `gpt-4o` (which don't generate embeddings), you can still build a less powerful version of this agent using basic text search.

### Keyword Search-Based Q&A

```python
import os
import glob
from typing import List


def search_codebase(repos: List[str], query: str):
    # Scan every matching source file for a case-insensitive keyword hit.
    matching_snippets = []
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                    if query.lower() in content.lower():
                        # Keep the first 500 characters as a preview snippet.
                        matching_snippets.append((file_path, content[:500]))
                except Exception:
                    continue
    return matching_snippets


if __name__ == "__main__":
    results = search_codebase(["C:/YourCodeRepo"], "authentication")
    for path, snippet in results:
        print(f"Found in {path}:")
        print(snippet)
        print("-" * 40)
```

This approach isn't as intelligent, but it can still provide value when paired with a powerful language model: paste the retrieved code snippets directly into your prompt and let the model answer the question.
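To make that concrete, here's a sketch of the prompt-stuffing step. It assumes `search_codebase` from the block above is defined in the same file, and that your `gpt-4o` chat deployment name is in an `AZURE_OPENAI_CHAT_DEPLOYMENT` environment variable — that variable name and the `api_version` value are my own choices, so adjust them to your Azure setup.

```python
import os

from langchain_openai import AzureChatOpenAI

# Hypothetical environment variable for the chat deployment name
# (not part of the setup earlier in the post; rename to match yours).
llm = AzureChatOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT"),
    openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-01",
)

question = "What handles user authentication?"
snippets = search_codebase(["C:/YourCodeRepo"], "authentication")

# Concatenate the top matches into a single context block for the prompt.
context = "\n\n".join(f"File: {path}\n{snippet}" for path, snippet in snippets[:5])

prompt = (
    "You are a codebase assistant. Using only the code snippets below, "
    f"answer the question.\n\n{context}\n\nQuestion: {question}"
)
print(llm.invoke(prompt).content)
```

Capping the context at five snippets keeps the prompt within the model's context window; tune that number to your file sizes and model limits.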
## Conclusion

A Codebase Q&A Agent built on embeddings offers fast, relevant answers to technical questions by leveraging vector search. When embeddings aren't available, a fallback that combines keyword search with prompt engineering can still make your workflow more efficient. Try both and see which one fits your needs best!