How to Make Your Codebase Searchable With AI (Embeddings or Not)

If you’ve ever had to dive into a complex codebase and felt overwhelmed by the sheer volume of files, classes, and configuration, you’re not alone. Many developers spend countless hours trying to understand codebases they didn't write. This is where a Local Codebase Q&A Agent comes in handy.

In this post, we'll explore:

What a Codebase Q&A Agent is
How to build one using embeddings
How to make it work even if embeddings are not available

What is a Codebase Q&A Agent?

A Codebase Q&A Agent is essentially an AI assistant that can answer your questions about your own source code. Think of it as an AI pair programmer that understands the structure and content of your project.

Instead of manually searching through files, you can ask questions like:

"Where is the database connection established?"
"What handles user authentication?"
"Where are API endpoints defined?"

How Does It Work?

The core idea involves:

Loading code files from local directories
Splitting them into smaller, manageable text chunks
Generating embeddings for these chunks
Storing them in a vector database
Retrieving relevant chunks based on user queries
Passing the retrieved chunks to a language model to produce the answer
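
The indexing and retrieval steps above can be sketched end to end without any external service. This is a toy illustration only, not the setup used later in this post: a hashed bag-of-words vector stands in for a real embedding model, a plain Python list stands in for the vector database, and the names `chunk_text`, `toy_embed`, `build_index`, and `retrieve` are made up for this example.

```python
import math
import re
from typing import Dict, List, Tuple

DIM = 256  # size of the toy vector; real embedding models use far more dimensions
Index = List[Tuple[str, str, List[float]]]  # (source path, chunk text, vector)


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        vec[sum(ord(c) for c in token) % DIM] += 1.0  # deterministic toy hash
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(files: Dict[str, str]) -> Index:
    """Chunk each file, embed every chunk, and store it with its source path."""
    index: Index = []
    for path, content in files.items():
        for chunk in chunk_text(content):
            index.append((path, chunk, toy_embed(chunk)))
    return index


def retrieve(index: Index, query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Rank stored chunks by cosine similarity to the query and keep the top k."""
    q = toy_embed(query)
    scored = sorted(index, key=lambda row: -sum(a * b for a, b in zip(q, row[2])))
    return [(path, chunk) for path, chunk, _ in scored[:k]]


# In a real run these contents would be read from disk.
files = {
    "db.go": "func connectDatabase() { sql.Open(driver, dsn) } // database connection setup",
    "auth.go": "func checkPassword(user, password) bool { return verify(user, password) }",
}
index = build_index(files)
top_path, _ = retrieve(index, "database connection", k=1)[0]
print(top_path)  # prints db.go
```

A real embedding model replaces `toy_embed` so that matches depend on meaning rather than shared tokens, and a vector store such as FAISS replaces the list so that search stays fast on large codebases.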

Building the Agent with Embeddings

Requirements

pip install langchain langchain-openai langchain-community faiss-cpu tiktoken

Set the following environment variables:

export AZURE_OPENAI_KEY=your-key
export AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
export AZURE_OPENAI_DEPLOYMENT=your-embedding-deployment
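
A missing or empty variable can otherwise surface as a less obvious error once the embedding client is created or called, so it may be worth checking the settings up front. A minimal sketch, assuming the three variable names above; `missing_env_vars` is a hypothetical helper, not part of LangChain or the OpenAI SDK.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = ["AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT"]


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an explicit mapping; call missing_env_vars() to check the real environment.
example = {"AZURE_OPENAI_KEY": "abc", "AZURE_OPENAI_ENDPOINT": ""}
print(missing_env_vars(example))  # prints ['AZURE_OPENAI_ENDPOINT', 'AZURE_OPENAI_DEPLOYMENT']
```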

Code to Prepare and Save the Index

import os
import glob
import time
from typing import List
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

REPOS = ["C:/YourCodeRepo"]
INDEX_SAVE_PATH = "codebase_index"

def prepare_and_save_index(repos: List[str], save_path: str = INDEX_SAVE_PATH):
    all_docs = []
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                        all_docs.append({"page_content": content, "metadata": {"source": file_path}})
                except Exception as e:
                    print(f"Skipping {file_path}: {e}")

    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
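    # Pass the metadata along so every chunk remembers which file it came from.
    chunks = splitter.create_documents(
        [doc["page_content"] for doc in all_docs],
        metadatas=[doc["metadata"] for doc in all_docs],
    )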

    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    )

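    # Embed in batches so that a rate-limit error only retries the failed batch.
    vectorstore = None
    batch_size = 64
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(5):
            try:
                if vectorstore is None:
                    vectorstore = FAISS.from_documents(batch, embeddings)
                else:
                    vectorstore.add_documents(batch)
                break
            except Exception as e:
                print(f"Error on batch starting at chunk {start}, attempt {attempt + 1}: {e}")
                time.sleep(60)
        else:
            raise RuntimeError(f"Giving up on batch starting at chunk {start}")

    if vectorstore is None:
        raise RuntimeError("No chunks to index; check the repo paths and file patterns")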
    vectorstore.save_local(save_path)
    print(f"Vectorstore saved to {save_path}")

if __name__ == "__main__":
    prepare_and_save_index(REPOS)
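
The script above only builds and saves the index. To answer a question, the saved index has to be loaded again and searched. A sketch of that step, assuming the same environment variables and the `codebase_index` path used above; `query_index` and `format_hits` are hypothetical names. Recent LangChain releases require `allow_dangerous_deserialization=True` here because part of the index is stored with pickle, so only load index files you created yourself.

```python
import os
from typing import Any, List


def query_index(question: str, index_path: str = "codebase_index", k: int = 4) -> List[Any]:
    """Load the saved FAISS index and return the k chunks closest to the question."""
    # Imported here so format_hits below can be used without LangChain installed.
    from langchain_community.vectorstores import FAISS
    from langchain_openai import AzureOpenAIEmbeddings

    # Must be the same embedding deployment that was used to build the index.
    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    )
    # The index is pickled on disk, so only load files you generated yourself.
    store = FAISS.load_local(index_path, embeddings, allow_dangerous_deserialization=True)
    return store.similarity_search(question, k=k)


def format_hits(hits: List[Any], max_chars: int = 300) -> str:
    """Render retrieved chunks as text that can be read or pasted into a prompt."""
    parts = []
    for hit in hits:
        # Falls back to a placeholder when a chunk carries no source metadata.
        source = hit.metadata.get("source", "unknown file")
        parts.append(f"# {source}\n{hit.page_content[:max_chars]}")
    return "\n\n".join(parts)
```

With an index on disk, `print(format_hits(query_index("Where is the database connection established?")))` would show the closest chunks, which can then be handed to a chat model as context.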

What If Embeddings Are Not Available?

If your Azure OpenAI resource only has chat model deployments such as gpt-4o, and no embedding model deployment, you can still build a less powerful version of this agent using basic text search.

Keyword Search-Based Q&A

import os
import glob
from typing import List

def search_codebase(repos: List[str], query: str):
    matching_snippets = []
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
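                        content = f.read()
                        idx = content.lower().find(query.lower())
                        if idx != -1:
                            # Keep a window around the first match, not the top of the file.
                            start = max(0, idx - 200)
                            matching_snippets.append((file_path, content[start:start + 500]))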
                except Exception as e:
                    continue

    return matching_snippets

if __name__ == "__main__":
    results = search_codebase(["C:/YourCodeRepo"], "authentication")
    for path, snippet in results:
        print(f"Found in {path}:")
        print(snippet)
        print("-"*40)

This approach only matches the literal query string, so it works with keywords such as "authentication" rather than full questions. It can still provide value when paired with a powerful language model: paste the retrieved code snippets directly into your prompt.
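
Pasting snippets into a prompt can be scripted too. A minimal sketch, assuming the `(path, snippet)` tuples returned by `search_codebase` above; `build_prompt` and its `max_chars` budget are hypothetical, and the resulting string can be sent to whichever chat model you have access to.

```python
from typing import List, Tuple


def build_prompt(question: str, snippets: List[Tuple[str, str]], max_chars: int = 6000) -> str:
    """Combine a question and code snippets into one prompt, within a character budget."""
    sections = []
    used = 0
    for path, snippet in snippets:
        block = f"File: {path}\n{snippet}\n"
        if used + len(block) > max_chars:
            break  # stop before the context grows past the budget
        sections.append(block)
        used += len(block)
    context = "\n".join(sections) if sections else "(no matching code found)"
    return (
        "Answer the question using only the code below. "
        "Say so if the code does not contain the answer.\n\n"
        f"{context}\n"
        f"Question: {question}"
    )


demo = build_prompt(
    "What handles user authentication?",
    [("auth/login.go", "func Login(user string) error { ... }")],
)
print(demo)
```

The character budget is a rough guard against overrunning the model's context window; a token-based limit would be more precise.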

Conclusion

A Codebase Q&A Agent built on embeddings retrieves code by meaning rather than exact wording, so a question can surface relevant code even when it shares no keywords with it. When embeddings aren't available, keyword search combined with a carefully built prompt is a workable fallback that can still make your workflow more efficient.

Try both and see which one fits your needs best!