How to Make Your Codebase Searchable With AI (Embeddings or Not)

by Shridivya Sharma, July 10th, 2025

Too Long; Didn't Read

Build a local AI-powered codebase Q&A agent using embeddings or keyword search. Boost dev productivity by asking your codebase direct questions.

If you’ve ever had to dive into a complex codebase and felt overwhelmed by the sheer volume of files, classes, and configuration, you’re not alone. Many developers spend countless hours trying to understand codebases they didn't write. This is where a Local Codebase Q&A Agent comes in handy.

In this post, we'll explore:


  1. What a Codebase Q&A Agent is
  2. How to build one using embeddings
  3. How to make it work even if embeddings are not available


What is a Codebase Q&A Agent?

A Codebase Q&A Agent is essentially an AI assistant that can answer your questions about your own source code. Think of it as an AI pair programmer that understands the structure and content of your project.

Instead of manually searching through files, you can ask questions like:


  • "Where is the database connection established?"
  • "What handles user authentication?"
  • "Where are API endpoints defined?"


How Does It Work?

The core idea involves:

  • Loading code files from local directories
  • Splitting them into smaller, manageable text chunks
  • Generating embeddings for these chunks
  • Storing them in a vector database
  • Retrieving relevant chunks based on user queries
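
The steps above can be sketched end to end with nothing but the standard library. This is a toy illustration, not the agent built later in this post: `toy_embed` is a hypothetical stand-in that counts words instead of calling an embedding model, the "vector database" is a plain Python list, and the file names and contents in the demo are made up.

```python
import math
import re
from collections import Counter

def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows."""
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(files):
    """files maps path -> source text; returns a list of (path, chunk, vector)."""
    index = []
    for path, text in files.items():
        for chunk in chunk_text(text):
            index.append((path, chunk, toy_embed(chunk)))
    return index

def retrieve(index, query, k=2):
    """Return the k (path, chunk) pairs most similar to the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return [(path, chunk) for path, chunk, _ in ranked[:k]]

if __name__ == "__main__":
    # Made-up files, only for illustration
    files = {
        "db.go": "func Connect() { // open the database connection pool }",
        "auth.go": "func Login(user, password string) { // verify user credentials }",
    }
    for path, chunk in retrieve(build_index(files), "database connection", k=1):
        print(path, "->", chunk)
```

A real embedding model replaces `toy_embed` with dense vectors that capture meaning rather than exact word overlap, which is what lets a question match code that uses different wording.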


Building the Agent with Embeddings

Requirements

pip install langchain langchain-community langchain-text-splitters langchain-openai faiss-cpu tiktoken

Set the following environment variables:

export AZURE_OPENAI_KEY=your-key
export AZURE_OPENAI_ENDPOINT=https://your-endpoint.openai.azure.com/
export AZURE_OPENAI_DEPLOYMENT=your-embedding-deployment
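
Before running the indexing script, it can help to confirm that these variables are actually visible to Python. The helper below is a small sketch; the function name `missing_env_vars` is an illustrative choice, and it only checks that the three variables named above are set, not that their values are valid.

```python
import os

# The three variables this post asks you to export
REQUIRED_VARS = ["AZURE_OPENAI_KEY", "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_DEPLOYMENT"]

def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Azure OpenAI settings are present.")
```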

Code to Prepare and Save the Index

import os
import glob
import time
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import AzureOpenAIEmbeddings
from langchain_community.vectorstores import FAISS

REPOS = ["C:/YourCodeRepo"]
INDEX_SAVE_PATH = "codebase_index"

def prepare_and_save_index(repos: List[str], save_path: str = INDEX_SAVE_PATH):
    all_docs = []
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                        all_docs.append({"page_content": content, "metadata": {"source": file_path}})
                except Exception as e:
                    print(f"Skipping {file_path}: {e}")

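    splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=100)
    # Pass metadatas so each chunk keeps the path of the file it came from
    chunks = splitter.create_documents(
        [doc['page_content'] for doc in all_docs],
        metadatas=[doc['metadata'] for doc in all_docs],
    )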

    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    )

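    if not chunks:
        print("No matching files found; nothing to index.")
        return

    # The embedding calls happen inside FAISS, so the retry has to wrap them.
    # Index in batches and retry a failed batch (for example, on rate limits).
    batch_size = 100
    vectorstore = None

    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        for attempt in range(5):
            try:
                if vectorstore is None:
                    vectorstore = FAISS.from_documents(batch, embeddings)
                else:
                    vectorstore.add_documents(batch)
                break
            except Exception as e:
                print(f"Error on batch starting at chunk {start}, attempt {attempt+1}: {e}")
                time.sleep(60)
        else:
            raise RuntimeError(f"Giving up on batch starting at chunk {start} after 5 attempts")
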
    vectorstore.save_local(save_path)
    print(f"Vectorstore saved to {save_path}")

if __name__ == "__main__":
    prepare_and_save_index(REPOS)
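
The script above only builds the index. Querying it could look roughly like the sketch below, which assumes the index was saved to `codebase_index` and that the same environment variables are set. The helper names `load_index` and `format_hits` are illustrative, not part of LangChain.

```python
import os
from typing import Iterable

INDEX_SAVE_PATH = "codebase_index"  # same folder the indexing script writes to

def load_index(path: str = INDEX_SAVE_PATH):
    """Load the saved FAISS index, using the same Azure settings as at indexing time."""
    # Imported lazily so the formatting helper below works without these packages
    from langchain_community.vectorstores import FAISS
    from langchain_openai import AzureOpenAIEmbeddings

    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
        azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT"),
        openai_api_key=os.getenv("AZURE_OPENAI_KEY"),
    )
    # save_local pickles part of the index, so loading needs this explicit opt-in.
    # Only enable it for index files you created yourself.
    return FAISS.load_local(path, embeddings, allow_dangerous_deserialization=True)

def format_hits(docs: Iterable) -> str:
    """Render retrieved chunks as text, one block per chunk, with its source if known."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown file")
        blocks.append(f"--- {source} ---\n{doc.page_content}")
    return "\n\n".join(blocks)

# Example usage (needs the saved index and the environment variables above):
# store = load_index()
# hits = store.similarity_search("Where is the database connection established?", k=4)
# print(format_hits(hits))
```

The formatted text can then be pasted into a chat prompt together with your question.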

What If Embeddings Are Not Available?

If you're on Azure OpenAI and only have access to chat models like gpt-4o (which can't generate embeddings), you can still build a less powerful version of this agent using basic text search.

Keyword Search-Based Q&A

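import os
import glob
from typing import List, Tuple

def search_codebase(repos: List[str], query: str) -> List[Tuple[str, str]]:
    """Return (file_path, snippet) pairs for files that contain the query, ignoring case."""
    matching_snippets = []
    needle = query.lower()
    for repo in repos:
        for ext in ["**/*.cs", "**/*.yaml", "**/*.go"]:
            pattern = os.path.join(repo, ext)
            file_paths = glob.glob(pattern, recursive=True)
            for file_path in file_paths:
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        content = f.read()
                except Exception:
                    # Skip files that can't be read as UTF-8 text
                    continue
                pos = content.lower().find(needle)
                if pos != -1:
                    # Keep up to 500 characters of context around the first match
                    start = max(0, pos - 100)
                    matching_snippets.append((file_path, content[start:start + 500]))

    return matching_snippets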

if __name__ == "__main__":
    results = search_codebase(["C:/YourCodeRepo"], "authentication")
    for path, snippet in results:
        print(f"Found in {path}:")
        print(snippet)
        print("-"*40)

This approach isn't as intelligent but can still provide value when used with a powerful language model by pasting the retrieved code snippets directly into your prompt.
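
Pasting the snippets into a prompt can itself be scripted. The sketch below assumes the `(file_path, snippet)` pairs returned by `search_codebase`. The function name `build_prompt`, the instruction wording, and the 6000-character budget are illustrative choices, not requirements of any model.

```python
from typing import List, Tuple

def build_prompt(question: str, snippets: List[Tuple[str, str]], max_chars: int = 6000) -> str:
    """Assemble a prompt from (file_path, snippet) pairs, stopping at a rough character budget."""
    parts = ["Answer the question using only the code excerpts below.", ""]
    used = 0
    for path, snippet in snippets:
        block = f"File: {path}\n{snippet}\n"
        if used + len(block) > max_chars:
            break  # stop before the prompt grows past the budget
        parts.append(block)
        used += len(block)
    parts.append(f"Question: {question}")
    return "\n".join(parts)

if __name__ == "__main__":
    # Made-up search result, only for illustration
    demo = [("auth/login.go", "func Login(user, password string) error { ... }")]
    print(build_prompt("What handles user authentication?", demo))
```

The character budget is a crude guard against oversized prompts; a token-based limit would be more precise.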

Conclusion

A Codebase Q&A Agent built on embeddings can answer technical questions quickly by retrieving the most relevant code through vector similarity search. When embeddings aren't available, a fallback that combines keyword search with prompt engineering can still make your workflow more efficient.

Try both and see which one fits your needs best!
