A Practical 5-Step Guide to Do Semantic Search on Your Private Data With the Help of LLMs

Written by zbruceli | Published 2023/05/03
Tech Story Tags: llms | langchain | semantic-search | vector-database | large-language-models | llama | language-models | guide

TL;DR: In this practical guide, I will show you 5 simple steps to implement semantic search with the help of LangChain, vector databases, and large language models. Best of all, I will use only open-source components that can be run locally on your own machine. No need for any cloud SaaS or API keys, and your data will never leave your office or home.

If you have a lot of private enterprise data, how can you use a ChatGPT-like AI system to help you search for relevant information faster? Or if you have lots of personal information, how can you find old memories without the exact keywords?

In this practical guide, I will show you 5 simple steps to implement semantic search with the help of LangChain, vector databases, and large language models. Best of all, I will use all open-source components that can be run locally on your own machine.

No need for any cloud SaaS or API keys, and your data will never leave your office or home.

Use Case Analysis

Our community has a Discourse-based forum that has been running for over 5 years. It holds a wealth of information, but newcomers sometimes find it hard to get the right answers to their questions quickly and succinctly.

There are a few reasons for this, which are very common to Google/PageRank-based search solutions:

  1. Search is based on keywords, and often, users do not know the right terms or exact wording.

  2. Even if the user figures out the best search terms for their particular question, the returned result is a list of top discussion threads, and the answer might not be in the first few posts or could be buried deep in a prolonged discussion thread.

  3. In general, search engines do not handle long sentences well, for example if you ask a long-winded question or pack two questions into one sentence.

You might ask another question, “ChatGPT v4 is already so good, why don’t you ask questions directly in ChatGPT?” Again, there are several strong reasons not to:

  • The ChatGPT (v3.5 or v4) model was trained on data from before November 2021, so it does not have any information more recent than that.

  • The latest OpenAI web retrieval plug-in can search the web for newer information, but it is still quite slow and often gets stuck.

  • It is neither efficient nor cost-effective for your enterprise's internal use.

  • You don’t want OpenAI and Microsoft to access your proprietary enterprise data or your personal data.

So, my goal is that if someone comes in and asks, “What is the easiest way to run an NKN node?”, he/she should be able to get a straight answer in a few sentences.

I use this public NKN forum only as an example to show you the typical dataset and typical questions and answers for a specialized enterprise knowledge base.

Design Approach

After some research and experiments, I narrowed down my plan into 5 steps:

  1. Do a quick Proof of Concept using cloud service and API

  2. Clean and prep my data

  3. Convert my entire data set into a vector database

  4. Use vector database to find the best matches to my query

  5. Use LLM to distill the best answer to the original question

If you want to pause here and learn some basics about LangChain, vector databases, large language models, and how they work together, I recommend this video and this video. Below is a short introduction to these concepts.

Key Components

LangChain: LangChain is a framework for developing applications powered by language models. It is the glue or middleware that chains all the different components together to accomplish complex tasks semi-autonomously.

The most powerful and differentiated applications will not only call out to a language model via an API but will also:

  • Be data-aware: Connect a language model to other sources of data.

  • Be agentic: Allow a language model to interact with its environment.

Vector database: A vector database is a specialized storage and retrieval system designed to handle high-dimensional vector data efficiently and is often used in machine learning, artificial intelligence, and data analytics applications.

These databases enable users to perform similarity searches or nearest neighbor searches, allowing them to find the most similar items within the database based on their vector representations. Pinecone is one of the leading commercial vector databases.

There are open-source alternatives such as FAISS, Chroma, and Supabase.
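
As a minimal illustration of what “similarity search” means here, the toy sketch below scores a handful of made-up 3-dimensional vectors against a query vector with cosine similarity. Real vector databases do the same thing at much higher dimensions with smarter indexing; the vectors and labels here are purely hypothetical:

import numpy as np

# Toy "database" of 3-dimensional vectors; real embeddings have hundreds
# or thousands of dimensions.
vectors = np.array([
    [0.9, 0.1, 0.0],   # e.g., a sentence about running a node
    [0.1, 0.8, 0.1],   # e.g., a sentence about token economics
    [0.2, 0.2, 0.9],   # e.g., a sentence about wallets
])
query = np.array([0.8, 0.2, 0.1])

# Cosine similarity: higher means more semantically similar
scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
best = int(np.argmax(scores))
print(f"Closest entry: {best}, similarity: {scores[best]:.3f}")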

Large Language Model: A large language model like GPT (Generative Pre-trained Transformer) is an advanced artificial intelligence model that utilizes deep learning techniques to process and generate human-like text.

With billions of parameters, GPT-3/4 is trained on massive amounts of data from diverse sources, allowing it to understand context, semantics, and syntax.

As a result, GPT-3/4 can perform various natural language processing tasks, such as text generation, translation, summarization, and question-answering, with remarkable accuracy and fluency.

There are several community-driven LLM models based on Meta’s LLaMA, e.g., Alpaca, Vicuna, Koala, and GPT4All, which can be run locally on a laptop (read my tutorial here).

Step 1: A Proof of Concept

Before you spend more effort on your custom solution, let’s first get familiar with the key framework LangChain and the components in a playground. Thanks to GUI-based flowise.ai, you can now play with this entire solution without any code.

This step is optional, and you can skip to step 2 if you want to go straight into coding locally or if you don’t have API keys for OpenAI or Pinecone.

Prerequisites:

  1. Obtain an OpenAI API key and a Pinecone API key in advance. Both are useful not only for this exercise but also for other AI-related experiments.

  2. In Pinecone, create a new index called “test” with dimension 1536. Note the environment name auto-generated by Pinecone to designate where this index instance runs, e.g., “us-west1-gcp-free.” You can use the default metric of “Cosine.” (A small script to create the index programmatically is sketched after this list.)

  3. A small dataset in text file format (let’s say test.txt): it could be as simple as copy-pasting a few paragraphs from a Word or PDF document. Since both APIs are cloud-based, for privacy reasons you want to use a very small dataset, preferably of publicly available information.
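
If you prefer to create the index programmatically instead of through the Pinecone console, a minimal sketch using the v2-era pinecone Python client looks like the following (newer client versions have a different API, so treat this as illustrative):

import pinecone

# Replace with your own API key; the environment name is shown in the Pinecone console.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="us-west1-gcp-free")

# 1536 dimensions matches OpenAI's text-embedding-ada-002 embeddings;
# cosine is the similarity metric used in this exercise.
pinecone.create_index("test", dimension=1536, metric="cosine")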

Now, head to Flowise AI, download the code from their GitHub, and install and run it locally (you need to have Node.js installed). You can then access it from a browser at http://localhost:3000/marketplaces.

From the Marketplace, choose the template called “Conversational QA Retrieval Chain”. The UI is self-explanatory: fill in the OpenAI API key twice (once for embedding and once for the LLM), plus the Pinecone API key, index name, environment, and metric.

Then upload the dataset text file. It might take a couple of minutes depending on the size of your text file. Finally, click the little chat bubble in the top-right corner, and you can start asking questions!

This gives you a visual understanding of the data processing pipeline for our forum use case, and will help you implement your own solution by replacing the cloud APIs with local services running open-source software.

LangChain will stay put since it is open-source software and the master glue we need. The other two components are swapped as follows:

  • Large language model: OpenAI -> LLaMA/Alpaca

  • Vector database: Pinecone -> FAISS

But before we do that, we first have to clean up and prepare your enterprise data for ingestion into the vector database.

Step 2: Clean and Prep Data

The NKN forum uses Discourse software, a popular open-source discussion and community forum solution. It uses a PostgreSQL database as its core data store, which has two tables that are especially relevant for our Q&A retrieval use case: topics and posts.

For example, in the database schema of “posts”, the most relevant columns for us are “topic_id” (matching the topics table), “raw” (the original post content as plain text without formatting), and “score” (how popular or well-read the post is).

It is quite natural to map “topics” to questions (or “prompts”), and “posts::raw” to answers (“completion”). The terms in parentheses are for GPT-like LLMs. There are many implementation considerations and optimizations here, which could fill another article.

But I would like to keep it simple and choose only the best post out of the entire thread to match the original topic.

So, first of all, we need to get both tables out of the database and into JSON files. To do that, I did a DB dump in Discourse and then imported it into a standalone PostgreSQL database running on my MacBook Pro M1.

From the standalone PostgreSQL database, I can export JSON files for the two tables: posts and topics.

Then I wrote a simple Python script to do the following (a sketch of such a script is shown after this list):

  • For each topic, use SQL to select the post with the highest “score.”

  • The resulting “topic” and “post::raw” become one “prompt” and “completion” pair.

  • Save the above into a JSON file called training.json (you can use it for many other training purposes, e.g., fine-tuning an LLM via GPT-3’s fine-tuning API).

  • Write a text file called training.txt: one line for the prompt, one line for the completion, with each pair separated by a blank line (two newlines).
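
Here is a rough sketch of that script. It assumes the two exports are JSON arrays named topics.json and posts.json, and that topics carry “id” and “title” fields alongside the “topic_id”, “raw”, and “score” columns mentioned above; adjust the file and field names to match your actual export:

import json

# Hypothetical file names and field names; adjust to match your actual export.
with open("topics.json") as f:
    topics = json.load(f)
with open("posts.json") as f:
    posts = json.load(f)

pairs = []
for topic in topics:
    # Pick the highest-scoring post in this topic as the "completion"
    candidates = [p for p in posts
                  if p["topic_id"] == topic["id"] and p.get("score") is not None]
    if not candidates:
        continue
    best = max(candidates, key=lambda p: p["score"])
    pairs.append({"prompt": topic["title"], "completion": best["raw"]})

# Save the prompt/completion pairs for other uses, e.g., fine-tuning
with open("training.json", "w") as f:
    json.dump(pairs, f, indent=2)

# One line for the prompt, one line for the completion, a blank line between pairs
with open("training.txt", "w") as f:
    for pair in pairs:
        f.write(pair["prompt"].strip() + "\n")
        f.write(pair["completion"].replace("\n", " ").strip() + "\n\n")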

Now your file should look a bit like this:
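
(Illustrative placeholders rather than actual forum content:)

What is the easiest way to run an NKN node?
The highest-scoring reply from that topic, condensed onto a single line.

How do I check the status of my node?
The highest-scoring reply from that topic, condensed onto a single line.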

Step 3: Convert My Entire Data Set Into a Vector Database

I pasted the Python script below, which is mostly self-explanatory. A few notes:

  1. You need to download or find the weights for your local LLM. In my case, I used the Alpaca 7B-parameter 4-bit quantized version, which you can find on Hugging Face. There are many similar models fine-tuned from Meta’s LLaMA 7B/13B, e.g., Alpaca, Vicuna, Koala, and GPT4All; you can use any of them.

  2. Optionally, follow my earlier tutorial to install llama.cpp locally and run it successfully with the weights from above, so you know the weights are good.

  3. The text splitter will split your long text file into chunks, each having a max length of 1,000.

  4. Then, each chunk of text is fed into the LLM for embedding, which basically tokenizes the words and converts them into vectors based on their meaning. A chunk, after tokenization, might become up to 1,400 tokens, since on average 3 words produce about 4 tokens. This is also why I set n_ctx = 2048 (the default is 512), to have a large enough context window for embedding.

  5. Finally, the vectors are inserted into the FAISS vector database and indexed. The database is saved as a local file in a directory named “faiss_index”.

from langchain.embeddings import LlamaCppEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader

# Local llama.cpp embeddings; n_ctx=2048 provides a large enough context window
embeddings = LlamaCppEmbeddings(model_path="./ggml-alpaca-7b-q4.bin", n_ctx=2048)

# Load the prepared text file and split it into overlapping chunks
loader = TextLoader('./training_full_clean.txt')
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len)
docs = text_splitter.split_documents(documents)

# Embed every chunk, build the FAISS index, and save it to a local directory
db = FAISS.from_documents(docs, embeddings)
db.save_local("faiss_index")

# Now let's test it out
query = "Who is Whitfield Diffie?"
docs = db.similarity_search(query)
for doc in docs:
    print(doc.page_content)

I highly recommend starting with a very small dataset text file to test first; if your dataset is large, this will take a long time. My locally run llama.cpp can embed about 20 tokens per second, so each chunk of roughly 1,000 words could take about a minute to embed.

My vector DB has about 7,500 entries, so embedding takes about 100 hours on a MacBook Pro M1. The good news is that you only need to do this once. [Author note: see the comments for how to use a different embedding module, such as Hugging Face, for much faster performance on large datasets.]
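
As the note above suggests, one way to speed up embedding dramatically is to swap the llama.cpp embeddings for a small sentence-transformer model via LangChain’s HuggingFaceEmbeddings wrapper. Here is a sketch, assuming you have the sentence-transformers package installed (the model name below is a common default, not a requirement), reusing the docs produced by the text splitter above:

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# A small sentence-transformer embeds text far faster on CPU than llama.cpp;
# all-MiniLM-L6-v2 is a commonly used default.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

db = FAISS.from_documents(docs, embeddings)  # docs from the text splitter above
db.save_local("faiss_index_hf")              # keep it separate from the llama.cpp index

Whichever embedding model you build the index with must also be the one you load it with in Step 4.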

Step 4: Use a Vector Database to Find the Best Matches to My Query

Again, I provide some simple Python code to do the query and summarization. Since we use a template chain called “RetrievalQA”, some details might not be obvious. But under the hood, it involves the following steps:

  • When a user types in a query, the embedding model (here, llama.cpp) converts the text into an embedding: a vector (basically an array of floating-point numbers) representing the query.

  • The vector database then does a similarity search to find the candidate result vectors that are closest related to the query vector.

  • Now, we pass the text chunks corresponding to those candidate vectors, already encoded in a way the LLM understands, to the LLM as context for final processing (a rough manual sketch of these retrieval steps appears right after the script below).

from langchain.vectorstores import FAISS
from langchain.llms import LlamaCpp
from langchain.embeddings import LlamaCppEmbeddings
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import RetrievalQA

# Load DB from local file
embeddings = LlamaCppEmbeddings(model_path="./ggml-alpaca-7b-q4.bin")
new_db = FAISS.load_local("faiss_index", embeddings)

# Use llama-cpp as the LLM for langchain
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
llm = LlamaCpp(
    model_path="./ggml-alpaca-7b-q4.bin",
    n_ctx= 2048,
    callback_manager=callback_manager, 
    verbose=True,
    use_mlock=True
)

retriever = new_db.as_retriever()

# Conversational QA retrieval chain
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        break

    response = qa.run(user_input)
    print(f"AI: {response}")

Step 5: Use LLM to Distill the Best Answer to the Original Question

In the final step, and again under the hood, the “RetrievalQA” chain automatically constructs a special prompt, which includes the following:

  • System message: Instruct the LLM what kind of task it needs to do; in this case, “Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.”

  • User message: Add the text of the best candidate results as context, followed by the original question at the end (a rough reconstruction of this prompt template is sketched below).
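
For reference, the “stuff” chain combines these pieces using a prompt template roughly like the sketch below. This is a hedged reconstruction of LangChain’s default QA prompt rather than its exact wording; check the library source for the version you are running:

from langchain.prompts import PromptTemplate

# Approximation of the default "stuff" QA prompt: {context} receives the
# retrieved text chunks and {question} receives the user's original query.
qa_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n\n"
        "{context}\n\n"
        "Question: {question}\n"
        "Helpful Answer:"
    ),
)

# To customize the wording, you could pass it into the chain like this
# (check your LangChain version's documentation for the exact argument):
# qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever,
#                                  chain_type_kwargs={"prompt": qa_prompt})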

Once the LLM receives such a prompt, it will try its best to sift through the candidate results and find the best answer to the original question. It summarizes, rephrases, and can even produce separate responses to two separate questions in the original query (e.g., “Who is Stephen Wolfram, and how is he related to NKN?”).

Sometimes, we can be quite amazed at how capable LLMs have become at handling this type of Q&A task.

What Next?

Congratulations, you have just successfully built your first semantic search tool on your private data using an entirely open-source toolkit! Now, you are well on your way to your office conference room to show off your new demo! And what can you do next?

  • Try alternative components for vector databases or large language models that fit your data’s needs best.

  • Optimize, scale, and productize your semantic search solution for your entire company.

  • Dig deeper into LangChain and its infinite capabilities to solve other pressing work or life-related problems. Maybe you can even contribute your own chain as a template for other people to use.

Happy hacking! 🙂

