One of the use cases everyone wanted to see after ChatGPT reached the public was a ChatGPT-like experience on top of their own data.
In this example, I will use LangChain (which raised $10M at a $100M valuation) to access OpenAI’s API.
As a Product Manager at Azure Files, the most important information I would like to interact with is the publicly available Azure Files documentation. I downloaded it as a PDF for this exercise. If you are following along, download whatever information you want to build your chatbot on as one or more PDFs. You can also use other formats, but I will stick to PDF in this example.
The process we will use to build this chatbot is often referred to as retrieval augmented generation (RAG). The following image explains the different steps involved in creating the chatbot that will help me do my job better and faster.
Let’s write the code to create our chatbot. I am using LangChain along with OpenAI. You will need an OpenAI API key and an IDE of your choice to follow along. I am using VS Code and a Python virtual environment.
The first step is to load the data from the folder named data using the document loaders available in LangChain. LangChain provides loaders for 90+ sources, so you can load data not just from PDFs but from almost any source you want.
Next, split the data into smaller chunks with a chunk size of 1500 and a chunk overlap of 150. The overlap means each chunk repeats the last 150 units of the previous chunk, so the context doesn’t get cut off abruptly at a chunk boundary. If you use one of LangChain’s character-based splitters with default settings, these sizes are counted in characters rather than tokens. There are different ways to split your data, and each has its own pros and cons. Check out the LangChain text splitters to get more ideas on which one to use.
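To make the overlap idea concrete, here is a minimal sketch in plain Python. It is not LangChain's splitter (which also tries to break on paragraph and sentence boundaries); the function name split_with_overlap is made up for illustration, and it assumes sizes are measured in characters.

```python
def split_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one.

    Hypothetical helper for illustration only, not a LangChain API.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "Azure Files offers fully managed file shares in the cloud. " * 100
    pieces = split_with_overlap(sample)
    print(len(pieces), "chunks")
    # The tail of one chunk is the head of the next one.
    print(pieces[0][-150:] == pieces[1][:150])
```

A larger overlap lowers the chance that an answer is cut in half at a boundary, but it also means more chunks to embed and store.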
In this step, we convert each split into an embedding vector. An embedding vector is a list of numbers that represents a piece of text.
The embedding vector of a text block captures the context and meaning of the text. Texts that have similar meanings will have similar vectors. To convert the splits into their embedding vectors, we use OpenAIEmbeddings. A special type of database, known as a vector database, is required to store these embeddings. In our example, we will use Chroma since it can be kept in memory. There are other options you can use, like Pinecone, Weaviate, and more. After storing the embeddings for my splits in the Chroma vector DB, I also persist it to disk so I can reuse it in the next steps.
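To show what "similar vectors" and "a database for vectors" mean in practice, here is a toy sketch in plain Python. It is not Chroma and it does not call OpenAI: the class TinyVectorStore, its methods, and the 3-number vectors are all made up for illustration (real embeddings have hundreds or thousands of dimensions). Similarity is measured with cosine similarity, one common choice.

```python
import json
import math
from pathlib import Path


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store that keeps (text, vector) pairs in a list."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, list(vector)))

    def search(self, query_vector, k=3):
        """Return the k texts whose vectors are most similar to the query."""
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]

    def persist(self, path):
        """Save everything to a JSON file so it can be reused later."""
        Path(path).write_text(json.dumps(self.items))

    @classmethod
    def load(cls, path):
        store = cls()
        for text, vector in json.loads(Path(path).read_text()):
            store.add(text, vector)
        return store


if __name__ == "__main__":
    store = TinyVectorStore()
    # Made-up vectors: the first two point in a similar direction.
    store.add("SMB file shares", [0.9, 0.1, 0.0])
    store.add("NFS file shares", [0.8, 0.2, 0.1])
    store.add("Billing and invoices", [0.0, 0.1, 0.9])
    print(store.search([1.0, 0.0, 0.0], k=2))
```

A real vector database does the same job at scale, with indexes that avoid comparing the query against every stored vector.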
We have stored our embeddings in the Chroma DB. In our chatbot experience, when the user asks a question, we send that question to the vector DB and retrieve the data splits that might contain the answer. Retrieving the right data splits is a huge and evolving subject, and there are many ways to retrieve depending on your application’s needs. I am using a common retrieval method called Maximal Marginal Relevance (MMR), which picks chunks that are relevant to the question but not near-duplicates of each other. There are other techniques to explore, like basic semantic similarity, LLM-aided retrieval, and more. I will write a separate post about MMR and the others. For this post, think of retrieval as the process of getting the top 3 data chunks that could contain the context or answer for the question the user asked the chatbot. Once we retrieve the relevant chunks, we pass them to the OpenAI LLM and ask it to generate an answer using the prompt. See my previous post about writing good prompts.
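If you want to see the idea behind MMR without any library, here is a rough sketch in plain Python. It is a simplified version of the technique, not LangChain's or Chroma's implementation. The names mmr_select and build_prompt, the lambda_mult weight of 0.5, the toy vectors, and the prompt wording are all assumptions made for illustration; the prompt I actually used is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indexes of up to k documents chosen by Maximal Marginal Relevance.

    Each round picks the document with the best trade-off between
    relevance to the query and similarity to documents already picked.
    lambda_mult=1.0 means pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine_similarity(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine_similarity(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def build_prompt(question, chunks):
    """Assemble a hypothetical RAG prompt from the retrieved chunks."""
    context = "\n\n".join(chunks)
    return (
        "Use only the context below to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = ["chunk A", "chunk A (duplicate)", "chunk B", "chunk C"]
    vectors = [[1, 1, 0.2], [1, 1, 0.2], [0, 1, 1], [1, 0, 0]]  # toy embeddings
    picked = mmr_select([1, 1, 1], vectors, k=2)
    print(build_prompt("Example question?", [chunks[i] for i in picked]))
```

In this toy data the second chunk is an exact copy of the first. A plain top-2 similarity search would return both copies, while MMR skips the copy in favor of a chunk that adds new information.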
The result I got is very accurate. Not only did the LLM correctly identify that there is no feature called polling, it also found a contextually relevant feature called change detection, which is similar to what polling refers to in a lot of products.
If you need the full code for this and the images are not helping, reach out to me on Twitter. That’s it for Day 6 of 100 Days of AI!
If you understand RAG and the concepts associated with it, like embeddings, vector databases, and retrieval techniques, you can generate a lot of ideas for building interesting chatbots. Feel free to reach out and share those ideas with me.
AI PRODUCT IDEA ALERT 1: All organizations will want chat-with-your-data applications, and they will also want their employees to create custom chat-with-your-data applications based on their needs without writing any code. Microsoft and other companies are launching products and features to enable large organizations to do this via Azure OpenAI. But I think there will be startups competing for this space as well.
AI PRODUCT IDEA ALERT 2: ChatGPT for doctors. It won’t be trained on the internet, but instead on curated data from all relevant textbooks, recent research, best practices as put forward by a respected board, and so on. I see immense value in a properly curated LLM tuned for doctors.
AI PRODUCT IDEA ALERT 3: Similar to idea 2, think of LLMs fine-tuned for educational use cases, where all the information is vetted and accurate. More broadly, I think other verticals also need curated data sets instead of the whole internet.
I write a newsletter called Above Average where I talk about the second-order insights behind everything that is happening in big tech. If you are in tech and don’t want to be average, subscribe to it.
Follow me on Twitter and LinkedIn for the latest updates on 100 Days of AI. If you are in tech, you might be interested in joining my community of tech professionals here.
Also published here.