Previous Piece: 100 Days of AI Day 6: Retrieval Techniques and Their Use Cases

One of the use cases everyone wanted to see after ChatGPT broke out to the public was a ChatGPT-like experience on top of their own data. In this example, I will use Langchain (which raised $10M at a $100M valuation) to access OpenAI's API.

As a Product Manager at Azure Files, the most important information I would like to interact with is the publicly available Azure Files documentation. I downloaded it as a PDF for this exercise. If you are following along, download whatever information you want to build your chatbot on in the form of one or more PDFs. You can also use other formats, but I will be sticking to PDF in this example.

The process by which we will build this chatbot is often referred to as retrieval augmented generation (RAG). The following image explains the different steps involved in creating the chatbot that will help me do my job better and faster.

Let's write the code to create our chatbot. I am using Langchain along with OpenAI to create the chatbot. You will need an OpenAI secret key and an IDE of your choice to follow along. I am using VS Code and a virtual Python environment.

Step 1 – Load the PDFs

The first step is to load data from the folder data using the document loaders available in Langchain. Langchain provides data loaders for 90+ sources, so you can load data not just from PDFs but from almost any source you want.

Step 2 – Splitting

Split the data into smaller chunks with a chunk size of 1500 and a chunk overlap of 150, which means each consecutive chunk shares 150 tokens with the previous chunk so that the context doesn't get split abruptly. There are different ways to split your data, and each of them has different pros and cons. Check out the Langchain splitters to get more ideas on which splitter to use.

Step 3 – Store

In this step, we convert each split into an embedding vector.
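To see what a "split" looks like concretely, here is a minimal sketch of the Step 2 chunking in plain Python. It is not Langchain's splitter: the function name split_with_overlap is hypothetical, and it counts characters rather than tokens and cuts at fixed positions, whereas real splitters usually try to break on paragraph or sentence boundaries. It only illustrates how a chunk size of 1500 and an overlap of 150 interact.

```python
def split_with_overlap(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    Hypothetical helper for illustration only.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Stand-in for the text extracted from a PDF.
    sample = "".join(chr(97 + i % 26) for i in range(4000))
    pieces = split_with_overlap(sample)
    print(len(pieces), [len(p) for p in pieces])
    # The tail of one chunk is repeated at the head of the next one.
    print(pieces[0][-150:] == pieces[1][:150])
```

With the defaults, a new chunk starts every 1350 characters, which is why the last 150 characters of one chunk reappear at the start of the next.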
An embedding vector is a mathematical representation of text. The embedding vector of a text block captures the context and meaning of the text, so texts with similar meanings or contexts have similar vectors. To convert the splits into their embedding vectors, we use OpenAIEmbeddings.

A special type of database, known as a vector database, is required to store these embeddings. In our example, we will use Chroma, since it can be stored in memory. After storing the embeddings for my splits in the Chroma vector db, I also persist it to reuse in the next steps. There are other options you can use, like Pinecone, Weaviate & more.

Steps 4 & 5 – Retrieval & Output Generation

We have stored our embeddings in the Chroma db. In our chatbot experience, when the user asks a question, we send that question to the vector db and retrieve the data splits that might have the answer. Retrieving the right data splits is a huge and evolving subject. There are a lot of ways to retrieve, depending on your application's needs. I am using a common method of retrieval called Maximal Marginal Relevance (MMR). You can learn about more techniques like basic semantic similarity, LLM-aided retrieval & more. I will write a separate post about MMR and the others. For this post, consider retrieval as the process of getting the top 3 data chunks that could have the context or answer for the question the user asked the chatbot.

Once we retrieve the relevant chunks, we pass them to the OpenAI LLM and ask it to generate an answer by using the prompt. See my previous post about writing good prompts.

The result I got is very accurate. Not only did the LLM correctly identify that there is no feature called polling, it also found a contextually relevant feature called change detection, which is similar to what polling refers to in a lot of products.

If you need the full code for this and the images are not helping, reach out to me on Twitter.

That's it for Day 6 of 100 Days of AI!
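For readers who want something runnable without the screenshots, the MMR retrieval idea from Steps 4 & 5 can be sketched in plain Python. This is an illustration under stated assumptions, not the code from the post and not Langchain's implementation: the names cosine and mmr_select are hypothetical, the 2-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), and libraries often fetch a larger candidate pool from the vector db before re-ranking. The sketch only shows the core trade-off: each pick balances similarity to the question against similarity to the chunks already picked.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query; lower values
    penalise candidates that resemble documents that were already selected.
    Hypothetical helper for illustration only.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


if __name__ == "__main__":
    # Made-up embeddings: chunks 0 and 1 are near-duplicates of each other.
    query = [1.0, 0.0]
    docs = [[0.9, 0.1], [0.9, 0.12], [0.7, -0.7], [0.0, 1.0]]

    by_similarity = sorted(
        range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True
    )[:2]
    print(by_similarity)                   # [0, 1]: both near-duplicates
    print(mmr_select(query, docs, k=2))    # [0, 2]: the duplicate is skipped
```

In the toy data, plain similarity search returns the two near-duplicate chunks, while MMR keeps the best one and then prefers a less redundant chunk. In the chatbot, the selected chunks are what gets passed to the LLM together with the user's question.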
If you understand RAG and the concepts associated with it, like embeddings, vector databases, and retrieval techniques, you can generate a lot of ideas for building interesting chatbots. Feel free to reach out and share those ideas with me.

AI PRODUCT IDEA ALERT 1: All organizations will want "chat with your data" applications, and will also want their employees to create custom "chat with your data" applications based on their needs without writing any code. Microsoft and other companies are launching products and features to enable large organizations to do this via Azure OpenAI. But I think there will be startups competing for this space as well.

AI PRODUCT IDEA ALERT 2: ChatGPT for doctors. It won't be trained on the internet, but instead on curated data from all relevant textbooks, recent research, and best practices as put forward by a respectable board, etc. I see immense value in a properly curated LLM tuned for doctors.

AI PRODUCT IDEA ALERT 3: Similar to idea 2. Think of LLMs fine-tuned for educational use cases, where all the info is vetted and accurate. Moreover, I think other verticals need curated data sets instead of the whole internet.

I write a newsletter called Above Average where I talk about the second-order insights behind everything that is happening in big tech. If you are in tech and don't want to be average, subscribe to it.

Follow me on Twitter and LinkedIn for the latest updates on 100 Days of AI. If you are in tech, you might be interested in joining my community of tech professionals here.

Also published here.

Meet the Writer: HackerNoon Contributor Nataraj Sindam on Experimenting With AI 

100 Days of AI Day 7: Building Your Own ChatGPT with Langchain
