https://youtu.be/fo0cg5pAh2

Introduction

I'm sure you've all heard about ChatGPT by now. It's an amazing Large Language Model (LLM) system that is opening up new and exciting capabilities. However, it has been trained on a huge corpus of text from across the internet, so what if you want to query your own file or files? Thanks to the simple (but powerful!) OpenAI API and the amazing work done by the team at LangChain, we can knock up a basic Question and Answering application that answers questions from your files.

This is all very new technology, so I'm also learning as I go along and am always open to hearing feedback and improvements I can make - feel free to comment!

The goal of the article is to get you started with Question and Answering over your own document(s). However, as described in the Improvements section below, various aspects can be optimized. If there's enough interest, I can go into more detail about those topics in future articles.

Sound good? Let's get to it! (Full code is on my GitHub)

High-Level Steps

1. Set up our development environment, API Key, and dependencies.
2. Load in our file or directory containing multiple files.
3. Create and (optionally) persist our database of embeddings (I'll briefly explain what they are later).
4. Set up our chain, and ask questions about the document(s) we loaded in.

Prerequisites

You'll need:

- An OpenAI API Key. I recommend putting a hard limit on pricing so you don't accidentally go over, especially when experimenting with code. New users may automatically get free credit, but I've had my account for more than 3 months, so those credits expired for me. You can also use the OpenAI calculator to estimate costs - we'll be using the gpt-3.5-turbo model in this article.
- A developer environment (of course). I'm using the OpenAI Python SDK, LangChain, and VS Code for the IDE. The requirements.txt file is available in the GitHub repo.
- A file or files to test with.
I recommend starting with a single file. As someone with a quant fund background who uses this for trading information, I'll be using the Microsoft Q2 FY23 Earnings Call Transcript (from this page).

Set up the OpenAI Key

1. If you haven't done so already, create an account on OpenAI.
2. (Optional but recommended) Go to Billing...Usage Limits... and set your Soft and Hard limits. I used £10, but feel free to use whatever you're comfortable with. This prevents you from spending more than you expected, which is especially useful when prototyping and experimenting with the API.
3. If you haven't got free credits, you may need to enter your payment details to gain access.
4. Head over to the API Keys section and generate a new secret. Copy this secret before closing the window; otherwise, you won't get a chance to see it in full again.
5. When dealing with API keys and secrets, I like to use environment variables for security. So in your directory, create a file called ".env" (note the full-stop/period at the beginning). In the .env file, type OPENAI_API_KEY = '<your secret key from above>'

    # [.env file]
    # enter your entire key here
    OPENAI_API_KEY = 'sk-....'

6. If you're using Git, create a .gitignore file and add ".env" to it, as we don't want to commit this to our repo by accident and leak our secret key! I've also added "db/", which will be our database folder. The database could contain personal document data, so I'm making sure that doesn't get committed either. Note that Git only treats "#" as a comment at the start of a line, so the comments sit on their own lines.

    # [.gitignore file]
    # This will prevent the .env file from being committed to your repo
    .env
    # This will be our database folder. I don't want to commit it, so adding it here
    db/

7. Install all the required dependencies. Download the requirements.txt file from the GitHub repo and run:
    pip3 install -r requirements.txt

Alternatively, you can manually use pip to install the dependencies below:

    chromadb==0.3.21
    langchain==0.0.146
    python-dotenv==1.0.0

Let's open our main Python file and load our dependencies. I'm calling the app "ChatGPMe" (sorry, couldn't resist the pun...😁), but feel free to name it what you like. In this article, I have removed the type annotations for clarity, but the GitHub version contains the strongly typed version (I think it's good practice to add strong typing to Python code; I miss it from C#!).

    # dotenv is a library that allows us to securely load env variables
    from dotenv import load_dotenv

    # used to load an individual file (TextLoader) or multiple files (DirectoryLoader)
    from langchain.document_loaders import TextLoader, DirectoryLoader

    # used to split the text within documents and chunk the data
    from langchain.text_splitter import CharacterTextSplitter

    # use embeddings from OpenAI (but others are available)
    from langchain.embeddings import OpenAIEmbeddings

    # using the Chroma database to store our vector embeddings
    from langchain.vectorstores import Chroma

    # use this to configure the Chroma database
    from chromadb.config import Settings

    # we'll use the chain that allows Question and Answering and provides
    # the source of where it got the data from. This is useful if you have
    # multiple files. If you don't need the source, you can use RetrievalQA
    from langchain.chains import RetrievalQAWithSourcesChain

    # we'll use the OpenAI Chat model to interact with the embeddings.
    # This is the model that allows us to query in a similar way to ChatGPT
    from langchain.chat_models import ChatOpenAI

    # we'll need this for reading/storing from directories
    import os

You may notice that many of the LangChain module names above end in the plural. This is because LangChain is a framework for apps powered by language models, so it supports numerous different chains, database stores, chat models, and such, not just OpenAI/ChatGPT ones!
This opens up huge possibilities for running offline models, open-source models, and other great features.

We'll load the .env file using dotenv. This library makes it easier to work with environment files and helps keep secret keys and such out of your code. You could hardcode the API key directly in your file, but this way is more secure and generally considered good practice.

    # looks for the .env file and loads the variable(s)
    load_dotenv()

Excellent, we now have our dependencies and API key set up, so let's get to the fun bit!

Load the Files and Embeddings

Persisting the database is optional, but I found it worthwhile. By default, if you don't persist the database, it will be transient, which means that the database is deleted when your program ends and your documents have to be analyzed every time you run the program. For a small number of files that's fine, but it can quickly add to the loading time if you need to analyze multiple files on every run. So let's create a couple of variables we'll use to store the database in a folder.

    # get the absolute path of the directory containing this Python file
    FULL_PATH = os.path.dirname(os.path.abspath(__file__))

    # get the full path with a folder called "db" appended
    # this is where the database and index will be persisted
    DB_DIR = os.path.join(FULL_PATH, "db")

Let's load the file we want to query. I'm going to query Microsoft's Earnings Call transcript from Q2 2023, but feel free to load whatever document(s) you like.

    # use TextLoader for an individual file
    # explicitly stating the encoding is also recommended
    doc_loader = TextLoader('MSFT_Call_Transcript.txt', encoding="utf8")

    # if you want to load multiple files, place them in a directory
    # and use DirectoryLoader; comment above and uncomment below
    #doc_loader = DirectoryLoader('my_directory')

    # load the document
    document = doc_loader.load()

I'll only be using TextLoader, but the syntax is the same for DirectoryLoader, so you can do a drop-in replacement with the load() method.

We've loaded the files, but now we need to split the text into what are called chunks. Essentially, chunking groups words into "chunks" so that each group carries more meaning than the individual words. For example, take the sentence below in the context of a football (soccer) game:

"The striker scored a goal in the final minute of the game."

One possible way to chunk this sentence is:

Chunk 1: "The striker"
Chunk 2: "scored"
Chunk 3: "a goal in the final minute"
Chunk 4: "of the game."

However, notice that the time phrase "in the final minute of the game" is split across Chunk 3 and Chunk 4, so the two chunks share the "final minute" context. This is an example of chunk overlap. While this chunking still conveys the essential information of the sentence, it is not as precise as it could be. A better way to chunk the sentence would be:

Chunk 1: "The striker"
Chunk 2: "scored"
Chunk 3: "a goal"
Chunk 4: "in the final minute"
Chunk 5: "of the game"

In this revised version, there is no overlap between the chunks, and each chunk conveys a more distinct and specific idea. Ideally, when you chunk, you choose values that prevent chunk overlap. However, chunking is a whole topic of its own, so we'll leave it there. If you want to find out more, you can search for chunking in Natural Language Processing (NLP), where good chunking is critical to getting the most out of NLP models.

So, with the quick chunking detour above, let's split our document with 512 as the chunk size and 0 as the overlap - feel free to play with these depending on your document. (In LangChain's text splitter, chunk_size is the target maximum chunk length, measured in characters by default, and chunk_overlap is how many characters may be repeated between neighbouring chunks.)

    # obtain an instance of the splitter with the relevant parameters
    text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)

    # split the document data
    split_docs = text_splitter.split_documents(document)

We now want to load the OpenAI embeddings. An embedding essentially converts language as we use it into numerical values (vectors) so that a computer can work with the words and their relationships to other words. Words with similar meanings will have a similar representation.
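To make "similar representation" a little more concrete, here is a minimal sketch using tiny, hand-written vectors. These are not real OpenAI embeddings (which are learned and have far more dimensions); the words, the numbers, and the cosine_similarity helper are all assumptions made up for illustration. Cosine similarity is one common way to compare two vectors: values near 1 mean "pointing the same way", values near 0 mean "unrelated".

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-picked 3-dimensional "embeddings" (illustration only,
# not produced by any real model)
toy_embeddings = {
    "revenue": [0.9, 0.1, 0.0],
    "sales": [0.8, 0.2, 0.1],
    "striker": [0.0, 0.1, 0.9],
}

# Related words: the score is close to 1
print(cosine_similarity(toy_embeddings["revenue"], toy_embeddings["sales"]))

# Unrelated words: the score is close to 0
print(cosine_similarity(toy_embeddings["revenue"], toy_embeddings["striker"]))
```

With a real embedding model the idea is the same, only the vectors are produced by the model rather than typed in by hand.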
Like chunking, embedding is a huge topic, but here's a nice article on Word2Vec, which is one way to create word embeddings. Let's get back on track with using embeddings created by OpenAI.

    # load the embeddings from OpenAI
    openai_embeddings = OpenAIEmbeddings()

Simple! Let's now create our Chroma database to store these embeddings. Chroma was written from the ground up to be an AI-native database and works well with LangChain to quickly develop and iterate AI applications.

We'll start by configuring the parameters of the database:

    # configure our database
    client_settings = Settings(
        chroma_db_impl="duckdb+parquet",  # we'll store as parquet files/DuckDB
        persist_directory=DB_DIR,  # location to store
        anonymized_telemetry=False  # optional but showing how to toggle telemetry
    )

Now, let's create the actual vector store (i.e., the database storing our embeddings).

    # create a module-level variable for the vector store
    vector_store = None

    # check if the database exists already
    # if not, create it, otherwise read from the database
    if not os.path.exists(DB_DIR):
        # Create the database from the document(s) above and use the OpenAI
        # embeddings for the word to vector conversions. We also pass the
        # "persist_directory" parameter, which means this won't be a transient
        # database; it will be stored on the hard drive at the DB_DIR location.
        # We also pass the settings we created earlier and give the collection a name
        vector_store = Chroma.from_documents(split_docs, openai_embeddings, persist_directory=DB_DIR,
                                             client_settings=client_settings,
                                             collection_name="Transcripts_Store")

        # It's key to call the persist() method, otherwise it won't be saved
        vector_store.persist()
    else:
        # As the database already exists, load the collection from there
        vector_store = Chroma(collection_name="Transcripts_Store", persist_directory=DB_DIR,
                              embedding_function=openai_embeddings, client_settings=client_settings)

We now have our embeddings stored! The final step is to load our chain and start querying.
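Before moving on, it may help to see roughly what a vector store does when it is asked a question: it embeds the question, compares it against the stored chunk vectors, and returns the closest chunks. The sketch below is a toy stand-in, not how Chroma or OpenAI work internally. The toy_embed function is just a bag-of-words count, and the chunk sentences, function names, and k parameter are all made up for illustration.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, k=2):
    """Return the k chunks whose toy embeddings are most similar to the question."""
    question_vector = toy_embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(question_vector, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


# Invented example chunks (not taken from any real transcript)
chunks = [
    "cloud revenue grew strongly this quarter",
    "the striker scored a goal in the final minute",
    "operating expenses increased slightly",
]

# The chunk sharing the most words with the question is ranked first
print(retrieve("how much did cloud revenue grow", chunks, k=1))
```

The chunks returned by a step like this are what the chain passes to the language model as context, which is why the quality of the chunking and of the embeddings matters so much.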
Create the Chain and Query

LangChain, as the name implies, has many chains to use and experiment with. Chains essentially allow you to "chain" together multiple components, such as taking input data, formatting it into a prompt template, and then passing it to an LLM. You can create your own chains or, as I'm doing here, use pre-existing chains which cover common use cases. For our case, I'm going to use RetrievalQAWithSourcesChain. As the name implies, it also returns the source(s) used to obtain the answer. I'm doing this to show that the demo you see above is only using my document and not reaching out to the web for answers (shown by the Google question at the end).

    # create and configure our chain
    # we're using the ChatOpenAI LLM with the 'gpt-3.5-turbo' model
    # we're setting the temperature to 0. The higher the temperature, the more
    # 'creative' the answers. In my case, I want answers that are as factual
    # and as direct from the source as possible
    # 'stuff' is the default chain_type; it puts ("stuffs") all the retrieved
    # text into a single prompt
    # set the retriever to be our embeddings database
    qa_with_source = RetrievalQAWithSourcesChain.from_chain_type(
        llm=ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'),
        chain_type="stuff",
        retriever=vector_store.as_retriever()
    )

There are currently four chain types, but we're using the default one, 'stuff', which passes all the retrieved text to the model in one go. Other methods like map_reduce can help with batching documents so you don't surpass token limits, but that's a whole other topic.

We're almost there! Let's create a quick function that handles the answering of the question, and then create a loop for the user to ask questions of the document.

    # function to use our RetrievalQAWithSourcesChain
    def query_document(question):
        response = qa_with_source({"question": question})
        # return the response so the caller can read the answer and sources
        return response

    # loop through to allow the user to ask questions until they type in 'quit'
    while True:
        # make the user input yellow using ANSI codes
        print("What is your query? ", end="")
        user_query = input("\033[33m")
        print("\033[0m")
        if user_query == "quit":
            break
        response = query_document(user_query)
        # make the answer green and source blue using ANSI codes
        print(f'Answer: \033[32m{response["answer"]}\033[0m')
        print(f'\033[34mSources: {response["sources"]}\033[0m')

And that's it! I hope that gets you started in what is an exciting field of development. Please feel free to comment and provide feedback.

Improvements

This is just the tip of the iceberg! For me personally, automating and running this with preset prompts across transcripts from various companies can provide good insights to help with trading decisions. For those interested in the financial/trading aspects of AI, you might like to read my short post on BloombergGPT. There is so much potential for alternative data and fundamentals analysis; it's a very exciting field.

However, outside of that, it's also useful for your own personal files and organization/searching, and almost limitless other possibilities! Specifically, there are several improvements to be made. Here are a few:

- Offline: This is a big one and maybe a topic for another blog if there's interest. Your data is still sent to OpenAI unless you opt out or use the Azure version of the API, which has a stricter usage policy for your data. A great open-source project called Hugging Face has numerous models and datasets to get your AI projects up and running. LangChain also supports Hugging Face, so you could start experimenting with offline Hugging Face models in LangChain to run everything without the internet or API costs.
- Automate: Individually querying is useful, but some situations may require a large number of actions or sequential actions. This is where AutoGPT can come in.
- Chunking: I've hardcoded 512, and you may have seen messages saying that some of the chunks surpassed that. An improvement would be to use more dynamic chunking numbers tailored to the input documents.
- Token Management and Prompt Templates: Tokens are key to the API, and you can optimize them so that you don't spend unnecessary tokens in your API calls and still get the same results. This saves you money as you're using less of your limit, and it also allows more tailored prompts that provide better answers.

As I say, many more features can be explored, but this was my first foray into trying to utilize OpenAI models for my personal documents and trading data. A lot of documentation, bug tickets, and workaround reading was involved, so I hope I've saved you some time!

The full code can be found on my GitHub.

Enjoy :)