I'm sure you've all heard about ChatGPT by now. It's an amazing Large Language Model (LLM) system that is opening up exciting new capabilities. However, it's been trained on a huge corpus of text from across the internet. What if you want to query your own file or files?
Thanks to the simple (but powerful!) OpenAI API and the amazing work done by the team at LangChain, we can knock up a basic Question and Answering application that answers questions from your files.
This is all very new technology, so I'm also learning as I go along and am always open to hearing feedback and improvements I can make - feel free to comment!
The goal of the article is to get you started with Questions and Answering your own document(s). However, as described in the Improvements section below, various aspects can be optimized.
If there's enough interest, I can go into more detail about those topics in future articles.
Sound good? Let's get to it! (Full code is on my GitHub)
When dealing with API keys and secrets, I like to use environment variables for security. So in your directory, create a file called ".env" (note the full-stop/period at the beginning).
In the .env file, type OPENAI_API_KEY = '<your secret key from above>'
# [.env file]
OPENAI_API_KEY = 'sk-....' # enter your entire key here
If you're using Git, create a .gitignore file and add ".env" to it, as we don't want to accidentally commit this to our repo and leak our secret key! I've also added "db/", which will be our database folder. The database could contain personal document data, so we want to make sure that doesn't get committed either.
# [.gitignore file]
# prevent the .env file from being committed to your repo
.env
# our database folder; we don't want to commit it either
db/
Install all the required dependencies. Download the requirements.txt file from here and run.
pip3 install -r requirements.txt
Alternatively, you can manually use pip to install the dependencies below:
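Based on the imports we'll use in a moment, the core packages are roughly the following (a sketch; the exact pinned versions live in requirements.txt, and tiktoken is assumed here for OpenAI token counting):
pip3 install python-dotenv langchain openai chromadb tiktoken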
Let's open our main Python file and load our dependencies. I'm calling the app "ChatGPMe" (sorry, couldn't resist the pun...😁), but feel free to name it what you like.
In this article, I have removed the type annotations for clarity, but the GitHub version is strongly typed (I think it's good practice to add strong typing to Python code; I miss it from C#!).
# dotenv is a library that allows us to securely load env variables
from dotenv import load_dotenv
# used to load an individual file (TextLoader) or multiple files (DirectoryLoader)
from langchain.document_loaders import TextLoader, DirectoryLoader
# used to split the text within documents and chunk the data
from langchain.text_splitter import CharacterTextSplitter
# use embedding from OpenAI (but others available)
from langchain.embeddings import OpenAIEmbeddings
# using Chroma database to store our vector embeddings
from langchain.vectorstores import Chroma
# use this to configure the Chroma database
from chromadb.config import Settings
# we'll use the chain that allows Question and Answering and provides the
# source of where it got the data from. This is useful if you have multiple
# files. If you don't need the source, you can use RetrievalQA instead
from langchain.chains import RetrievalQAWithSourcesChain
# we'll use the OpenAI Chat model to interact with the embeddings;
# this is the model that allows us to query in a similar way to ChatGPT
from langchain.chat_models import ChatOpenAI
# we'll need this for reading/storing from directories
import os
You may notice that many of the LangChain libraries above end in the plural. This is because LangChain is a framework for apps powered by language models, so it allows numerous different chains, database stores, chat models, and such, not just OpenAI/ChatGPT ones!
This opens up huge possibilities for running offline models, open-source models, and other great features.
We'll load the .env file using dotenv. This library makes it easy to work with environment files and helps keep secret keys and the like out of your code. You could hardcode the API key directly in your file, but using environment variables is more secure and generally considered good practice.
# looks for the .env file and loads the variable(s)
load_dotenv()
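If you want to double-check that the key loaded (purely optional), you can peek at the environment variable; this snippet only prints the first few characters so you don't leak the full key:
# optional sanity check: confirm the key is present without printing it all
print(os.environ.get("OPENAI_API_KEY", "NOT SET")[:6] + "...")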
Excellent, we now have our dependencies and API key set up, let's get to the fun bit!
This is optional, but I found it worthwhile. By default, if you don't persist the database, it will be transient, which means the database is deleted when your program ends, and your documents will have to be analyzed every time you run the program.
For a small number of files that's fine, but it can quickly add to the loading time if you need to analyze multiple files every time you run the app. So let's create a couple of variables we'll use to store the database in a folder.
# get the absolute path of this Python file
FULL_PATH = os.path.dirname(os.path.abspath(__file__))
# get the full path with a folder called "db" appended
# this is where the database and index will be persisted
DB_DIR = os.path.join(FULL_PATH, "db")
Let's load the file we want to query. I'm going to query Microsoft's Earnings Call transcript from Q2 2023, but feel free to load whatever document(s) you like.
# use TextLoader for an individual file
# explicitly stating the encoding is also recommended
doc_loader = TextLoader('MSFT_Call_Transcript.txt', encoding="utf8")
# if you want to load multiple files, place them in a directory
# and use DirectoryLoader; comment above and uncomment below
#doc_loader = DirectoryLoader('my_directory')
# load the document
document = doc_loader.load()
I'll only be using TextLoader, but the syntax is the same for DirectoryLoader, so you can do a drop-in replacement with the load() method.
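For example, to load every text file in a folder while keeping the same encoding handling, something along these lines should work (the folder name and glob pattern are just placeholders):
# hypothetical: load all .txt files in "my_directory" via TextLoader
doc_loader = DirectoryLoader('my_directory', glob='**/*.txt', loader_cls=TextLoader)
document = doc_loader.load()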
We've loaded the files, but now we need to split the text into what are called chunks. Essentially, chunking lets you group words into "chunks" so that more of the meaning of a sentence is kept together. For example, take the sentence below in the context of a football (soccer) game:
"The striker scored a goal in the final minute of the game."
One possible way to chunk this sentence is:
Chunk 1: "The striker scored"
Chunk 2: "a goal in the"
Chunk 3: "final minute"
Chunk 4: "final minute of the game."
However, notice that Chunk 3 and Chunk 4 share the words "final minute" contextually. This is an example of chunk overlap. While this chunking still conveys the essential information of the sentence, it is not as precise as it could be. A better way to chunk the sentence would be:
Chunk 1: "The striker scored a goal"
Chunk 2: "in the final minute of the game."
In this revised version, there is no overlap between the chunks, and each chunk conveys a more distinct and specific idea. Ideally, when you chunk, you choose values that prevent chunk overlap.
However, chunking is a whole topic of its own, so we’ll leave it there. If you want to find out more, you can search for chunking in Natural Language Processing (NLP) where good chunking is critical to the optimum usage of NLP models.
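To see overlap in action before we split the real document, here's a minimal sketch using LangChain's CharacterTextSplitter with deliberately tiny, arbitrary chunk values:
# split our example sentence on spaces into ~20-character chunks,
# with up to 10 characters of overlap between consecutive chunks
sentence = "The striker scored a goal in the final minute of the game."
demo_splitter = CharacterTextSplitter(separator=" ", chunk_size=20, chunk_overlap=10)
for chunk in demo_splitter.split_text(sentence):
    print(chunk)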
So, with the quick chunking detour above, let's split our document with 512 as a chunk size and 0 as the overlap - feel free to play with these depending on your document.
# obtain an instance of the splitter with the relevant parameters
text_splitter = CharacterTextSplitter(chunk_size=512, chunk_overlap=0)
# split the document data
split_docs = text_splitter.split_documents(document)
We now want to load the OpenAI embeddings. An embedding essentially converts language as we use it into numerical values (vectors) so that a computer can understand words and their relationships to other words. Words with similar meanings will have a similar representation.
Like chunking, embedding is a huge topic, but here's a nice article on Word2Vec, which is one way to create word embeddings. Let's get back on track with using embeddings created by OpenAI.
# load the embeddings from OpenAI
openai_embeddings = OpenAIEmbeddings()
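If you're curious what an embedding actually looks like, here's a quick, optional sanity check (the 1536 dimensions apply to OpenAI's default text-embedding-ada-002 model; note this call does hit the API):
# embed a single word and inspect the resulting vector
vec = openai_embeddings.embed_query("striker")
print(len(vec))  # 1536 for text-embedding-ada-002
print(vec[:5])   # the first few components of the vector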
Simple! Let's now create our Chroma database to store these embeddings. Chroma was written from the ground up to be an AI-native database and works well with LangChain to quickly develop and iterate AI applications.
We'll start by configuring the parameters of the database.
# configure our database
client_settings = Settings(
    chroma_db_impl="duckdb+parquet",  # we'll store as parquet files via DuckDB
    persist_directory=DB_DIR,  # location to store the database
    anonymized_telemetry=False  # optional, but showing how to toggle telemetry
)
Now, let's create the actual vector store (i.e., the database storing our embeddings).
# create a variable for the vector store
vector_store = None
# check if the database exists already
# if not, create it, otherwise read from the database
if not os.path.exists(DB_DIR):
    # create the database from the document(s) above, using the OpenAI
    # embeddings for the word-to-vector conversions. We also pass the
    # "persist_directory" parameter, which means this won't be a transient
    # database; it will be stored on the hard drive at the DB_DIR location.
    # We also pass the settings we created earlier and give the collection a name
    vector_store = Chroma.from_documents(split_docs, openai_embeddings,
                                         persist_directory=DB_DIR,
                                         client_settings=client_settings,
                                         collection_name="Transcripts_Store")
    # it's key to call the persist() method, otherwise the data won't be saved
    vector_store.persist()
else:
    # as the database already exists, load the collection from there
    vector_store = Chroma(collection_name="Transcripts_Store",
                          persist_directory=DB_DIR,
                          embedding_function=openai_embeddings,
                          client_settings=client_settings)
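Before wiring up a chain, you can optionally sanity-check the store with a raw similarity search (the query below is just a hypothetical example):
# fetch the 4 chunks most similar to a test query
hits = vector_store.similarity_search("What was the revenue this quarter?", k=4)
print(hits[0].page_content)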
We now have our embeddings stored! The final step is to load our chain and start querying.
LangChain, as the name implies, has many chains to use and experiment with. Chains essentially allow you to "chain" together multiple components, such as taking input data, formatting it with a prompt template, and then passing it to an LLM.
You can create your own chains or, as I'm doing here, use pre-existing chains which cover common use cases.
For our case, I'm going to use RetrievalQAWithSourcesChain. As the name implies, it also returns the source(s) used to obtain the answer.
I'm doing this to show that the demo you see above is only using my document and not reaching out to the web for answers (shown by the Google question at the end).
# create and configure our chain
# we're using the ChatOpenAI LLM with the 'gpt-3.5-turbo' model
# we're setting the temperature to 0; the higher the temperature, the more
# 'creative' the answers. In my case, I want answers that are as factual and
# as direct from the source as possible
# 'stuff' is the default chain_type, which uses all the data from the document
# set the retriever to be our embeddings database
qa_with_source = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'),
    chain_type="stuff",
    retriever=vector_store.as_retriever()
)
There are currently four chain types; we're using the default, 'stuff', which uses the entire document in one go. Other methods, like map_reduce, can help with batching documents so you don't surpass token limits, but that's a whole other topic.
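If you do run into token limits, a sketch of the same chain with map_reduce swapped in (everything else unchanged) might look like this:
# hypothetical: map_reduce processes chunks in batches and then
# combines the results, at the cost of extra LLM calls
qa_map_reduce = RetrievalQAWithSourcesChain.from_chain_type(
    llm=ChatOpenAI(temperature=0, model_name='gpt-3.5-turbo'),
    chain_type="map_reduce",
    retriever=vector_store.as_retriever()
)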
We're almost there! Let's create a quick function that handles answering the question, and then create a loop so the user can keep asking questions of the document.
# function to use our RetrievalQAWithSourcesChain
def query_document(question):
    response = qa_with_source({"question": question})
    return response

# loop to allow the user to ask questions until they type 'quit'
while True:
    # make the user input yellow using ANSI codes
    print("What is your query? ", end="")
    user_query = input("\033[33m")
    print("\033[0m")
    if user_query == "quit":
        break
    response = query_document(user_query)
    # make the answer green and the sources blue using ANSI codes
    print(f'Answer: \033[32m{response["answer"]}\033[0m')
    print(f'\033[34mSources: {response["sources"]}\033[0m')
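To try it out (assuming you saved the script as ChatGPMe.py), run it, type your questions, and enter 'quit' to exit:
python3 ChatGPMe.py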
And that's it! I hope that gets you started in what is an exciting field of development. Please feel free to comment and provide feedback.
This is just the tip of the iceberg! For me personally, automating and running this with preset prompts across transcripts from various companies can provide good insights to help with trading decisions.
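As a taste of that, here's a hypothetical sketch reusing the query_document() function from above with a preset list of questions (the questions themselves are just examples):
# hypothetical: run a fixed set of questions against the transcript store
preset_questions = [
    "What was the total revenue this quarter?",
    "What guidance was given for the next quarter?",
]
for q in preset_questions:
    result = query_document(q)
    print(f'{q} -> {result["answer"]}')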
For those interested in the financial/trading aspects of AI, you might like to read my short post on BloombergGPT. There is so much potential for alternative data and fundamentals analysis, it's a very exciting field.
However, outside of that, it's also useful for your own personal files and organization/searching and almost limitless other possibilities!
As mentioned at the start, there are several improvements to be made; here are a few:
Offline - This is a big one and maybe a topic for another blog if there's interest. Your data is still sent to OpenAI unless you opt out or use the Azure version of the API, which has a stricter usage policy for your data.
Hugging Face, a great open-source AI community and hub, has numerous models and datasets to get your AI projects up and running.
LangChain also supports Hugging Face, so you could start experimenting with using offline Hugging Face models with LangChain to run everything without the internet or API costs.
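As a starting point, something like this should work for local embeddings (a sketch; it assumes the sentence-transformers package is installed, and the model name is just a common default):
# hypothetical: swap the OpenAI embeddings for a local Hugging Face model
from langchain.embeddings import HuggingFaceEmbeddings
local_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
# then pass local_embeddings wherever we used openai_embeddings above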
Automate - Querying individually is useful, but some situations may require a large number of actions or sequential actions. This is where AutoGPT can come in.
As I say, many more features can be explored, but this was my first foray into trying to utilize OpenAI models for my personal documents and trading data. A lot of documentation, bug tickets, and workaround reading was involved, so I hope I've saved you some time!
The full code can be found on my GitHub.
Enjoy :)