Hello everyone! Recently, I applied an interesting solution in my practice that I had wanted to try for a long time, and now I'm ready to explain how you can build something similar for any other task. We will be talking about creating a customized version of ChatGPT that answers questions while taking into account a large knowledge base, one whose length is not limited by the prompt size (meaning you couldn't simply prepend all the information to each question you send to ChatGPT).
To achieve this, we will use contextual embeddings from OpenAI (for a truly high-quality search for relevant questions in the knowledge base) and the ChatGPT API itself (to format the answers in natural human language).
Additionally, the assistant is assumed to answer not only the questions stated explicitly in the Q&A, but also questions that a person familiar with the Q&A could answer. If you're interested in learning how to create simple bots that respond using a large knowledge base, welcome to the details.
I would like to point out that there are some library projects that try to solve this task as a framework, for example LangChain, and I tried using it as well. However, like any framework at an early stage of development, in some cases it tends to limit rather than simplify things. In particular, from the very beginning of this task I understood what I wanted to do with the data and knew how to do it myself (including context-based search, setting the correct context in prompts, and combining sources of information).
But I couldn't configure the framework to do exactly that with an acceptable level of quality, and debugging the framework felt like overkill for this task. In the end, I created my own boilerplate code and was satisfied with that approach.
Let me briefly describe the task I was working on, and you can use the same code in your own tasks, replacing the data sources and prompts with the ones that suit you. You will still have full control over the bot's logic.
When writing code, I often use ChatGPT (and I'm not ashamed of it🙂). However, due to the lack of training data from 2022 onward, there are sometimes problems with relatively new technologies.
In particular, when developing subgraphs for The Graph protocol (the most popular way to build ETL for retrieving indexed data from EVM-compatible blockchains, you can read more about it in my previous articles [1] and [2]), the libraries themselves have undergone several breaking compatibility changes. The "old" answers from ChatGPT are no longer helpful, and I have to search for the correct answers either in the scarce documentation or, worst case, in the developers' Discord, which is not very convenient (it's not like StackOverflow).
The second part of the problem is that you need to provide the conversation context correctly every time, because ChatGPT often veers off the topic of subgraphs, jumping to GraphQL, SQL, or higher mathematics ("The Graph", "subgraphs", etc. are not unique terms and have many different interpretations and topics).
Therefore, after a short period of struggling with ChatGPT to correct errors in subgraph code, I decided to create my own SubgraphGPT bot, which will always be in the right context and try to answer, taking into account the knowledge base and messages from the developers' Discord.
PS. I work as a lead product manager at chainstack.com, a Web3 infrastructure provider, and I am responsible for the development of the subgraph hosting service. So I have to work with subgraphs quite a lot, helping users understand this relatively new technology.
In the end, to solve this problem, I decided to use two sources:
A manually compiled knowledge base of questions and answers, selected in semi-blind mode (often I took the topic title from the documentation as the question, and the entire paragraph of information as the answer).
Exported messages from the protocol developers' Discord from the past 2 years (to cover the missing period from the end of 2021).
Next, different approaches were used for each source to compose a request to the ChatGPT API, specifically:
For the manually compiled Q&A:
for each question, a contextual embedding is generated (a vector representing this question in a multidimensional space), obtained through the text-embedding-ada-002 model,
then, using a cosine distance search function, the top 3 most similar questions from the knowledge base are found (instead of 3, the most suitable number for your dataset can be used),
the answers to these 3 questions are added to the final prompt with an approximate description of "Use this Q&A snippet only if it is relevant to the given question."
For the messages exported from Discord, the following algorithm was used:
for each message containing a question mark, a contextual embedding is also generated (using the same model),
then, in a similar way, the top 5 most similar questions are selected,
and as context for the answer, the 20 messages following that question are added, which are assumed to have a certain probability of containing the answer to the question,
and this information was added to the final prompt approximately like this: "If you did not find an explicit answer to the question in the attached Q&A snippet, the following chat fragments by the developer may be useful to you for answering the original question ..."
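Both retrieval steps above rely on cosine similarity between embeddings. For reference, here is a minimal numpy sketch of that search; the bot code later in the article uses the equivalent `cosine_similarity` helper from the openai library, so this is illustrative rather than the production code:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k_similar(query_embedding, kb_embeddings, k=3):
    """Return the indices of the k knowledge-base entries most similar to the query."""
    sims = [cosine_similarity(query_embedding, e) for e in kb_embeddings]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
```

The same idea scales to thousands of rows without any vector database, which is exactly why a plain CSV of embeddings is enough for this task.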
Furthermore, if the topic is not explicitly given, the presence of Q&A snippets and chats can lead to ambiguity in the answers, which may look, for example, as follows:
In other words, the model sees that the question was asked detached from any context and treats the retrieved answer as detached from context as well. After being told that such data may be used, it summarizes it as follows:
To avoid this, we introduce the concept of a topic, which is explicitly defined and inserted at the beginning of the prompt as:
"I need to get an answer to a question related to the topic 'The Graph subgraph development': {{{what is a subgraph?}}}"
Furthermore, in the last sentence, I also add this:
Finally, only if the above information is not sufficient, you can use your knowledge in the topic 'The Graph subgraph development' to answer the question.
In the end, the complete prompt (excluding the part obtained from chats) looks as follows:
==I need to get an answer to the question related to the topic of "The Graph subgraph development": {{{what is a subgraph?}}}.==
==Possibly, you might find an answer in these Q&As \[use the information only if it is actually relevant and useful for the question answering\]:==
==Q: <What is a subgraph?>==
==A: <A subgraph is a custom API built on blockchain data. Subgraphs are queried using the GraphQL query language and are deployed to a Graph Node using the Graph CLI. Once deployed and published to The Graph's decentralized network, Indexers process subgraphs and make them available to be queried by subgraph consumers.>==
==Q: <Am I still able to create a subgraph if my smart contracts don't have events?>==
==A: <It is highly recommended that you structure your smart contracts to have events associated with data you are interested in querying. Event handlers in the subgraph are triggered by contract events and are by far the fastest way to retrieve useful data. If the contracts you are working with do not contain events, your subgraph can use call and block handlers to trigger indexing. Although this is not recommended, as performance will be significantly slower.>==
==Q: <How do I call a contract function or access a public state variable from my subgraph mappings?>==
==A: <Take a look at Access to smart contract state inside the section AssemblyScript API. https://thegraph.com/docs/en/developing/assemblyscript-api/>==
==Finally, only if the information above was not enough you can use your knowledge in the topic of "The Graph subgraph development" to answer the question.==
The response to the above request with this semi-auto-generated prompt looks correct right from the start:
In this case, the bot immediately responds in the right vein and adds more relevant information, so the answer doesn't look like a verbatim copy from the Q&A (I remind you that this exact question is in the list of questions and answers), but includes reasonable explanations that partly address follow-up questions.
I should note right away that there will be a link to the repository at the end, so you can run the bot as is, replacing "topic" with your own, the Q&A knowledge base file with your own, and providing your own API keys for OpenAI and the Telegram bot. So the description here is not intended to fully correspond to the source code on GitHub, but rather to highlight the main aspects of the code.
Let's create a new virtual environment and install the dependencies from requirements.txt:
virtualenv -p python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt
As mentioned above, it is assumed that there is a list of questions and answers, in this case in the format of an Excel file of the following type:
In order to find the most similar question to the given one, we need to add an embedding of the question (a multidimensional vector in state space) to each line of this file. We will use the add_embeddings.py file for this. The script consists of several simple parts.
Importing libraries and reading command line arguments:
import pandas as pd
import openai
import argparse
# Create an Argument Parser object
parser = argparse.ArgumentParser(description='Adding embeddings for each line of csv file')
# Add the arguments
parser.add_argument('--openai_api_key', type=str, help='API KEY of OpenAI API to create contextual embeddings for each line')
parser.add_argument('--file', type=str, help='A source CSV file with the text data')
parser.add_argument('--colname', type=str, help='Column name with the texts')
# Parse the command-line arguments
args = parser.parse_args()
# Access the argument values
openai.api_key = args.openai_api_key
file = args.file
colname = args.colname
Next, we read the file into a pandas dataframe and filter the questions by the presence of a question mark. This code snippet is shared between handling the knowledge base and the raw message stream from Discord, so, assuming that questions are usually phrased with a question mark, I kept this simple method of rough non-question filtering.
if file[-4:] == '.csv':
    df = pd.read_csv(file)
else:
    df = pd.read_excel(file)

# filter NAs
df = df[~df[colname].isna()]
# keep only questions (rough filter: a question mark in the text)
df = df[df[colname].str.contains(r'\?')]
And finally, a function that generates an embedding by calling the API of the text-embedding-ada-002 model, with a couple of retries since the API can occasionally be overloaded and respond with an error, applied to each row of the dataframe.
def get_embedding(text, model="text-embedding-ada-002"):
    i = 0
    max_try = 3
    # to avoid random OpenAI API fails:
    while i < max_try:
        try:
            text = text.replace("\n", " ")
            result = openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
            return result
        except Exception:
            i += 1

def process_row(x):
    return get_embedding(x, model='text-embedding-ada-002')

df['ada_embedding'] = df[colname].apply(process_row)
df.to_csv(file[:-4] + '_question_embed.csv', index=False)
In the end, this script can be called with the following command:
python add_embeddings.py \
--openai_api_key="xxx" \
--file="./subgraphs_faq.xlsx" \
--colname="Question"
specifying the OpenAI API key, the file with the knowledge base, and the name of the column containing the question text. The final file, subgraphs_faq._question_embed.csv, contains the columns "Question", "Answer", and "ada_embedding".
If you are interested in a simple bot that responds based on manually collected knowledge base only, you can skip this and the following section. However, I will briefly provide code examples here for collecting data from both a Discord channel and a Telegram group. The file discord-channel-data-collection.py consists of two parts. The first part includes importing libraries and initializing command line arguments:
import requests
import json
import pandas as pd
import argparse
# Create an Argument Parser object
parser = argparse.ArgumentParser(description='Discord Channel Data Collection Script')
# Add the arguments
parser.add_argument('--channel_id', type=str, help='Channel ID from the URL of a channel in browser https://discord.com/channels/xxx/{CHANNEL_ID}')
parser.add_argument('--authorization_key', type=str, help='Authorization Key. Being on the discord channel page, start typing anything, then open developer tools -> Network -> Find "typing" -> Headers -> Authorization.')
# Parse the command-line arguments
args = parser.parse_args()
# Access the argument values
channel_id = args.channel_id
authorization_key = args.authorization_key
The second is the function for retrieving data from the channel and saving it into a pandas dataframe, as well as its call with specified parameters.
def retrieve_messages(channel_id, authorization_key):
    num = 0
    limit = 100
    headers = {
        'authorization': authorization_key
    }
    last_message_id = None
    # Create a pandas DataFrame
    df = pd.DataFrame(columns=['id', 'dt', 'text', 'author_id', 'author_username', 'is_bot', 'is_reply', 'id_reply'])
    while True:
        query_parameters = f'limit={limit}'
        if last_message_id is not None:
            query_parameters += f'&before={last_message_id}'
        r = requests.get(
            f'https://discord.com/api/v9/channels/{channel_id}/messages?{query_parameters}', headers=headers
        )
        jsonn = json.loads(r.text)
        if len(jsonn) == 0:
            break
        for value in jsonn:
            is_reply = False
            id_reply = '0'
            if 'message_reference' in value and value['message_reference'] is not None:
                if 'message_id' in value['message_reference'].keys():
                    is_reply = True
                    id_reply = value['message_reference']['message_id']
            text = value['content']
            if 'embeds' in value.keys():
                if len(value['embeds']) > 0:
                    for x in value['embeds']:
                        if 'description' in x.keys():
                            if text != '':
                                text += ' ' + x['description']
                            else:
                                text = x['description']
            df_t = pd.DataFrame({
                'id': value['id'],
                'dt': value['timestamp'],
                'text': text,
                'author_id': value['author']['id'],
                'author_username': value['author']['username'],
                'is_bot': value['author']['bot'] if 'bot' in value['author'].keys() else False,
                'is_reply': is_reply,
                'id_reply': id_reply,
            }, index=[0])
            if len(df) == 0:
                df = df_t.copy()
            else:
                df = pd.concat([df, df_t], ignore_index=True)
            last_message_id = value['id']
            num = num + 1
        print('number of messages we collected is', num)
        # Save DataFrame to a CSV file (progressively, after each page)
        df.to_csv(f'../discord_messages_{channel_id}.csv', index=False)

if __name__ == '__main__':
    retrieve_messages(channel_id, authorization_key)
One useful detail here, which I can never find when I need it, is obtaining the authorization key. While the channel_id can be taken from the URL of the Discord channel opened in the browser (the last long number in the link), the authorization_key can only be found by starting to type a message in the channel, then using developer tools to find the event named "typing" in the Network section and extracting the parameter from the request header.
After receiving these parameters, you can run the following command to collect all messages from the channel (substitute your own values):
python discord-channel-data-collection.py \
--channel_id=123456 \
--authorization_key="123456qwerty"
Since I often download various data from chats/channels in Telegram, I also decided to provide code for this, which generates a similar format (compatible in terms of the add_embeddings.py script) CSV file. So, the telegram-group-data-collection.py script looks as follows. Importing libraries and initializing arguments from the command line:
import pandas as pd
import argparse
from telethon import TelegramClient
# Create an Argument Parser object
parser = argparse.ArgumentParser(description='Telegram Group Data Collection Script')
# Add the arguments
parser.add_argument('--app_id', type=int, help='Telegram APP id from https://my.telegram.org/apps')
parser.add_argument('--app_hash', type=str, help='Telegram APP hash from https://my.telegram.org/apps')
parser.add_argument('--phone_number', type=str, help='Telegram user phone number with the leading "+"')
parser.add_argument('--password', type=str, help='Telegram user password')
parser.add_argument('--group_name', type=str, help='Telegram group public name without "@"')
parser.add_argument('--limit_messages', type=int, help='Number of last messages to download')
# Parse the command-line arguments
args = parser.parse_args()
# Access the argument values
app_id = args.app_id
app_hash = args.app_hash
phone_number = args.phone_number
password = args.password
group_name = args.group_name
limit_messages = args.limit_messages
As you can see, you cannot simply download all the messages from a chat without authorizing yourself as a real user. In other words, besides creating an app through https://my.telegram.org/apps (obtaining APP_ID and APP_HASH), you will also need your phone number and password to create an instance of the TelegramClient class from the Telethon library.
Additionally, you will need the public group_name of the Telegram chat and must explicitly specify the number of latest messages to retrieve. Overall, I have done this procedure many times with any number of exported messages without receiving any temporary or permanent bans from the Telegram API, unlike what happens when you send messages too frequently from one account.
The second part of the script contains the actual function for exporting messages and its execution (with necessary filtering to avoid critical errors that would stop the collection halfway):
async def main():
    messages = await client.get_messages(group_name, limit=limit_messages)
    df = pd.DataFrame(columns=['date', 'user_id', 'raw_text', 'views', 'forwards', 'text', 'chan', 'id'])
    for m in messages:
        if m is not None:
            if 'from_id' in m.__dict__.keys():
                if m.from_id is not None:
                    if 'user_id' in m.from_id.__dict__.keys():
                        df = pd.concat([df, pd.DataFrame([{'date': m.date, 'user_id': m.from_id.user_id, 'raw_text': m.raw_text, 'views': m.views,
                                                           'forwards': m.forwards, 'text': m.text, 'chan': group_name, 'id': m.id}])], ignore_index=True)
    df = df[~df['user_id'].isna()]
    df = df[~df['text'].isna()]
    df['date'] = pd.to_datetime(df['date'])
    df = df.sort_values('date').reset_index(drop=True)
    df.to_csv(f'../telegram_messages_{group_name}.csv', index=False)

client = TelegramClient('session', app_id, app_hash)
client.start(phone=phone_number, password=password)
with client:
    client.loop.run_until_complete(main())
In the end, this script can be executed with the following command (replace the values with your own):
python telegram-group-data-collection.py \
--app_id=123456 --app_hash="123456qwerty" \
--phone_number="+xxxxxx" --password="qwerty123" \
--group_name="xxx" --limit_messages=10000
Most of the time, I wrap my pet projects into Telegram bots because it requires minimal effort to launch and immediately shows potential. In this case, I did the same. I must say that the bot code does not contain all the corner cases that I use in the production version of the SubgraphGPT bot, as it has quite a lot of logic inherited from another pet project of mine. Instead, I left the minimum amount of basic code that should be easy to modify for your needs.
The telegram-bot.py script consists of several parts. First, as before, libraries are imported and command line arguments are initialized.
import threading
import telegram
from telegram.ext import Updater, CommandHandler, MessageHandler, Filters
import openai
from openai.embeddings_utils import cosine_similarity
import numpy as np
import pandas as pd
import argparse
import functools
# Create an Argument Parser object
parser = argparse.ArgumentParser(description='Run the bot which uses prepared knowledge base enriched with contextual embeddings')
# Add the arguments
parser.add_argument('--openai_api_key', type=str, help='API KEY of OpenAI API to create contextual embeddings for each line')
parser.add_argument('--telegram_bot_token', type=str, help='A telegram bot token obtained via @BotFather')
parser.add_argument('--file', type=str, help='A source CSV file with the questions, answers and embeddings')
parser.add_argument('--topic', type=str, help='Write the topic to add a default context for the bot')
parser.add_argument('--start_message', type=str, help="The text that will be shown to the users after they click /start button/command", default="Hello, World!")
parser.add_argument('--model', type=str, help='A model of ChatGPT which will be used', default='gpt-3.5-turbo-16k')
parser.add_argument('--num_top_qa', type=int, help="The number of top similar questions' answers to use as context", default=3)
# Parse the command-line arguments
args = parser.parse_args()
# Access the argument values
openai.api_key = args.openai_api_key
token = args.telegram_bot_token
file = args.file
topic = args.topic
model = args.model
num_top_qa = args.num_top_qa
start_message = args.start_message
Please note that you will also need an OpenAI API key here: to find the knowledge-base question most similar to the one just entered by the user, you first have to obtain the embedding of that question by calling the API, just as we did for the knowledge base itself.
In addition, you will need a Telegram bot token (obtained via @BotFather), the CSV file with the knowledge base and embeddings, and the topic string for the default context.
Then follows the loading of the knowledge base file and the initialization of the question embeddings.
# reading QA file with embeddings
df_qa = pd.read_csv(file)
df_qa['ada_embedding'] = df_qa.ada_embedding.apply(eval).apply(np.array)
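A note on `eval` here: since the embeddings were saved to CSV as stringified Python lists, `eval` works, but `json.loads` parses the same strings without executing arbitrary code. A sketch of the drop-in alternative, assuming the same CSV layout (the tiny stand-in dataframe below is just for illustration):

```python
import json
import numpy as np
import pandas as pd

# a tiny stand-in frame with one stringified embedding, as pandas.to_csv stores it
df_qa = pd.DataFrame({'ada_embedding': ['[0.1, 0.2, 0.3]']})

# same result as .apply(eval).apply(np.array), but without executing arbitrary code
df_qa['ada_embedding'] = df_qa.ada_embedding.apply(lambda s: np.array(json.loads(s)))
```

This matters if you ever feed the bot a knowledge-base file you didn't generate yourself.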
To make a request to the ChatGPT API, knowing that it sometimes responds with an error due to overload, I use a function with automatic request retry in case of an error.
def retry_on_error(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        max_retries = 3
        last_exception = None
        for i in range(max_retries):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                last_exception = e
                print(f"Error occurred, retrying ({i+1}/{max_retries} attempts)...")
        # If all retries failed, raise the last exception
        raise last_exception
    return wrapper

@retry_on_error
def call_chatgpt(*args, **kwargs):
    return openai.ChatCompletion.create(*args, **kwargs)
According to OpenAI's recommendation, before converting the text into embeddings, new lines should be replaced with spaces.
def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
To search for the most similar questions, we calculate the cosine similarity between the embeddings of two questions, using the helper taken directly from the openai library.
def search_similar(df, question, n=3, pprint=True):
    embedding = get_embedding(question, model='text-embedding-ada-002')
    df['similarities'] = df.ada_embedding.apply(lambda x: cosine_similarity(x, embedding))
    res = df.sort_values('similarities', ascending=False).head(n)
    return res
After receiving a list of the most similar question-answer pairs to the given one, you can compile them into one text, marking it in a way that ChatGPT can unambiguously determine what is what.
def collect_text_qa(df):
    text = ''
    for i, row in df.iterrows():
        text += 'Q: <' + row['Question'] + '>\nA: <' + row['Answer'] + '>\n\n'
    print('len qa', len(text.split(' ')))
    return text
After that, we need to gather the "pieces" of the prompt described at the very beginning of the article into one whole.
def collect_full_prompt(question, qa_prompt, chat_prompt=None):
    prompt = f'I need to get an answer to the question related to the topic of "{topic}": ' + "{{{" + question + "}}}. "
    prompt += '\n\nPossibly, you might find an answer in these Q&As [use the information only if it is actually relevant and useful for the question answering]: \n\n' + qa_prompt
    # edit if you need to use this also
    if chat_prompt is not None:
        prompt += "---------\nIf you didn't find a clear answer in the Q&As, possibly, these talks from chats might be helpful to answer properly [use the information only if it is actually relevant and useful for the question answering]: \n\n" + chat_prompt
    prompt += f'\nFinally, only if the information above was not enough you can use your knowledge in the topic of "{topic}" to answer the question.'
    return prompt
In this case, I removed the part using messages from Discord, but you can still follow the logic if chat_prompt != None.
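To illustrate that logic, here is a hedged sketch of how a chat_prompt could be assembled from the exported Discord messages, following the algorithm described earlier (each matched question plus the ~20 messages after it). The function name `collect_text_chat`, the argument shapes, and the window size are my assumptions for this sketch, not code from the production bot:

```python
def collect_text_chat(messages, question_indices, window=20):
    """Sketch: glue each matched question together with the `window`
    messages that follow it (they may contain the answer).

    messages         - list of message texts, ordered by time
    question_indices - positions of the most similar questions in `messages`
    """
    fragments = []
    for idx in question_indices:
        # the question itself plus the messages that follow it
        chunk = messages[idx: idx + window + 1]
        fragments.append('\n'.join(chunk))
    # separate the fragments so the model can tell them apart
    return '\n\n---\n\n'.join(fragments)
```

The resulting string can then be passed as the chat_prompt argument of collect_full_prompt above.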
In addition, we will need a function that splits the response received from the ChatGPT API into Telegram messages (no more than 4096 characters):
def telegram_message_format(text):
    max_message_length = 4096
    if len(text) > max_message_length:
        parts = []
        while len(text) > max_message_length:
            parts.append(text[:max_message_length])
            text = text[max_message_length:]
        parts.append(text)
        return parts
    else:
        return [text]
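The hard 4096-character cut above can split a word or a code block in half. A slightly gentler variant (my own tweak, not part of the original bot) prefers the last newline or space before the limit:

```python
def telegram_message_format_soft(text, max_message_length=4096):
    """Split text into Telegram-sized parts, preferring newline/space boundaries."""
    parts = []
    while len(text) > max_message_length:
        # look for the last newline, then the last space, before the limit
        cut = text.rfind('\n', 0, max_message_length)
        if cut <= 0:
            cut = text.rfind(' ', 0, max_message_length)
        if cut <= 0:
            cut = max_message_length  # no boundary found, fall back to a hard cut
        parts.append(text[:cut])
        text = text[cut:].lstrip()
    parts.append(text)
    return parts
```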
The bot starts with a typical sequence of steps, assigning two functions to be triggered by the /start command and receiving a personal message from the user:
# note: start and message_handler are defined below;
# in the actual script they must be defined before this block runs
bot = telegram.Bot(token=token)
updater = Updater(token=token, use_context=True)
dispatcher = updater.dispatcher
dispatcher.add_handler(CommandHandler("start", start, filters=Filters.chat_type.private))
dispatcher.add_handler(MessageHandler(~Filters.command & Filters.text, message_handler))
updater.start_polling()
The code to respond to /start is straightforward:
def start(update, context):
    user = update.effective_user
    context.bot.send_message(chat_id=user.id, text=start_message)
Responding to a free-form message is less straightforward.
First, to avoid blocking the bot for other users, let's immediately "detach" each request into an independent thread using the threading library.
def message_handler(update, context):
    thread = threading.Thread(target=long_running_task, args=(update, context))
    thread.start()
Secondly, all the logic will happen inside the long_running_task function. I intentionally wrapped the main fragments in try/except to easily localize errors when modifying the bot's code.
def long_running_task(update, context):
    user = update.effective_user
    context.bot.send_message(chat_id=user.id, text='🕰️⏰🕙⏱️⏳...')

    try:
        question = update.message.text.strip()
    except Exception as e:
        context.bot.send_message(chat_id=user.id,
                                 text="🤔It seems like you're sending not text to the bot. Currently, the bot can only work with text requests.")
        return

    try:
        qa_found = search_similar(df_qa, question, n=num_top_qa)
        qa_prompt = collect_text_qa(qa_found)
        full_prompt = collect_full_prompt(question, qa_prompt)
    except Exception as e:
        context.bot.send_message(chat_id=user.id,
                                 text="Search failed. Debug needed.")
        return
Since there may be errors when replacing the knowledge base and topic with your own, for example, due to formatting, a human-readable error is displayed.
Next, the request is sent to the ChatGPT API with a leading system message that has already proven itself: "You are a helpful assistant." The resulting output is divided into multiple messages if necessary and sent back to the user.
    try:
        print(full_prompt)
        completion = call_chatgpt(
            model=model,
            n=1,
            messages=[{"role": "system", "content": "You are a helpful assistant."},
                      {"role": "user", "content": full_prompt}]
        )
        result = completion['choices'][0]['message']['content']
    except Exception as e:
        context.bot.send_message(chat_id=user.id,
                                 text='It seems like the OpenAI service is responding with errors. Try sending the request again.')
        return

    parts = telegram_message_format(result)
    for part in parts:
        update.message.reply_text(part, reply_to_message_id=update.message.message_id)
That concludes the part with the code.
Now, a prototype of such a bot is available in a limited format at the following link. As the API is paid, you can make up to 3 requests per day, but I don't think it will limit anyone, as the most interesting thing is not a specialized bot focused on a narrow topic, but the code of the AnythingGPT project, which is available on GitHub with a short instruction on how to create your own bot to solve your specific task with your knowledge base based on this example. If you have read until the end, I hope this article has been helpful to you.