Since ChatGPT captured the public imagination in early 2023, there has been an explosion of interest in commercializing applications built on large language models. One of the most interesting of these has been the creation of expert chat systems that can answer natural language queries against a proprietary knowledge base.
One of the most popular techniques in this space is retrieval-augmented generation, or RAG, which uses document embeddings to find items relevant to the user query before using a large language model to generate an answer.
The technique is powerful: it allows for cheap and fast lookups, gives the knowledge base the flexibility to change and evolve over time, and produces well-informed, accurate answers that greatly reduce hallucinations and mistakes.
For a more in-depth analysis of RAG systems and how to implement one, you can read my previous article here.
As powerful as RAG systems can be, the architecture has some severe limitations. We explored several of them in my previous article and proposed ways to improve the architecture.
Today, we will explore another limitation of the embeddings-based approach and propose a way to work around it.
Suppose that we are a publication that wants to create a chat interface that allows readers and clients to ask questions.
Of course, we will be able to answer questions like “What is your opinion on X?” or “What have you said about Y?” with a simple RAG implementation, but a RAG architecture really starts to struggle when you deal with questions like “What did you say about X in 2021?” or “How has your coverage of Y changed between 2021 and 2023?”
One of the challenges with an embeddings-based RAG is that embeddings models are generally unable to encode metadata in a systematic way, and therefore any lookups that require knowledge of things like publication date or author name will give your RAG system quite a bit of trouble.
We can solve this problem by leveraging one of the most exciting features of large language models - code generation. We will look at a real-world publication, design an LLM-based algorithm that improves on the RAG architecture, and build a chatbot based on the algorithm.
Today, we will be looking at the CB Insights Newsletter, a popular daily newsletter covering startups and technology. As a former full-stack developer at CB Insights, I often looked forward to the founder’s unique brand of wit and insight at the end of the workday.
Today, we will use the CB Insights newsletter archive as the base data to build a chatbot that can answer metadata-based natural language queries in a way that is superior to a vanilla embeddings-based RAG implementation.
Specifically, we want the chatbot to be able to answer questions like “What did you say about the future of the fintech industry in the summer of 2022?”
Let’s get to it!
To accomplish this task, we will use the following technologies:
If you’ve followed my other articles, it should be no surprise that I will be using Python for most of the code in this article. Python has excellent libraries for web scraping, data processing, and OpenAI integration, all of which we will leverage for our project today.
SQL is the query language that allows users to interact with several major relational databases, including SQLite, MySQL, PostgreSQL, and SQL Server. The language is a set of instructions to the database on how to retrieve, combine, and manipulate data before returning it to the user.
LLM code generation is a technique that has gotten extensive attention over the last few months, as several foundation models, including GPT-3.5, GPT-4, and Llama 2, have demonstrated the ability to generate code of surprising complexity in response to natural language queries.
Specially trained and tuned systems, such as GitHub’s Copilot, are able to write remarkably intelligent code by using models that are specifically designed for code generation, but a properly prompted general-purpose GPT model already has exceptional capabilities when it comes to writing code.
Semantic embedding is the backbone of most RAG implementations. By using a series of natural language processing techniques, we can turn natural language text into a vector of numbers that represents the text’s content within a semantic vector space.
We can then use vector algebra to manipulate these embeddings, allowing us to determine the relationship between two text corpora using mathematical methods.
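As a minimal illustration of that idea (not part of the project code), here is how the similarity between two embedded texts can be computed; the vectors below are toy stand-ins for real embedding output:
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity: close to 1 means the texts point in the same semantic direction.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([0.1, 0.9, 0.0], [0.2, 0.8, 0.1]))  # high: similar texts
print(cosine_sim([0.1, 0.9, 0.0], [0.9, 0.0, 0.1]))  # low: unrelated texts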
Reported to have on the order of 1.7 trillion parameters, GPT-4 is the most powerful transformer-based large language model on the market today. It is capable of understanding large bodies of text, performing complex reasoning, and generating long, cogent answers in response to difficult prompts.
GPT-3.5, GPT-4’s much smaller cousin, is the model that powered ChatGPT when it took the world by storm. It is capable of processing incredibly complex prompts, and what it lacks in pure reasoning ability, it makes up for in speed and cost savings.
For simpler tasks, GPT-3.5 strikes a good balance between speed, cost, and accuracy.
Before we build our AI, we need to get the data. To do this, we can use the CB Insights [newsletter archive page](https://www.cbinsights.com/newsletter/), which has a collection of past newsletters.
To get all of the links, we can use Python’s requests and BeautifulSoup libraries like so:
import requests
from bs4 import BeautifulSoup

res = requests.get('https://www.cbinsights.com/newsletter/')
soup = BeautifulSoup(res.text, 'html.parser')

# Collect the (title, URL) pairs for every archived newsletter.
article_links = [[i.text, i['href']] for i in soup.find_all('a') if 'campaign-archive' in i.get('href', '')]
Once we have the links, we can go to each of the links, and download the article HTML. With Python’s list comprehension, we can do this in one line:
article_soups = [BeautifulSoup(requests.get(link[1]).text, 'html.parser') for link in article_links]
This will take a while, but eventually, all of the links should be scraped.
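If you would rather be gentle on the archive server, a slight variant of the one-liner above (my own tweak, not part of the original walkthrough) adds a short pause between requests:
import time

article_soups = []
for link in article_links:
    article_soups.append(BeautifulSoup(requests.get(link[1]).text, 'html.parser'))
    time.sleep(1)  # pause briefly between requests to avoid hammering the server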
Now, we can use BeautifulSoup to extract the relevant sections:
import re

# SEO optimizations cause some articles to appear twice, so we dedupe them.
# We also collapse multiple newlines and strip stray unicode characters.
def get_deduped_article_tables(article_table):
    new_article_tables = []
    for i in article_table:
        text_content = re.sub(r'\n{2,}', '\n', i.replace('\xa0', '').strip())
        if text_content not in new_article_tables or text_content == '':
            new_article_tables.append(text_content)
    return new_article_tables
result_json = {}
for soup_meta, soup_art in zip(article_links, article_soups):
    article_tables = []
    cur_article = []
    for table in soup_art.find_all('table'):
        # Section dividers mark the boundary between articles within a newsletter.
        if table.attrs.get('mc:variant') == 'Section_Divider':
            article_tables.append(get_deduped_article_tables(cur_article))
            cur_article = []
        else:
            cur_article.append(table.text)
    article_tables.append(get_deduped_article_tables(cur_article))
    result_json[soup_meta[0]] = article_tables
Let’s do some more processing, and turn it into a DataFrame:
import pandas as pd

result_rows = []
for article_name, article_json in result_json.items():
    article_date = article_json[0][1]
    for idx, tbl in enumerate(article_json[1:]):
        txt = '\n'.join(tbl).strip()
        if txt != '':
            result_rows.append({
                'article_name': article_name, 'article_date': article_date, 'idx': idx, 'text': txt,
            })

df = pd.DataFrame(result_rows)
If you inspect the dataframe, you should see something like the below:
While we’re at it, let’s also generate the embeddings for the articles. With OpenAI’s ada embedding model, this is quite easy.
import openai
EMBEDDING_MODEL = "text-embedding-ada-002"
openai.api_key = [YOUR KEY]
df['embedding'] = df['text'].map(lambda txt: openai.Embedding.create(model=EMBEDDING_MODEL, input=[txt])['data'][0]['embedding'])
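A quick note on speed and cost: the map call above issues one API request per row. Since the embedding endpoint accepts a list of inputs per call, you can batch the requests instead. Here is a rough sketch; the batch size of 100 is an arbitrary choice of mine:
batch_size = 100  # arbitrary batch size; the endpoint accepts multiple inputs per call
embeddings = []
for start in range(0, len(df), batch_size):
    batch = df['text'].iloc[start:start + batch_size].tolist()
    resp = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    # Sort by index so the embeddings line up with the input order.
    embeddings.extend([item['embedding'] for item in sorted(resp['data'], key=lambda d: d['index'])])
df['embedding'] = embeddings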
Now that we have the data we are using for this exercise, let’s load the data into a database. For this exercise, we will use SQLite, which is a lightweight, self-contained database system that comes packaged with Python.
Please note that in a production environment, you will likely want to use a proper database instance such as MySQL or PostgreSQL with minor tweaks to the SQL we’re using here, but the general technique will remain the same.
To instantiate and load the database, simply run the following in Python. Note that in addition to the article text and the embeddings, we are also saving some metadata, namely the publication date.
Also note that SQLite3, unlike most other SQL databases, uses a dynamic typing system, so we do not have to specify the data types in the create query.
import sqlite3
import json
con = sqlite3.connect("./cbi_article.db")
cur = con.cursor()
cur.execute("CREATE TABLE article(name, date, idx, content, embedding_json)")
con.commit()
rows = []
for _, row in df.iterrows():
    rows.append([row['article_name'], row['article_date'], row['idx'], row['text'], json.dumps(row['embedding'])])
cur.executemany("INSERT INTO article VALUES (?, ?, ?, ?, ?)", rows)
con.commit()
And let’s try querying the data.
res = cur.execute("""
SELECT name, date, idx
FROM article
WHERE date >= DATE('now', '-2 years');
""")
res.fetchall()
Should yield something like:
Looking pretty good!
Now that we have the data loaded in the SQLite database, we can get to the next stage. Remember that one of the challenges with an embeddings-only RAG implementation is the lack of flexible metadata lookup capabilities.
However, now that we have the metadata loaded into a SQL database, we can use GPT’s code generation capabilities to perform flexible metadata lookups.
To generate SQL code, we can use some simple prompt engineering.
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a SQL query writer that can construct queries based on incoming questions. Answer with only the SQL query."},
        {"role": "user", "content": """
Suppose we have the SQLite database table called "article" with the following columns, which contains newsletter articles from a publication:
name, date, idx, content, embedding_json
Write a query that would retrieve the rows necessary to answer the following user question.
Only filter on date. Do not filter on any other column. Make sure the query returns every row in the table by name.
Reply only with the SQL query.
User question: What did you say about the future of the fintech industry in summer of 2022?
"""},
    ]
)
Notice the prompt engineering here: 1) we give the database schema but keep it simple, 2) we specify the return columns, 3) we specify which columns are available for filtering, and 4) we specify the SQL flavor. This prompt should generate SQL code like the following:
SELECT * FROM article WHERE date BETWEEN '2022-06-01' AND '2022-08-31'
Now, because the model’s output is not deterministic, you will sometimes end up with idiosyncrasies in the generated code. To handle these cases, we can wrap the execution in a try-except block and regenerate the query on failure. We don’t want to retry indefinitely, of course, so if we can’t generate a proper SQL query in three tries, we will just quit and fall back to vanilla RAG.
We can implement the filter like so:
res = []
for i in range(3):
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a SQL query writer that can construct queries based on incoming questions. Answer with only the SQL query."},
            {"role": "user", "content": """
Suppose we have the SQLite database table called "article" with the following columns, which contains newsletter articles from a publication:
name, date, idx, content, embedding_json
Write a query that would retrieve the rows necessary to answer the following user question.
Only filter on date. Do not filter on any other column. Make sure the query returns every row in the table by name.
Reply only with the SQL query.
User question: What did you say about the future of the fintech industry in summer of 2022?
"""},
        ]
    )
    generated_query = response.choices[0].message['content']

    is_query_safe = True
    for no_use_word in {'DELETE', 'UPDATE', 'DROP'}:
        if no_use_word in generated_query.upper():
            is_query_safe = False

    if not is_query_safe:
        break  # the user input is likely malicious. Try to answer the question with vanilla RAG

    try:
        res = cur.execute(generated_query).fetchall()
    except sqlite3.Error:
        continue  # the generated SQL failed to run, so try generating it again

    if len(res) > 0:
        break

if len(res) == 0:
    # vanilla RAG in memory. Use a vector DB in production please.
    res = cur.execute('''SELECT * FROM article''').fetchall()
This is a relatively crude filter, so in production use cases, you will likely want to run more checks for relevance and SQL correctness, but this is sufficient for our example.
A quick note on AI security: we should be careful when running code that is returned from an AI, especially if user input was used as part of the prompt.
If we do not sanitize the output, we leave ourselves vulnerable to prompt-engineering attacks where the user tries to manipulate the AI into generating update or delete statements.
Therefore, we should always check to make sure the output is as we expect before running the code on our computer.
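To make those checks concrete, here is one way you might fold them into a small helper. This is a sketch only; the SELECT-only rule and the keyword list are my assumptions about what counts as safe for this application:
def is_generated_query_safe(query):
    # Allow only a single, read-only SELECT statement.
    stripped = query.strip().rstrip(';')
    if not stripped.upper().startswith('SELECT'):
        return False
    if ';' in stripped:
        return False  # reject anything that looks like multiple statements
    for forbidden in ('DELETE', 'UPDATE', 'DROP', 'INSERT', 'ALTER'):
        if forbidden in stripped.upper():
            return False
    return True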
Run the following code to see the retrieved result:
df = pd.DataFrame([{c[0]: v for c, v in zip(cur.description, row)} for row in res])
And now, you should see the following result:
Now that we have the result of the metadata lookup, the rest is simple. We calculate the cosine similarity for all of the retrieved results like so:
from openai.embeddings_utils import cosine_similarity

user_question = 'What did you say about the future of the fintech industry in summer of 2022?'
q_embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=[user_question])['data'][0]['embedding']
df['cosine_similarity'] = df['embedding_json'].map(lambda js: cosine_similarity(json.loads(js), q_embed))
And now, we can take the top 10 newsletters and use prompt engineering to answer the question. We chose 10 here because each newsletter excerpt is relatively short.
If you’re working with articles that are longer, you will want to potentially use fewer articles, or use the technique covered in the bonus section.
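As an aside, if you want a more principled cutoff than a fixed top 10, one option is to keep adding excerpts until you hit a token budget. Here is a rough sketch using the tiktoken library; the 6,000-token budget is an arbitrary number of my choosing, not from the original walkthrough:
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
token_budget = 6000  # arbitrary budget; leaves headroom for the question and the answer
selected_rows = []
used_tokens = 0
for _, row in df.sort_values('cosine_similarity', ascending=False).iterrows():
    n_tokens = len(enc.encode(row['content']))
    if used_tokens + n_tokens > token_budget:
        break
    selected_rows.append(row)
    used_tokens += n_tokens
The code below sticks with the simpler fixed top-10 cutoff.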
answer_prompt = '''
Consider the following newsletter excerpts from the following dates:
'''

for _, row in df.sort_values('cosine_similarity', ascending=False).iloc[:10].iterrows():
    answer_prompt += """
======= Date: %s ====
%s
=====================
""" % (row['date'], row['content'])

answer_prompt += """
Answer the following question: %s
""" % user_question
response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a tech analyst that can summarize the content of newsletters"},
        {"role": "user", "content": answer_prompt},
    ]
)
This should give you a result similar to the following:
The Future of Fintech was discussed in various newsletters over the summer of 2022. There was a notable slowdown in Fintech as Q2'22 funding plummeted towards 2020 levels, after a high in 2021. The Q2’22 Report highlighted the decline in global fintech investments.
However, fintech's future appeared promising as a significant shift toward early-stage startups was noted, particularly in the payments space. Global investment in the payments sector fell 43% from Q1’22 to $5.1B in Q2’22, owing to funding returning to normal after 2021's peaks.
New entrants in this realm attracted a higher share of deals (63%) so far in 2022, marking investor interest in start-ups. Also reported was the increasing competition Fintech was causing to retail banks, forcing them to prioritize the digitization of core services.
The banking sector responded by focusing on customer experience enhancements, particularly Mobile Banking, using technologies such as chatbots and customer analytics platforms. All this points to a vibrant and competitive FinTech industry ahead.
Which is quite good! If you fact-check the answer by looking at the answer prompt, you’ll notice that the statistics all come from the source material. You will also notice that there is some idiosyncratic formatting in the source material that GPT-4 was able to ignore. This shows the flexibility of LLMs as data summarization systems.
One of the issues you may encounter in this process is that, when the corpus is extremely large, the final prompt can be very large. This can be costly if you’re using GPT-4, but a very long prompt also has a chance to confuse the model.
To solve this, we can pre-process the individual articles with GPT-3.5, compressing the final prompt we’re sending to GPT-4 at the final step.
# This runs once per retrieved excerpt, where `row` is the excerpt being summarized.
summarization_prompt = '''
Summarize the following passage and extract only portions that are relevant to answering the user question.
Passage:
=======
%s
=======
User Questions: %s
''' % (row['content'], user_question)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a summarizer of tech industry reports"},
        {"role": "user", "content": summarization_prompt},
    ]
)
We can then put the summaries into the prompt instead of the full article text, with significant token savings in the final prompt.
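Stitching that together, the loop that builds the answer prompt might look something like the following. This is a sketch of how I would wire it up, not code from the original walkthrough:
answer_prompt = '''
Consider the following newsletter excerpts from the following dates:
'''

for _, row in df.sort_values('cosine_similarity', ascending=False).iloc[:10].iterrows():
    # Compress each excerpt with GPT-3.5 before it goes into the GPT-4 prompt.
    summarization_prompt = '''
Summarize the following passage and extract only portions that are relevant to answering the user question.
Passage:
=======
%s
=======
User Questions: %s
''' % (row['content'], user_question)
    summary = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a summarizer of tech industry reports"},
            {"role": "user", "content": summarization_prompt},
        ]
    ).choices[0].message['content']
    answer_prompt += """
======= Date: %s ====
%s
=====================
""" % (row['date'], summary)
From here, the “Answer the following question” suffix and the final GPT-4 call are the same as before.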
Now that we have the Python code written, let’s package this as a simple web application.
It’s relatively simple to package the code as a backend API with Flask. Simply create a function, and link it up to Flask like so:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import sqlite3
import json
import openai
from openai.embeddings_utils import cosine_similarity
from flask import Flask, request, jsonify
from flask_cors import CORS
app = Flask(__name__)
CORS(app)
EMBEDDING_MODEL = "text-embedding-ada-002"
openai.api_key = [Your OpenAI Key]
db_location = [Location of your SQLite DB]
def process_user_query(user_question):
    con = sqlite3.connect(db_location)
    cur = con.cursor()
    res = []
    for i in range(3):
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": "You are a SQL query writer that can construct queries based on incoming questions. Answer with only the SQL query."},
                {"role": "user", "content": """
Suppose we have the SQLite database table called "article" with the following columns, which contains newsletter articles from a publication:
name, date, idx, content, embedding_json
Write a query that would retrieve the rows necessary to answer the following user question.
Only filter on date. Do not filter on any other column. Make sure the query returns every row in the table by name.
Reply only with the SQL query.
User question: %s
""" % user_question},
            ]
        )
        generated_query = response.choices[0].message['content']

        is_query_safe = True
        for no_use_word in {'DELETE', 'UPDATE', 'DROP'}:
            if no_use_word in generated_query.upper():
                is_query_safe = False

        if not is_query_safe:
            break  # the user input is likely malicious. Try to answer the question with vanilla RAG

        try:
            res = cur.execute(generated_query).fetchall()
        except sqlite3.Error:
            continue  # the generated SQL failed to run, so try generating it again

        if len(res) > 0:
            break

    if len(res) == 0:
        # vanilla RAG in memory. Use a vector DB in production please.
        res = cur.execute('''SELECT * FROM article''').fetchall()
    df = pd.DataFrame([{c[0]: v for c, v in zip(cur.description, row)} for row in res])

    q_embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=[user_question])['data'][0]['embedding']
    df['cosine_similarity'] = df['embedding_json'].map(lambda js: cosine_similarity(json.loads(js), q_embed))

    answer_prompt = '''
Consider the following newsletter excerpts from the following dates:
'''

    for _, row in df.sort_values('cosine_similarity', ascending=False).iloc[:10].iterrows():
        answer_prompt += """
======= Date: %s ====
%s
=====================
""" % (row['date'], row['content'])

    answer_prompt += """
Answer the following question: %s
""" % user_question
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a tech analyst that can summarize the content of newsletters"},
            {"role": "user", "content": answer_prompt},
        ]
    )
    return response.choices[0].message['content']
@app.route('/process_user_question', methods=["POST"])
def process_user_question():
    return jsonify({
        'status': 'success',
        'result': process_user_query(request.json['user_question'])
    })

app.run()
And that is actually all we need to do for the backend!
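With the server running, you can sanity-check the endpoint from another Python shell before touching the frontend. This quick smoke test assumes Flask’s default port of 5000:
import requests

resp = requests.post(
    "http://127.0.0.1:5000/process_user_question",
    json={"user_question": "What did you say about the future of the fintech industry in summer of 2022?"},
)
print(resp.json()['result'])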
Because we only have one endpoint and don’t need a lot of state in our application, the frontend code should be pretty simple. Recall that in a past article, we set up a React application with routing that lets us render components at specific routes.
Simply follow the instructions in that article to set up a React project, and add the following component on a route of your choice:
import React, {useState} from 'react';
import axios from 'axios';

const HNArticle = () => {
    const [result, setResult] = useState('');
    const [message, setMessage] = useState('');
    const [question, setQuestion] = useState('');

    const askQuestion = () => {
        axios.post("http://127.0.0.1:5000/process_user_question", {user_question: question})
            .then(r => r.data)
            .then(d => {
                console.log(d);
                setResult(d.result);
            });
    }

    return <div className="row" style={{marginTop: '15px'}}>
        <div className="col-md-12" style={{marginBottom: '15px'}}>
            <center>
                <h5>Hackernoon CB Insights Demo</h5>
            </center>
        </div>
        <div className="col-md-10 offset-md-1 col-sm-12 col-lg-8 offset-lg-2" style={{marginBottom: '15px'}}>
            <ul className="list-group">
                <li className="list-group-item">
                    <h6>Your Question</h6>
                    <p><input className="form-control" placeholder="Question" value={question} onChange={e => setQuestion(e.target.value)} /></p>
                    <p>{message}</p>
                    <p>
                        <button className="btn btn-primary" onClick={askQuestion}>Ask</button>
                    </p>
                </li>
                {result ? <li className="list-group-item">
                    <h6>Response</h6>
                    {result.split("\n").map((p, i) => <p key={i}>{p}</p>)}
                </li> : ''}
            </ul>
        </div>
    </div>;
}

export default HNArticle;
Run the code, and you should see an interface like this:
Ask a question, and after a little while, you should see the output:
And voila! We have successfully built a chatbot with advanced query capabilities beyond a vanilla RAG system!
In today’s article, we built a chatbot with powerful code-generation capabilities. This is one example of a new class of LLM applications now being built by AI pioneers, applications that leverage data, programming languages, and natural language comprehension to build generative AI systems with specialist know-how.
These specialist systems are the key to unlocking commercial viability for LLM applications looking to provide value beyond what is directly available from platform providers like OpenAI and Anthropic.
Code generation is just one of the techniques that was made possible by the recent generation of commercially available large language models.
If you have ideas on how LLMs can be commercialized or would like to have a conversation about AI, please do not hesitate to reach out on LinkedIn or GitHub. I have had many insightful conversations with readers over the course of the last year and look forward to many more!