Modern applications often incorporate search to let users quickly find existing content on demand. It is hard to imagine any other feature that could retrieve this information as efficiently, which makes search essential in most applications. At the same time, even though the need to query and search is nearly universal, different applications take drastically different approaches to it.
In most cases, companies remain at a very basic level of search by simply querying OLTP databases directly. Requests may look like this:

SELECT id, title FROM entities WHERE description LIKE '%bow%';
More often, however, these requests turn into complex, multi-level table joins that are hard to read, slow, and primitive. Such queries are unable to comprehend context, require numerous customizations, and are very challenging to implement properly.
While it is possible to improve query execution time through materialized views, query caching, and other techniques, the added complexity results in a considerable lag between primary-data updates and consistent search results.
Open-source search engines such as Apache Lucene, Apache Solr, Elasticsearch, Sphinx, MeiliSearch, and Typesense constitute more efficient alternatives to primitive DB-based search solutions.
These tend to be comparatively faster and much better at dealing with complex queries and working with filters. But once these search engines are compared to counterparts like Google Search or DuckDuckGo, it becomes clear that open-source solutions fail at building proper search context and query modalities — they are incapable of understanding a query if the user provides a vague search request.
Imagine you simply cannot remember what that yellow citrus fruit with a sour taste is called! But you want to search the app for an article on how to grow this mysterious fruit. How do you go about that search?
Your query may be, “How to grow yellow sour citrus indoors”. Any of the aforementioned open-source search engines may struggle significantly to return relevant results for this query, even if the database does contain articles about growing “lemons”.
That is because the extraction of meaning from a query is a natural language task and is unlikely to be solved without AI components. GPT-3 is good at this task.
OpenAI offers embedding models built for exactly this task. The conversion of document text into a vector representation can happen in the background, while the vectorization of the search query must occur at runtime. OpenAI offers several GPT-3 model families for embedding documents:
● text-search-ada-doc-001: 1024 dimensions
● text-search-babbage-doc-001: 2048 dimensions
● text-search-curie-doc-001: 4096 dimensions
● text-search-davinci-doc-001: 12288 dimensions
Higher vector dimensionality means more embedded information, but also higher costs and slower searches.
Documents are usually long, while queries are generally short and incomplete. The vectorization of a document therefore differs significantly from the vectorization of a query in terms of content density and size. OpenAI knows this, and so it offers two paired model sets, -doc and -query:
● text-search-ada-query-001: 1024 dimensions
● text-search-babbage-query-001: 2048 dimensions
● text-search-curie-query-001: 4096 dimensions
● text-search-davinci-query-001: 12288 dimensions
It is important to note that the query and the documents must use the same model family, so that both produce output vectors of the same length.
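To make the pairing concrete, here is a minimal sketch of how one might encode the pairs listed above (the dictionary itself is just an illustration of mine, not part of OpenAI's API):

# -doc/-query model pairs and their shared output dimensionality
PAIRED_MODELS = {
    "ada": ("text-search-ada-doc-001", "text-search-ada-query-001", 1024),
    "babbage": ("text-search-babbage-doc-001", "text-search-babbage-query-001", 2048),
    "curie": ("text-search-curie-doc-001", "text-search-curie-query-001", 4096),
    "davinci": ("text-search-davinci-doc-001", "text-search-davinci-query-001", 12288),
}

# Picking one family guarantees document and query vectors of equal length
doc_model, query_model, dims = PAIRED_MODELS["babbage"]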
It may be easiest to observe and understand the power of this search solution through examples. For this example, let us draw on the TMDB 5000 Movie Dataset.
The dataset contains plenty of columns, but our vectorization process will be built around the title and overview columns only.
Title: Harry Potter and the Half-Blood Prince
Overview: As Harry begins his sixth year at Hogwarts, he discovers an old book marked as ‘Property of the Half-Blood Prince’, and begins to learn more about Lord Voldemort’s dark past.
Let’s map the dataset into ready-to-index text:
import pandas as pd

datafile_path = "./tmdb_5000_movies.csv"
df = pd.read_csv(datafile_path)

def combined_info(row):
    # Join the selected columns into a single "Column: value" block per movie
    columns = ['title', 'overview']
    columns_to_join = [f"{column.capitalize()}: {row[column]}" for column in columns]
    return '\n'.join(columns_to_join)

df['combined_info'] = df.apply(lambda row: combined_info(row), axis=1)
The embedding process is straightforward:
import openai

def get_embedding(text, model="text-search-babbage-doc-001"):
    # Newlines can degrade embedding quality, so flatten them first
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
get_embedding(df['combined_info'][0])
This block of code outputs a list whose length equals the output dimensionality of the model, which in the case of text-search-babbage-doc-001 is 2048.
A similar embedding process should be applied to all documents we would like to search on:
# One API call per document, so run this step in the background for large datasets
df['combined_info_search'] = df['combined_info'].apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001'))
df.to_csv('./tmdb_5000_movies_search.csv', index=False)
The combined_info_search column will hold a vector representation of combined_info.
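One practical caveat: if you later reload this CSV, pandas will read the vectors back as strings rather than lists of floats. A common fix (a minimal sketch, assuming numpy) is:

import numpy as np
import pandas as pd

# CSV serializes Python lists as strings; convert them back after reloading
df = pd.read_csv('./tmdb_5000_movies_search.csv')
df['combined_info_search'] = df['combined_info_search'].apply(eval).apply(np.array)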
And, surprisingly, that’s already it! Finally, we are ready to perform a sample search query:
from openai.embeddings_utils import get_embedding, cosine_similarity

def search_movies(df, query, n=3, pprint=True):
    # Embed the query with the paired -query model
    embedding = get_embedding(
        query,
        engine="text-search-babbage-query-001"
    )
    # Score every document against the query
    df["similarities"] = df.combined_info_search.apply(lambda x: cosine_similarity(x, embedding))
    # Keep the n most similar documents
    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
        .combined_info
    )
    if pprint:
        for r in res:
            print(r[:200])
            print()
    return res
res = search_movies(df, "movie about the wizardry school", n=3)
Title: Harry Potter and the Philosopher’s Stone
Overview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Harry Potter and the Goblet of Fire
Overview: Harry starts his fourth year at Hogwarts, competes in the treacherous Triwizard Tournament and faces the evil Lord Voldemort. Ron and Hermione help Harry manage the pressure — but Voldemort lurks, awaiting his chance to destroy Harry and all that he stands for.

Title: Harry Potter and the Prisoner of Azkaban
Overview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help.
The overview for ‘Harry Potter and the Philosopher’s Stone’ contains the words ‘wizardry’ and ‘school’, so it comes first in the search output. The second result no longer contains the word ‘school’ but still retains a word close to ‘wizardry’: ‘Triwizard’. The third result contains only a synonym of ‘wizardry’: ‘magic’.
There are, of course, a multitude of other movies within this database that feature schools or wizards (or both), but the above were the only ones returned to us. This is clear proof that the search solution works and actually understood the context of our query.
We used the Babbage model, whose vectors have only 2048 dimensions. Davinci's are six times larger (12,288 dimensions), so it can perform significantly better on highly complex queries.
The search solution may occasionally fail to produce output relevant to some queries. For instance, the query ‘movies about wizards in school’ produces:
Title: Harry Potter and the Philosopher’s Stone
Overview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Dumb and Dumberer: When Harry Met Lloyd
Overview: This wacky prequel to the 1994 blockbuster goes back to the lame-brained Harry and Lloyd’s days as classmates at a Rhode Island high school, where the unprincipled principal puts the pair in remedial courses as part of a scheme to fleece the school.

Title: Harry Potter and the Prisoner of Azkaban
Overview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help.
What is ‘Dumb and Dumberer: When Harry Met Lloyd’ doing here, you may wonder? Thankfully, this issue was not reproduced on models with higher-dimensional embeddings.
The search output should consist of documents sorted in descending order of relevance. To achieve this, we need to know the distance between the query and each document: the shorter the distance, the more relevant the document. Beyond a defined maximum distance, we should stop treating the remaining documents as relevant at all.
In the aforementioned example, we used cosine similarity.
Distance calculation algorithms tend to represent this similarity (difference) between a query and a document with a single number. However, we cannot rely on the raw value alone as a universal relevance threshold: a usable cutoff depends on the model and the dataset.
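For reference, a minimal sketch of both pieces: the cosine similarity itself, and a relevance cutoff applied to the similarities column computed earlier (the threshold value here is a made-up illustration that would have to be tuned per model and dataset):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

MIN_SIMILARITY = 0.35  # hypothetical cutoff, not from the source

def filter_relevant(df, threshold=MIN_SIMILARITY):
    # Drop documents whose similarity to the query falls below the cutoff
    return df[df["similarities"] >= threshold]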
If you would like, you may check out the resulting repository here:
Alternatively, you can play with it in Google Colab here.
We used a brute-force approach to sort the documents. Let's define:
● n: number of points in the training dataset
● d: data dimensionality
The search time complexity for the brute-force solution is O(n·d + n·log(n)): computing a similarity for every document costs O(n·d), where d depends on the model (2048 in the case of Babbage), and the O(n·log(n)) term comes from the sorting step.
It is critical to remind ourselves at this stage that smaller models are faster and cheaper. In the similarity calculation step, for instance, Ada is roughly twice as fast as Babbage, while Davinci is roughly six times slower.
Cosine similarity calculations between the query and 4,803 documents of 2,048 dimensions took 1,260 ms on my M1 Pro. In the current implementation, the time required grows linearly with the total number of documents. At the same time, this approach lends itself well to parallelization, as the sketch below shows.
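Here is a minimal sketch of a vectorized version of this step, assuming the document embeddings have been stacked into a single numpy matrix (e.g. doc_matrix = np.vstack(df['combined_info_search'].to_numpy())):

import numpy as np

def brute_force_search(doc_matrix, query_vec, n=3):
    # doc_matrix: (n_docs, d) array of document embeddings
    # Scoring every document is one matrix-vector product: the O(n * d) part
    sims = doc_matrix @ query_vec / (
        np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    # Ranking adds the O(n * log(n)) sorting step
    top = np.argsort(-sims)[:n]
    return top, sims[top]

Because the scoring collapses to a single matrix operation, numpy (or any BLAS backend) can spread the work across cores with no extra code.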
In search solutions, queries should be completed as quickly as possible, and this price is usually paid in training and pre-caching time. Instead of a brute-force scan, we can use data structures like a k-d tree, r-tree, or ball tree. Consider the article from
K-d trees, ball trees, and r-trees constitute data structures that are used to store and efficiently search for points in N-dimensional space, such as our meaning vectors.
K-d and ball trees are tree-based data structures that use a recursive, binary partitioning scheme to divide the space into regions, with each node in the tree representing a subregion. K-d trees are particularly efficient at searching for points within a specific range or finding the nearest neighbor to a given point.
Similarly, r-trees are also used to store points in N-dimensional space; however, they are much more efficient at searching for points within a specific region or finding all points within a certain distance of a given point. Importantly, r-trees use a different partitioning scheme from k-d trees and ball trees: they divide the space into ‘rectangles’ (bounding boxes) rather than binary partitions.
Tree implementations fall outside the scope of this article, and different implementations will lead to different search outputs.
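That said, we do not have to implement these trees ourselves. Below is a minimal sketch using scikit-learn's NearestNeighbors (my choice of library, not the article's), with the embeddings L2-normalized so that Euclidean ranking matches cosine ranking, and query_embedding standing in for a -query model embedding obtained as shown earlier:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Build the tree once, in the background
doc_matrix = np.vstack(df['combined_info_search'].to_numpy())
doc_matrix /= np.linalg.norm(doc_matrix, axis=1, keepdims=True)  # L2-normalize rows

index = NearestNeighbors(n_neighbors=3, algorithm='ball_tree').fit(doc_matrix)

# Query at runtime with a normalized query vector
query_vec = np.asarray(query_embedding, dtype=float)
query_vec /= np.linalg.norm(query_vec)
distances, indices = index.kneighbors([query_vec])  # indices point back into df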
Perhaps the most significant disadvantage of the current search solution is that we must call an external OpenAI API to retrieve the query embedding vector. No matter how quickly our algorithm finds the nearest neighbors, this sequential blocking step will always be required.
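The latency figures below presumably come from timing repeated query-embedding calls. A minimal sketch of how one might reproduce such a measurement (this helper is my illustration, not the author's benchmark code):

import time
import openai

def measure_query_latency(model, queries):
    # Time each embedding request round-trip in milliseconds
    durations = []
    for q in queries:
        start = time.perf_counter()
        openai.Embedding.create(input=[q], model=model)
        durations.append((time.perf_counter() - start) * 1000)
    durations.sort()
    return {
        "avg_ms": sum(durations) / len(durations),
        "median_ms": durations[len(durations) // 2],
        "min_ms": durations[0],
        "max_ms": durations[-1],
    }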
Model                           Dimensions   Queries   Avg       Median    Max        Min
text-search-babbage-query-001   2048         100       225 ms    207 ms    1301 ms    176 ms
text-search-ada-query-002       1536         100       264 ms    250 ms    859 ms     215 ms
text-search-davinci-query-001   12288        100       379 ms    364 ms    1161 ms    271 ms
If we take the median as the reference point, ada-002 is roughly 21% slower than babbage-001 (250 ms vs. 207 ms), and davinci-001 is roughly 76% slower (364 ms vs. 207 ms).
Referring to
Also, the cost of embedding (indexing) an entire document corpus is relatively high with OpenAI.
Alternatively, you may consider trying
Elasticsearch 8.0 supports efficient approximate nearest neighbor (ANN) search, which can solve our problem orders of magnitude faster than any linear kNN scan. Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into graphs based on similarity. In a benchmark reported by Elasticsearch on a dataset of 10 million embedded vectors, ANN achieved an impressive 200 queries per second on a single compute-focused machine, while brute-force kNN managed only 2 queries per second.
According to ElasticSearch’s
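To make this concrete, here is a minimal sketch using the official Elasticsearch Python client (the index name, field names, and dimensions are illustrative assumptions; note that the maximum dimensionality allowed for indexed dense vectors varies by Elasticsearch version):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A dense_vector field with index=True enables HNSW-based ANN search
es.indices.create(index="movies", mappings={
    "properties": {
        "combined_info": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 2048,           # must match the embedding model's output
            "index": True,
            "similarity": "cosine",
        },
    }
})

# The knn search option requires a recent 8.x release
resp = es.search(index="movies", knn={
    "field": "embedding",
    "query_vector": query_embedding,  # a -query model embedding from earlier
    "k": 3,
    "num_candidates": 50,
})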
As you will hopefully have seen, GPT-3 embeddings are not the perfect solution for any and all search problems due to their indexing complexity and cost, as well as the high computational complexity of the search operation, even for approximate methods. Nevertheless, GPT-3 embeddings remain an excellent choice for anyone looking for a powerful backbone for a search solution with infrequent querying and modest indexing requirements.
Moreover, it’s worth adding that Microsoft recently announced that the Bing search engine now uses a new, upgraded version of GPT-3.5 called ‘Prometheus’, initially developed for search. According to the announcement, the new Prometheus language model allows Bing to improve relevance, annotate snippets more accurately, provide fresher results, understand geolocation, and improve security. This may open up new possibilities for using autoregressive language models in search solutions, which we will definitely keep an eye on going forward.
Oleksandr Knyga is a Solution Architect and open-source believer with a big passion for Software Development. Oleksandr lives the life of a digital nomad. When not tapping on the keyboard, he can be found hiking, snowboarding, fishing, and exploring the city. Oleksandr has an MBA and a Computer Science Master’s degree.
Also published here.