Modern applications often incorporate search solutions to enable users to quickly access existing content on demand. It is difficult to envision any other functionality that could efficiently retrieve this information, making search an essential feature in most applications. Custom-made search solution options Simultaneously, even if the need to query and search is very common, different applications take drastically different approaches. In most cases, companies remain at a very basic level of search by simply querying OLTP databases directly. Requests may look like this: . SELECT id, title FROM, entities WHERE, description LIKE ‘%bow% However, more frequently, they are represented in complex, multiple-level table join structures which are impossible to read, slow, and primitive. They are unable to comprehend context, require numerous customizations, and are very challenging to implement properly. While it is possible to improve query execution time through materialized views, query caching, and other technics, the added complexity results in a considerable lag between primary updates and consistent search results. More efficient alternatives to primitive DB-based search solutions may constitute open-source search engines such as Apache Lucene, Apache Solr, Elasticsearch, Sphinx, MeiliSearch, Typesense, etc. These tend to be comparatively faster and much better at dealing with complex queries and working with filters. But once these search engines are compared to counterparts like Search or DuckDuckGo, it becomes clear that open-source solutions fail at building proper search context and query modalities — they are incapable of understanding a query if the user provides a vague search request. Google Extracting the meaning from the query Imagine you simply cannot remember what that yellow citrus fruit with a sour taste are called! But you want to search the app for an article on how to grow this mysterious fruit. How do you go about that search? Your query may be, “How to grow yellow sour citrus indoors”. Any of the aforementioned open-source search engines may struggle significantly to return relevant results for this query, even if the database does contain articles about growing “lemons”. That is because the extraction of meaning from a query is a natural language task and is unlikely to be solved without AI components. is good at this task. GPT-3 OpenAI offers based on GPT-3 that converts natural language text into a vector of floating point numbers. An embedding is essentially a low-dimensional space into which high-dimensional vectors can be translated. In this case, the high-dimensional vector is text and the low-dimensional is an output vector of fixed size. The distance between vectors represents how strongly they correlate. The smaller the distance, the higher the correlation. By redefining text as a vector-based form, the task is reduced to a classical ML search problem. an embeddings API Choosing the right model The conversion of document text into a vector representative can happen in the background, while the vectorization of the search query should occur during runtime. There are several GPT-3 model families that OpenAI offers: text-search-ada-doc-001: 1024
text-search-babbage-doc-001: 2048
text-search-curie-doc-001: 4096
text-search-davinci-doc-001: 12288 Higher vector dimensions lead to more embedded information and, thus, also higher costs and slower searches. Documents are usually long and queries are generally short and incomplete. Therefore, the vectorization of any document differs significantly from any query’s vectorization considering the content’s density and size. OpenAI know that, and so they offer two paired models, and : -doc -query text-search-ada-query-001: 1024
text-search-babbage-query-001: 2048
text-search-curie-queryc-001: 4096
text-search-davinci-query-001: 12288 It is important to note that the query and document must both utilize the same model family and have the same length of the output vector. Example dataset It may be easiest to observe and understand the power of this search solution through examples. For this example, let us draw on the which contains metadata about roughly 5,000 movies from TMDb. We will build a search solution relying only on movie documents, not reviews. TMDB 5000 Movie Dataset The dataset contains plenty of columns, but our vectorization process will be built around the title and overview columns only. Example of the record: Title: Harry Potter and the Half-Blood Prince

Overview: As Harry begins his sixth year at Hogwarts, he discovers an old book marked as 
‘Property of the Half-Blood Prince’, and begins to learn more about Lord Voldemort’s dark past. Let’s map the dataset into ready-to-indexing text: datafile_path = "./tmdb_5000_movies.csv"
df = pd.read_csv(datafile_path)

def combined_info(row):
  columns = ['title', 'overview']
  columns_to_join = [f"{column.capitalize()}: {row[column]}" for column in columns]
  return '\n'.join(columns_to_join)

df['combined_info'] = df.apply(lambda row: combined_info(row), axis=1) The embedding process is straightforward: def get_embedding(text, model="text-search-babbage-doc-001"):
   text = text.replace("\n", " ")
   return openai.Embedding.create(input = [text], model=model)['data'][0]['embedding']

get_embedding(df['combined_info'][0]) This block of code outputs a list the size of which is equal to the parameters that the model is operating with, which in the case of is 2048. text-search-babbage-doc-001 A similar embedding process should be applied to all documents we would like to search on: df['combined_info_search'] = df['combined_info'].apply(lambda x: get_embedding(x, model='text-search-babbage-doc-001'))
df.to_csv('./tmdb_5000_movies_search.csv', index=False) Column will hold a vector representation of the combined_text. combined_info_search And, surprisingly, that’s already it! Finally, we are ready to perform a sample search query: from openai.embeddings_utils import get_embedding, cosine_similarity

def search_movies(df, query, n=3, pprint=True):
    embedding = get_embedding(
        query,
        engine="text-search-babbage-query-001"
    )
    df["similarities"] = df.combined_info_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(n)
        .combined_info
    )
    if pprint:
        for r in res:
            print(r[:200])
            print()
    return res


res = search_movies(df, "movie about the wizardry school", n=3) These are the results we get: Title: Harry Potter and the Philosopher’s StoneOverview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Harry Potter and the Goblet of FireOverview: Harry starts his fourth year at Hogwarts, competes in the treacherous Triwizard Tournament and faces the evil Lord Voldemort. Ron and Hermione help Harry manage the pressure — but Voldemort lurks, awaiting his chance to destroy Harry and all that he stands for.

Title: Harry Potter and the Prisoner of AzkabanOverview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help. The overview for ‘ ’ contains the words ‘wizardry’ and ‘school’ and comes first in the search output. The second result no longer contains the word ‘school’, but still retains words close to ‘wizardry’, ‘Triwizard’. The third result contains only a synonym of ‘wizardry’ — magic. Harry Potter and the Philosopher’s Stone There are, of course, a multitude of other movies within this database that feature schools or wizards (or both), but the above were the only ones returned to us. This is clear proof that the search solution works and actually understood the context of our query. We used the Babbage model with only 2048 parameters. Davinci has six times more (12,288) parameters and can, thus, perform significantly better with regard to highly complex queries. The search solution may occasionally fail to produce output relevant to some queries. For instance, the query ‘movies about wizards in school’ produces: Title: Harry Potter and the Philosopher’s StoneOverview: Harry Potter has lived under the stairs at his aunt and uncle’s house his whole life. But on his 11th birthday, he learns he’s a powerful wizard — with a place waiting for him at the Hogwarts School of Witchcraft and Wizardry. As he learns to harness his newfound powers with the help of the school’s kindly headmaster, Harry uncovers the truth about his parents’ deaths — and about the villain who’s to blame.

Title: Dumb and Dumberer: When Harry Met LloydOverview: This wacky prequel to the 1994 blockbuster goes back to the lame-brained Harry and Lloyd’s days as classmates at a Rhode Island high school, where the unprincipled principal puts the pair in remedial courses as part of a scheme to fleece the school.

Title: Harry Potter and the Prisoner of AzkabanOverview: Harry, Ron and Hermione return to Hogwarts for another magic-filled year. Harry comes face to face with danger yet again, this time in the form of an escaped convict, Sirius Black — and turns to sympathetic Professor Lupin for help. What is ‘Dumb and Dumberer: When Harry Met Lloyd’ doing here you may wonder? Thankfully, this issue was not reproduced on parameters with more parameters. Calculating the distance between the query and documents The search output should consist of documents sorted in descending order by relevance. To achieve this, we should be aware of the distance between the current query and each document. The shorter the length, the comparatively more relevant the output. Then, after a defined maximum reach, we should stop considering the relevance of the remaining documents. In the aforementioned example, we used to calculate distance due to the high dimensionality of the vector space. With the babbage model, we have 2048 parameters. cosine similarity Distance calculation algorithms tend to represent this similarity (difference) between a query and a document with a single number. However, we cannot rely on the because of the — distances will be too similar. This is because Euclidean distance becomes highly impractical beyond around seven dimensions — at this point, distances fall into the same buckets and become almost identical. Euclidean distance curse of dimensionality If you would like, you may check out the resulting repository here: https://github.com/weblab-technology/gpt3-embeddings-movies-search?embedable=true Alternatively, you can play with it in Google Colab . here Time complexity We used the brute force approach to sort the documents. Let’s determine: ● n: number of points in the training dataset ● d: data dimensionality The search time complexity for brute force solutions is . Parameter depends on the model (in the case of Babbage, it is equal to 2048) while we have block due to the sorting step. O(n * d * n * log(n)) d O(nlog(n)) It is critical to remind ourselves at this stage that smaller models are faster and cheaper. For instance, in the search case similarity calculation step, the Ada model is two times faster, while the Davinci model is six times slower. Cosine similarity calculations between the query and 4803 documents of 2048 dimensions took 1260 ms on my M1 Pro. In the current implementation, time required to calculate would grow linearly to the total number of documents. Simultaneously, this approach supports computation parallelization. Alternatives to the brute force solution In search solutions, queries should be completed as quickly as possible. And this price is usually paid on the side of training and pre-caching time. We can use data structures like a k-d tree, r-tree, or ball tree. Consider the article from about the computational complexity analysis of these methods: They all lead to computational complexity close to , where is the number of elements we would like to return within a single batch. Towards Data Science O(k * log(n)) k K-d trees, ball trees, and r-trees constitute data structures that are used to store and efficiently search for points in N-dimensional space, such as our meaning vectors. K-d and ball trees are tree-based data structures that use an iterative, binary partitioning scheme to divide the space into regions, with each node in the tree representing a subregion. K-d trees are particularly efficient at searching for points within a specific range or finding the nearest neighbor to a given point. Similarly, r-trees are also used to store points in N-dimensional space, however, they are much more efficient at searching for points within a specific region or finding all points within a certain distance of a given point. Importantly, r-trees use a different partitioning scheme to k-d trees and ball trees; they divide the space into ‘rectangles’ rather than binary partitions. Tree implementations fall outside the scope of this article, and different implementations will lead to different search outputs. Query time Perhaps, the most significant disadvantage of the current search solution is that we must call an external OpenAI API to retrieve the query embedding vector. No matter how quickly our algorithm is able to find the nearest neighbors, a sequential blocking step will be required. Text-search-babbage-query-001
Number of dimensions: 2048
Number of queries: 100
Average duration: 225ms
Median duration: 207ms
Max duration: 1301ms
Min duration: 176ms Text-search-ada-query-002
Number of dimensions: 1536
Number of queries: 100
Average duration: 264ms
Median duration: 250ms
Max duration: 859ms
Min duration: 215ms Text-search-davinci-query-001
Number of dimensions: 12288
Number of queries: 100
Average duration: 379ms
Median duration: 364ms
Max duration: 1161ms
Min duration: 271ms If we take the median as a referencing point, we can see that ada-002 is +28% slower and davinci-001 is +76% slower. Limitations of GPT-3 Search Embedding Referring to , we may conclude that GPT-3 does not provide an exceptional performance or output quality and requires dependence on external API that is rather slow. GPT-3 has the capability to support inputs of up to 4096 tokens (approximately 3072 words), however, there is no truncation service available through the API and attempting to encode text longer than 4096 tokens will result in an error. Thus, it is the responsibility of the user — you — to determine how much text can actually be encoded. Nils Reimer’s article about dense text embeddings model comparison Also, training costs are relatively high with OpenAI. Alternatively, you may consider trying or . TAS-B multi-qa-mpnet-base-dot-v1 Approximate nearest neighbor search in Elasticsearch Elasticsearch 8.0 supports efficient approximate nearest neighbor search (ANN) that may be used to solve our problem magnitudes faster than any linear KNN could. Elasticsearch 8.0 utilizes an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW) to organize vectors into graphs based on similarity. Tested on a dataset of 10 million embedded vectors, we achieved an impressive performance of 200 queries per second with ANN on a single compute-focused machine, while only 2 queries per second using KNN. Both were provided by Elasticsearch. According to ElasticSearch’s and the , the cost of this performance is 5% of recall. This means that around 1 in 10 of the closest vectors found are not genuinely relative. We applied a hybrid approach to improve this result: Request k+l instead of k and apply filtering to remove outliers. Sadly, there are no pagination capabilities. So, this approach will only work in some use cases. blog post about the ANN release benchmark for dense vector searches As you will have hopefully seen, GPT-3 Embeddings is not the perfect solution for any and all search problems due to its indexing complexity, cost, as well as the high computational complexity of the search operation, even of approximates. Nevertheless, GPT-3 Embeddings remains an excellent choice for anyone looking for a powerful backbone for a search solution with infrequent querying and modest indexing requirements. Moreover, it’s also worth adding that recently that the Bing search engine now uses the new upgraded version of GPT 3.5, called ‘Prometheus’ and developed initially for search. According to the announcement, the new Prometheus language model allows Bing to increase relevance, more accurately annotate snippets, provide fresher results, understand geolocation and improve security. This may open up new possibilities for using the autoregressive language model for search solutions, which we will definitely keep an eye on going forward. Microsoft announced References: New and Improved Embedding Model k nearest neighbors computational complexity scipy.spatial.KDTree k-d tree OpenAI GPT-3 Text Embeddings — Really a new state-of-the-art in dense text embeddings? Introducing approximate nearest neighbor search in Elasticsearch 8.0 About the author Oleksandr Knyga is a Solution Architect and open-source believer with a big passion for Software Development. Oleksandr lives the life of a digital nomad. When not tapping on the keyboard, he can be found hiking, snowboarding, fishing, and exploring the city. Oleksandr has an MBA and a Computer Science Master’s degree. Also published here.

This story contains new, firsthand information uncovered by the writer.

How to Manage Your SOC 2 Certification and Be Successful

Harnessing GPT-3 for Improved Search Solutions

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Celebrating 10 years of WebLab Technology: Our Story of Growing Through Dedicated Teams

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

Celebrating 10 years of WebLab Technology: Our Story of Growing Through Dedicated Teams

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps