Nowadays, semantic search is getting quite popular. If you are familiar with data science and NLP, you might have read about how you can apply semantic search to a set of textual data. Usually, besides building a model and an application yourself, you also need to find or build an efficient storage and retrieval mechanism. When building an e-commerce website, you are not going to build your own database; you simply use something like PostgreSQL or MongoDB. So why build your own data storage for machine learning?
Machine learning data demands efficient and scalable storage. Vector databases are an excellent solution for storing big data with context. While well-known, user-friendly databases and search engines like Solr, MongoDB, or MySQL are excellent for traditional data storage and retrieval, vector databases like Weaviate store big data along with machine-learned context. The open-source vector search engine Weaviate is an easy-to-use end-to-end solution, with machine learning capabilities out of the box.
In this article, you will learn what a vector search engine is and how you can use Weaviate with your own data in 5 minutes.
If you want to know what color the wine Pinot Noir is, you might do a quick Google search for “What color is Pinot Noir?”. In less than a second, Google tells you the (most probable) answer (see image below). Nowadays we are quite used to the power of Google Search; we are sometimes even disappointed if Google Search does not find a correct answer.
But have you ever thought about how powerful this search actually is? The search entry “What color is Pinot Noir?” is quite abstract. We didn’t specify that Pinot Noir is a wine; Google Search derived that from its known context. Then, Google Search finds the exact answer to the question, from a website on the internet. How does Google Search find exactly this data node in its graph with billions of indexed webpages? And how can it predict the relation between the query and the data node? Those are questions that we usually don’t need to think about because we’re happy that open search engines like Google Search, Bing, DuckDuckGo, or equivalents take care of that for us.
There is, however, one case that open search engines can’t help you with: searching through your own data. Imagine you could search through your data in the same way you are used to with open data. Searching through unstructured data is difficult, and usually takes a lot of human effort. An open-source vector search engine like Weaviate can solve this for you.
In this article, I’ll show you how to easily set up Weaviate with your own data. The cool thing is, I only use data and software that are openly available: the open-source vector search engine Weaviate, the open transformer model sentence-transformers/msmarco-distilroberta-base-v2 that I connect to Weaviate as a vectorization module, and a free sample dataset of wine reviews from Kaggle.
Weaviate is an open-source vector search engine. Unlike traditional databases, Weaviate stores data as vectors in a high-dimensional space. The position (the vector) of each data object is automatically determined by a vectorization module, which is a pre-trained machine learning model. Here, I’m demonstrating how the popular transformer model sentence-transformers/msmarco-distilroberta-base-v2 (which is an excellent model for asymmetric search) can be used to let Weaviate understand your data. If you want to learn more about Weaviate and its transformers module, check the documentation. Using an out-of-the-box module with Weaviate, you don’t have to worry about how your data is stored with context, how relations between concepts are made, or how search results are found. The only thing you need to do is tell Weaviate what your data looks like and import your data, and you’re ready to fire away search queries!
The dataset I’m using in this example is a set of wine reviews. For 2500 wines, I’m going to store the name, along with its review (from WineEnthusiast). (Weaviate is able to handle large amounts of data, but for the sake of this tutorial, 2500 data objects are sufficient.)
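To make the idea of "data stored as vectors" a bit more concrete, here is a minimal sketch of what a nearest-vector lookup does. The three-dimensional vectors and the object names are made up for illustration (a real transformer model produces vectors with hundreds of dimensions), and the brute-force ranking below is only a stand-in for the indexing Weaviate does internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, objects, limit=3):
    """Return the names of the `limit` objects most similar to the query vector."""
    ranked = sorted(
        objects.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:limit]]


# Hypothetical vectors, invented for this illustration only.
toy_vectors = {
    "crisp white wine": [0.9, 0.1, 0.2],
    "bold red wine": [0.1, 0.9, 0.2],
    "sparkling wine": [0.5, 0.5, 0.6],
}
query = [0.8, 0.2, 0.1]  # stands in for the vector of a search query

print(nearest(query, toy_vectors, limit=2))
```

The object whose vector points in almost the same direction as the query comes first, even though no keywords are compared at any point.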
Use the configuration tool to generate a docker-compose configuration file. Retrieve the file with the generated curl command, which should be similar to:
curl -o docker-compose.yml "https://configuration.semi.technology/v2/docker-compose/docker-compose.yml?enterprise_usage_collector=false&media_type=text&qna_module=false&runtime=docker-compose&text_module=text2vec-transformers&transformers_model=sentence-transformers-msmarco-distilroberta-base-v2&weaviate_version=v1.7.2&ner_module=false&spellcheck_module=false&gpu_support=false"
The configuration file looks like this:
---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.7.2
    ports:
    - 8080:8080
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-msmarco-distilroberta-base-v2
    environment:
      ENABLE_CUDA: '0'
...
For this small demo dataset, we don’t need to worry about the model working better on CUDA-enabled hardware. But if you want to achieve maximum performance for your model inference, learn how to set up Weaviate with GPU-enabled hardware here.
Once you have stored this configuration file, you can start Weaviate in the same folder with:
$ docker-compose up -d
Now, Weaviate is running on http://localhost:8080 and is ready to be used. You can check this by running
curl http://localhost:8080/v1/meta
from the command line, which should return meta information about the Weaviate instance.
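If you would rather do this check from Python, a small helper along the following lines can pull the interesting bits out of the meta response. This is only a sketch: the sample body is hypothetical, and the assumption that the response carries a "version" string and a "modules" object keyed by module name should be verified against your own instance.

```python
import json


def summarize_meta(meta):
    """Pick the version and the enabled module names out of a /v1/meta response."""
    return {
        "version": meta.get("version"),
        "modules": sorted(meta.get("modules", {}).keys()),
    }


# Hypothetical response body; a real instance returns more detail.
sample_body = (
    '{"hostname": "http://[::]:8080", "version": "1.7.2", '
    '"modules": {"text2vec-transformers": {}}}'
)
print(summarize_meta(json.loads(sample_body)))

# With a running instance, the body could be fetched instead, for example:
#   import urllib.request
#   body = urllib.request.urlopen("http://localhost:8080/v1/meta").read()
#   print(summarize_meta(json.loads(body)))
```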
Next, let’s upload a data schema. This defines the structure of the data that we’ll be adding later on. In our case, we have one class, “Wine”, with the properties “title” and “description”. Upload the schema in JSON format to http://localhost:8080/v1/schema in Python using the Weaviate client:
import weaviate

# connect to the local Weaviate instance
client = weaviate.Client("http://localhost:8080")

# one class, "Wine", with two text properties
class_obj = {
    "class": "Wine",
    "properties": [
        {
            "name": "title",
            "dataType": ["text"]
        },
        {
            "name": "description",
            "dataType": ["text"]
        }
    ]
}

# upload the class definition to the schema
client.schema.create_class(class_obj)
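Running the schema upload twice will typically fail because the class already exists. One way around that is to look at the current schema first. The helper below is a sketch that works on a plain dict; it assumes the schema returned by the Python client's client.schema.get() has a top-level "classes" list, and the example schema here is written by hand rather than fetched.

```python
def class_exists(schema, class_name):
    """Return True if the schema dict lists a class with this name."""
    return any(c.get("class") == class_name for c in schema.get("classes", []))


# Hand-written example of the assumed schema shape.
example_schema = {
    "classes": [
        {
            "class": "Wine",
            "properties": [
                {"name": "title", "dataType": ["text"]},
                {"name": "description", "dataType": ["text"]},
            ],
        }
    ]
}

print(class_exists(example_schema, "Wine"))
print(class_exists(example_schema, "Beer"))

# With a running instance, this could guard the upload:
#   if not class_exists(client.schema.get(), "Wine"):
#       client.schema.create_class(class_obj)
```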
The next step is uploading the actual wine data to Weaviate. Here, you can find the data and the script to import the data objects. We upload a total of 2500 wines with their titles and descriptions. This is done with Python using the Weaviate client. A simplified version of the import code, which you can use directly, is shown below. Make sure you download the wine dataset from here and store it as a csv file.

import pandas as pd
import weaviate

# initiate the Weaviate client
client = weaviate.Client("http://localhost:8080")

# open wine dataset (10000 items)
df = pd.read_csv('wine_reviews.csv', index_col=0)

def add_wines(data, batch_size=99):
    no_items_in_batch = 0
    for index, row in data.iterrows():
        wine_object = {
            "title": row["title"] + '.',
            "description": row["description"],
        }
        # queue the object; it is sent once the batch is full
        client.batch.add_data_object(wine_object, "Wine")
        no_items_in_batch += 1
        if no_items_in_batch >= batch_size:
            client.batch.create_objects()
            no_items_in_batch = 0
    # send whatever is left in the last, partially filled batch
    if no_items_in_batch > 0:
        client.batch.create_objects()

# import only the first 2500 wines to Weaviate
add_wines(df.head(2500))
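The batching logic in the import script can be looked at in isolation. The sketch below uses a hypothetical helper, chunked, that is not part of the Weaviate client; it only shows how 2500 items split into batches of 99, and why a final flush after the loop is needed.

```python
def chunked(items, batch_size):
    """Yield lists of at most `batch_size` items, including a final partial list."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        # the leftover items that never filled a complete batch
        yield batch


batch_sizes = [len(batch) for batch in chunked(range(2500), 99)]
print(len(batch_sizes), batch_sizes[-1])  # 26 batches, the last one holds 25 items
```

Without the last flush, the 25 wines in the final partial batch would never be sent.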
If all objects are uploaded successfully, you can start querying the data. With GraphQL, we can create both simple and more complex queries. Use the Weaviate Console to query Weaviate in a clear user interface. Connect to http://localhost:8080, and open the query module on the left.
Let’s say we’re eating fish tonight, and we want to find out which wines fit well with this dish. We can do a nearText search for the concept “wine that fits with fish”. We can limit the result to 3, and we see that three white wines appear (see the figure below).

Now, let’s take a look at the individual wines. The titles of the wines don’t reveal any connection to fish dishes, at least from my point of view as a wine amateur. But if we look closely at the descriptions, we see a connection with the search query. The first result mentions that the wine is “a good wine for […] pairing with poultry and seafood dinners”. The exact words of our fuzzy search query do not appear literally in the description of this data object. However, it is returned as the first result (meaning that this wine has the highest certainty for this search query compared to all other wines in the dataset). This is because there is a semantic relation between the query and the data object. Amazing, right?
In more detail, the vector position in the high dimensional space of this data object lies closest to the vector position of the search query, compared to all other data objects. We see this also with the second and third results. The results mention concepts like “pair”, “seafood” and “clams”.
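The nearText search described above can be written down as a GraphQL query string. The helper below, near_text_query, is a hypothetical convenience function and not part of any library; it only assembles the text of the query, which you could paste into the console or, assuming the Python client's client.query.raw is available in your version, send from a script.

```python
import json


def near_text_query(class_name, properties, concepts, limit=3):
    """Build a GraphQL Get query with a nearText filter as a single-line string."""
    concepts_gql = json.dumps(concepts)  # renders a quoted list of strings
    fields = " ".join(properties)
    return "{ Get { %s(nearText: {concepts: %s}, limit: %d) { %s } } }" % (
        class_name,
        concepts_gql,
        limit,
        fields,
    )


query = near_text_query("Wine", ["title", "description"], ["wine that fits with fish"])
print(query)

# With a running instance, the query could be sent like this:
#   result = client.query.raw(query)
```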
We could take a step further and drill down our search. Given that all data objects are stored as vectors in a high-dimensional space, we can literally move our search query toward or away from concepts. For example, let’s say we know from experience that we don’t like Chardonnay. We can move our search query away from Chardonnay by adding the argument moveAwayFrom with the concept “Chardonnay” to the query. Now, the returned wines are still related to our fish dinner but are not Chardonnay anymore.

Now let’s add the argument moveTo with the concept “Italy” because we prefer to have an Italian wine. If we check the new results in Figure 4, we see that all three wines are now also from Italy! And if you look closely, the term “Italy” is not even mentioned in all results! But Weaviate (or rather, the transformer model used in the vectorization module) understands that these wines are related to Italy, fit well with fish, and are not Chardonnay.

In this article, you have learned about semantic search and vector storage. The step-by-step guide shows how you can set up your own vector database with your own data and Weaviate. With out-of-the-box machine learning models, you don’t have to worry about how the data is stored. Semantic search through your own data may open many opportunities!
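As a reference for the last search step, the moveAwayFrom and moveTo arguments from the walkthrough can be sketched as a query string as well. The helper names are hypothetical, and the force value of 0.5 is an assumption picked for illustration; the values used for the figures in this article are not reproduced here.

```python
import json


def near_text_argument(concepts, move_to=None, move_away_from=None, force=0.5):
    """Build the nearText argument, optionally steering toward or away from concepts."""
    parts = ["concepts: %s" % json.dumps(concepts)]
    if move_away_from:
        parts.append(
            "moveAwayFrom: {concepts: %s, force: %s}" % (json.dumps(move_away_from), force)
        )
    if move_to:
        parts.append("moveTo: {concepts: %s, force: %s}" % (json.dumps(move_to), force))
    return "nearText: {%s}" % ", ".join(parts)


def wine_query(near_text, limit=3):
    """Wrap a nearText argument in a Get query for the Wine class."""
    return "{ Get { Wine(%s, limit: %d) { title description } } }" % (near_text, limit)


argument = near_text_argument(
    ["wine that fits with fish"],
    move_to=["Italy"],
    move_away_from=["Chardonnay"],
)
print(wine_query(argument))
```

A higher force pushes the query vector further toward or away from the given concept, so it is worth trying a few values to see how the results shift.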