Nowadays, semantic search is getting quite popular. If you are familiar with data science and NLP, you might have read about how you can apply semantic search to a set of textual data. Usually, besides building a model and application yourself, you also need to find or build an efficient storage and retrieval mechanism. When building an e-commerce website, you are not going to build your own database, but simply use something like PostgreSQL or MongoDB. So why bother about data storage in machine learning practice?

Machine learning data demands efficient and scalable storage. Vector databases are an excellent solution for storing big data with context. While well-known, user-friendly, and efficient traditional databases and search engines like Solr, MongoDB, or MySQL are excellent for data storage and retrieval, vector databases like Weaviate enable big data storage along with machine-learned context. The open-source vector search engine Weaviate is an easy-to-use end-to-end solution, with machine learning capabilities out of the box.

In this article, you will learn what a vector search engine is and how you can use Weaviate with your own data in 5 minutes.

What is a vector search engine?

If you want to know what the color of the wine Pinot Noir is, you might do a quick Google search for “What color is Pinot Noir?”. Within less than a second, Google tells you the (most probable) answer. Nowadays we are quite used to the power of Google Search; we are sometimes even disappointed if Google Search does not find a correct answer.

But have you ever thought about how powerful this search actually is? The search entry is quite abstract. We didn’t specify that Pinot Noir is a wine; Google Search derived that from its known context. Then, Google Search finds the exact answer to the question, from a website on the internet. How does Google Search find exactly this data node in its graph with billions of indexed webpages?
And how can it predict the relation between the query and the data node? Those are questions that we usually don’t need to think about, because we’re happy that open search engines like Google Search, Bing, DuckDuckGo, or equivalents take care of that for us.

There is, however, a case popping up that open search engines can’t help you with: searching through your own data. Imagine you could search through your data in a similar fashion as you are used to with open data. Searching through unstructured data is difficult, and usually takes a lot of human effort. An open-source vector search engine like Weaviate can solve this for you.

In this article, I’ll show you how to easily set up Weaviate with your own data. The cool thing is, I only use data and software that is openly available: the vector search engine Weaviate, which is open-source; the open transformer model sentence-transformers/msmarco-distilroberta-base-v2, which I connect to Weaviate as a vectorization module; and a free sample dataset of wine reviews from Kaggle.

How to use a semantic search engine with your own data?

Weaviate is an open-source vector search engine. Unlike traditional databases, Weaviate stores data as vectors in a high-dimensional space. The position (the vector) of an individual data object is automatically determined by a vectorization module, which is a pre-trained machine learning model. Here, I’m demonstrating how the popular transformers module sentence-transformers/msmarco-distilroberta-base-v2 (which is an excellent model for asymmetric search) can be used to let Weaviate understand your data. If you want to learn more about Weaviate and its transformers module, check the documentation. Using an out-of-the-box module with Weaviate, you don’t have to worry about how your data is stored with context, how relations between concepts are made, or how search results are found. The only thing you need to do is tell Weaviate what your data looks like and import your data, and you’re ready to fire away search queries!
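To build some intuition for what “closest in vector space” means, here is a minimal, self-contained sketch in plain Python. The tiny 3-dimensional vectors are made up for illustration; a real vectorization module like msmarco-distilroberta produces 768-dimensional embeddings, but the nearest-vector idea is the same:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 means more similar.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# Toy "embeddings", invented for this sketch; in Weaviate the vectorization
# module computes these for every imported data object automatically.
documents = {
    "pinot noir": [0.9, 0.1, 0.2],
    "chardonnay": [0.8, 0.3, 0.1],
    "bicycle":    [0.1, 0.9, 0.7],
}
query = [0.85, 0.15, 0.2]  # pretend this is the vector for "red wine"

# Vector search = return the document whose vector lies closest to the query.
best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)  # both wine vectors score far higher than "bicycle"
```

A production engine does not compare the query against every vector like this brute-force loop does; it uses an approximate nearest-neighbor index so that search stays fast at millions of objects.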
The dataset I’m using in this example is a set of wine reviews. For 2500 wines, I’m going to store the name, along with its review (from WineEnthusiast). (Weaviate is able to handle big amounts of data, but for the sake of this tutorial, 2500 data objects are sufficient.)

Step 1 — Start Weaviate

Use the configuration tool to generate a docker-compose configuration file. Retrieve the file with the generated curl command, which should be similar to:

curl -o docker-compose.yml "https://configuration.semi.technology/v2/docker-compose/docker-compose.yml?enterprise_usage_collector=false&media_type=text&qna_module=false&runtime=docker-compose&text_module=text2vec-transformers&transformers_model=sentence-transformers-msmarco-distilroberta-base-v2&weaviate_version=v1.7.2&ner_module=false&spellcheck_module=false&gpu_support=false"

The configuration file looks like this:

---
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.7.2
    ports:
    - 8080:8080
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
  t2v-transformers:
    image: semitechnologies/transformers-inference:sentence-transformers-msmarco-distilroberta-base-v2
    environment:
      ENABLE_CUDA: '0'
...

For this small demo dataset, we don’t need to worry about the model working better on CUDA-enabled hardware. But if you want to achieve maximum performance for your model inference, learn how to set up Weaviate with GPU-enabled hardware here.
Once you have stored this configuration file, you can start Weaviate in the same folder with:

$ docker-compose up -d

Now, Weaviate is running on http://localhost:8080 and is ready to be used. You can check this by running curl http://localhost:8080/v1/meta from the command line, which should return meta information about the Weaviate instance.

Step 2 — Create a data schema

Next, let’s upload a data schema. This defines the structure of the data that we’ll be adding later on. In our case, we have one class “Wine”, with a “title” and a “description”. Upload the schema in JSON format to http://localhost:8080/v1/schema in Python using the Weaviate client:

import weaviate

client = weaviate.Client("http://localhost:8080")

class_obj = {
    "class": "Wine",
    "properties": [
        {
            "name": "title",
            "dataType": ["text"]
        },
        {
            "name": "description",
            "dataType": ["text"]
        }
    ]
}

client.schema.create_class(class_obj)

Step 3 — Upload data

The next step is uploading the actual wine data to Weaviate. Here, you can find the data and the script to import the data objects. We upload a total of 2500 wines with their titles and descriptions. This is done with Python using the Weaviate client. A simplified version of the import code, which you can use directly, is shown below. Make sure you download the wine dataset from here and store it as a csv file.

import pandas as pd
import weaviate

# initiate the Weaviate client
client = weaviate.Client("http://localhost:8080")

# open wine dataset (10000 items)
df = pd.read_csv('wine_reviews.csv', index_col=0)

def add_wines(data, batch_size=99):
    no_items_in_batch = 0
    for index, row in data.iterrows():
        wine_object = {
            "title": row["title"] + '.',
            "description": row["description"],
        }
        client.batch.add_data_object(wine_object, "Wine")
        no_items_in_batch += 1
        if no_items_in_batch >= batch_size:
            client.batch.create_objects()
            no_items_in_batch = 0
    # flush any objects remaining in the last (partial) batch
    client.batch.create_objects()

# import only the first 2500 wines to Weaviate
add_wines(df.head(2500))
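To see the batching arithmetic from the import script in isolation, here is a small runnable sketch that swaps the real client for a hypothetical stand-in (the FakeBatch class below is mine, not part of the Weaviate client). With 250 objects and a batch size of 99, the loop flushes two full batches and the final call flushes the remaining 52:

```python
class FakeBatch:
    """Hypothetical stand-in for client.batch, used only to trace the flushes."""
    def __init__(self):
        self.pending = []
        self.flushes = []

    def add_data_object(self, obj, class_name):
        # The real client queues the object locally, just like this.
        self.pending.append((class_name, obj))

    def create_objects(self):
        # The real client sends the queued objects to Weaviate in one request.
        self.flushes.append(len(self.pending))
        self.pending = []

batch = FakeBatch()
batch_size = 99
no_items_in_batch = 0
for i in range(250):  # pretend we have 250 wines
    batch.add_data_object({"title": f"Wine {i}."}, "Wine")
    no_items_in_batch += 1
    if no_items_in_batch >= batch_size:
        batch.create_objects()
        no_items_in_batch = 0
batch.create_objects()  # flush the remainder

print(batch.flushes)  # prints [99, 99, 52]
```

Batching matters because each flush is one HTTP request: importing 2500 objects one by one would mean 2500 round-trips, while batches of 99 need only 26.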
"description" "description" "Wine" 1 if 0 import 2500 2500 Semantic queries If all objects are uploaded successfully, you can start querying the data. With GraphQL we can create easy and more complex queries. Use the to query Weaviate in a clear user interface. Connect to , and open the query module on the left. Weaviate Console http://localhost:8080 Let’s say we’re eating fish tonight, and we want to find out which wines fit well with this dish. We can do a search for the concept . We can limit the result to 3, and we see that three white wines appear, see the figure below. nearText “wine that fits with fish” Now, let’s take a look at the individual wines. The titles of the wines don’t reveal any connection to fish dishes from my point of view as a wine amateur. But if we look closely at the description, we see a connection between the search query. The first result mentions that the wine is “ ”. We see that the exact words in our fuzzy search query do not appear literally in the description of this data object. However, it is returned as the first result (meaning that this wine has the highest certainty to this search query compared to all other wines in the dataset). This is because there is a semantic relation between the query and the data object. Amazing, right? a good wine for […] pairing with poultry and seafood dinners In more detail, the vector position in the high dimensional space of this data object lies closest to the vector position of the search query, compared to all other data objects. We see this also with the second and third results. The results mention concepts like “ “ ” and “ ”. pair”, seafood clams We could take a step further and drill down our search. Given that all data objects are stored as vectors in a high-dimensional space, we could literally our search query or concepts. For example, let’s say we know from experience that we don’t like Chardonnay. 
We can move our search query away from Chardonnay by adding the moveAwayFrom argument with the concept “Chardonnay” to the query. Now, the returned wines are still related to our fish dinner, but are not Chardonnay anymore.

Now let’s add the moveTo argument with the concept “Italy”, because we prefer to have an Italian wine. If we check the new results in Figure 4, we see that all three wines are now also from Italy! And if you look closely, the term “Italy” is not even mentioned in all the results! But Weaviate (or rather, the transformer model used in the vectorization module) understands that these wines are related to Italy, fit well with fish, and are not Chardonnay.

Summary

In this article, you have learned about semantic search and vector storage. The step-by-step guide showed how you can set up your own vector database with your own data and Weaviate. With out-of-the-box machine learning models, you don’t have to worry about how the data is stored. Semantic search through your own data may open many opportunities!
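As a closing reference, the drilled-down query from the walkthrough, with both move operators added to the nearText filter, could look like the template below. The force values (0.85) are my own choice, not from the article; tune them to control how strongly the query vector is pulled toward or pushed away from each concept:

```python
# Template for the refined search: fish-friendly, pulled toward Italy,
# pushed away from Chardonnay. Paste into the Weaviate Console to try it.
refined_query = """
{
  Get {
    Wine(
      nearText: {
        concepts: ["wine that fits with fish"],
        moveTo: {concepts: ["Italy"], force: 0.85},
        moveAwayFrom: {concepts: ["Chardonnay"], force: 0.85}
      }
      limit: 3
    ) {
      title
      description
    }
  }
}
"""
print(refined_query)
```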