paint-brush
Using a Relational Database to Query Unstructured Databy@chu
2,212 reads
2,212 reads

Using a Relational Database to Query Unstructured Data

by Nelson ChuNovember 24th, 2022
Read on Terminal Reader
Read this story w/o Javascript
tldt arrow

Too Long; Didn't Read

80% of the world’s data today are unstructured data, and it includes raw text, images, audio, video, and a variety of other file types. Superinsight empowers anyone with a database background to run queries on top of unstructured data.
featured image - Using a Relational Database to Query Unstructured Data
Nelson Chu HackerNoon profile picture


Relational databases have been the foundation of some of the world’s most critical applications for the last fifty years. While there are other databases available, such as Key Value, Document, and Graph databases, relational databases remain the most popular. The main reason for this is that SQL, or relational database language, is easy for beginners to learn and effective for experts to implement.


However, as we enter a new era of artificial intelligence, storing structured data is no longer sufficient. Today 80% of the world’s data today is unstructured data, and it includes raw text, images, audio, video, and a variety of other file types. These unstructured data contain a large amount of critical information that we cannot retrieve using the current relational database system. The good news is that recent advances in machine learning allow us to comprehend unstructured data in ways never before possible, and we can capitalize on this opportunity by making this technology more accessible.

Relational database for unstructured data

In this article, I will introduce Superinsight, a relational database with built-in machine-learning capabilities. Superinsight empowers anyone with an SQL background to run a query on top of unstructured data.


The core concept for Superinsight are the following

  • Superinsight is built on top of PostgreSQL, so it has all of PostgreSQL’s features as well as machine learning capabilities.
  • Superinsight automatically analyzes and indexes any data stored in columns with data type [TEXT].
  • Raw text data is stored as values, while file-based data such as images are stored as a reference to the actual location.


Installation

Superinsight can be run as a standalone machine or as part of an orchestration framework like Kubernetes. The standalone version will be used for the purposes of this demonstration.

Docker is the most convenient way to install Superinsight. Before running the command, remember to change the username and password.


docker run \
--name superinsight-db-standalone \
-p 5432:5432 \
-e SUPERINSIGHT_USER=admin \
-e SUPERINSIGHT_PASSWORD=password \
-e ENV_IMAGE_TO_LABEL=True \
superinsight/superinsight-db-standalone:latest

For this demonstration, we are going to set the flag ENV_IMAGE_TO_LABEL=True, this option allows us to search on images base on text.

Connecting

Once the docker container is running, you can connect to Superinsight using the PostgreSQL protocol and the default database Superinsight on Port 5432.

postgres://admin:[email protected]:5432/superinsight

Storing and Searching Unstructured Data

Superinsight can store both structured and unstructured data using standard SQL. Superinsight will only analyze unstructured data if it is stored in a column with the data type [TEXT]. In the following example, we will use the Flipkart Products Dataset.

Create Table

The standard CREATE TABLE statement can be used to create a table. Because we only want Superinsight to analyze the columns under name, image, and description as unstructured data, we use the data type [TEXT] for these columns. Also, each table must have a primary key; this is a requirement for Superinsight.


CREATE TABLE mldb.product (
    _id serial PRIMARY KEY,
    category varchar(500),
    brand varchar(500),
    name TEXT,
    image TEXT,
    description TEXT,
    price float8,
    url varchar(500)
);


Insert Records

Standard SQL Insert statements can be used to insert records. The values for the column names and descriptions are stored in the database, but for images, we store the image URL. Superinsight will automatically read and index the images based on each image URL. In this example, we’ll add 20 records to our database. SQL script can be downloaded from here.

Semantic Search for Raw Text

We will create a semantic search query using the JOIN statement between our table and the prebuilt model. The keyword ON indicates which column is used as input for the semantic search model. The WHERE keyword is used to specify the value to compare to. Finally, we can use predictions.score as ORDER BY to sort the most relevant results.


Note: Superinsight must download the latest semantic search model to your database before our first search. Depending on internet bandwidth, this may take 1–2 minutes.


SELECT predictions.score, mldb.product.image, mldb.product.name, mldb.product.description
FROM mldb.product
JOIN model.semantic_search
ON model.semantic_search.inputs = mldb.product.description
WHERE model.semantic_search.similar = 'Something to sit on'
ORDER BY predictions.score DESC

Our semantic search for “Something to sit on” yielded four records of sofa beds with scores greater than 0.11. The most significant distinction between Superinsight and other relational databases is that the results are scored by machine learning models that understand the context of the data, resulting in much richer search results.

Semantic Search for Image To Image

Image searching is just like searching for raw text. All that needs to be changed is the WHERE clause value and the ON clause condition. We will use an image url as the search value to find records based on image similarity.


SELECT predictions.score, mldb.product.image, mldb.product.name, mldb.product.description
FROM mldb.product
JOIN model.semantic_search
ON model.semantic_search.inputs = mldb.product.image
WHERE model.semantic_search.similar = 'http://img5a.flixcart.com/image/short/u/4/a/altht-3p-21-alisha-38-original-imaeh2d5vm5zbtgg.jpeg'
ORDER BY predictions.score DESC

Notice the first image has a score of 1 because it is exactly the same image we are looking for. We can see the following five images are very similar so they have scores higher than 0.87.

Semantic Search for Text To Image

We can also search for images by describing the context of the image with text. In the following example, we’re looking for “woman dress” images.


SELECT predictions.score, mldb.product.image, mldb.product.name, mldb.product.description
FROM mldb.product
JOIN model.semantic_search
ON model.semantic_search.inputs = mldb.product.image
WHERE model.semantic_search.similar = 'woman dress'
ORDER BY predictions.score DESC



Conclusion

Superinsight makes it simple to store and search unstructured data. Since it is based on PostgreSQL, any existing applications and data visualization tools that support the PostgreSQL protocol can now use Superinsight. For more information on Superinsight, visit the links below.