Hackernoon logoBuilding a k-NN Similarity Search Engine using Amazon Elasticsearch and SageMaker by@yi

Building a k-NN Similarity Search Engine using Amazon Elasticsearch and SageMaker

Author profile picture

@yiYi Ai

Amazon Elasticsearch Service recently added support for k-nearest neighbor search. It enables you to run high scale and low latency k-NN search across thousands of dimensions with the same ease as running any regular Elasticsearch query.
k-NN similarity search is powered by Open Distro for Elasticsearch, an Apache 2.0-licensed distribution of Elasticsearch.
In this post, I’ll show you how to build a scalable similarity questions search api using Amazon Sagemaker, Amazon Elasticsearch, Amazon Elastic File System (EFS) and Amazon ECS.
What we’ll cover in this example:
  1. Deploy and run a Sagemaker notebook instance in VPC.
  2. Mount EFS to notebook instance.
  3. Download Quora Question Pairs dataset, then map variable-length questions from dataset to fixed-length vectors using DistilBERT model.
  4. Create downstream task to reduce embedding dimensions and save sentence embedder to EFS.
  5. Transform questions text to vectors, and index all vectors to Elasticsearch.
  6. Deploy a containerized Flask rest api to ECS. 
The following diagram shows the architecture of above steps:

Deploying and running Sagemaker notebook instance in VPC

First, Let’s create a Sagemaker notebook instance which connects to Elasticsearch and make sure then are in the same VPC. 
To configure VPC options in Sagemaker console, within Network section of Create notebook instance page, set VPC network configuration details such as VPC subnet IDs and security group IDs:

Mounting EFS to notebook instance

We will do all the necessary sentence transforming steps in a SageMaker notebook (code found here).
Now, mounting EFS to model directory, for more details about EFS, please check AWS official document.
%%sh
mkdir model
sudo mount -t nfs \
    -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-xxxxxx.efs.ap-southeast-2.amazonaws.com:/ \
    ./model
NOTE:
fs-xxxxx.efs.ap-southeast-2.amazonaws.com
is DNS name of EFS.
EFS Mount targets and Sagemaker are in same VPC.

Mapping variable-length questions to fixed-length vectors using DistilBERT model

To run nearest neighbor search, we have to get sentence and tokens embeddings. We can use sentence-transformers which is a sentence embeddings using BERT / RoBERTa / DistilBERT / ALBERT / XLNet with PyTorch. It allows us to map sentences into fixed-length representations in just a few lines of code.
We will use light weight DistilBERT model to generate sentence embeddings in this example, please note that the number of hidden units of DistilBERT is 768. This dimension seems too big to Elasticsearch index, we can reduce the dimension to 256 by adding a dense layer after the pooling:
from sentence_transformers import models, losses, SentenceTransformer

word_embedding_model = models.DistilBERT('distilbert-base-uncased')

pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                            pooling_mode_mean_tokens=True,
                            pooling_mode_cls_token=False,
                            pooling_mode_max_tokens=False)
# reduce dim from 768 to 256
dense_model = models.Dense(in_features=768, out_features=256)
transformer = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
Next, Save sentence embedder to EFS mounted directory:
transformer.save("model/transformer-v1/")
We need to make sure dataset has been downloaded, the data set for this example is quora question paris datas.

import pandas as pd
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
api.dataset_download_files("quora/question-pairs-dataset", path='quora_dataset', unzip=True)
Next, extract full text of each question to dataframe:
import pandas as pd

pd.set_option('display.max_colwidth', -1)
df = pd.read_csv("quora_dataset/questions.csv", usecols=["qid1", "question1"], index_col=False)
df = df.sample(frac=1).reset_index(drop=True)
df_questions_imp = df[:5000]
Transforming questions text to vectors, and index all vectors to Elasticsearch
To start with, create a kNN index,
import boto3
from requests_aws4auth import AWS4Auth
from elasticsearch import Elasticsearch, RequestsHttpConnection

region = 'ap-southeast-2'
service = 'es'
ssm = boto3.client('ssm', region_name=region)
es_parameter = ssm.get_parameter(Name='/KNNSearch/ESUrl')
es_host = es_parameter['Parameter']['Value']
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, service, session_token=credentials.token)
                   
es = Elasticsearch(
    hosts=[{'host': es_host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

knn_index = {
    "settings": {
        "index.knn": True
    },
    "mappings": {
        "properties": {
            "question_vector": {
                "type": "knn_vector",
                "dimension": 256
            }
        }
    }
}

es.indices.create(index="questions",body=knn_index,ignore=400)
then transform and index question vectors to Elasticsearch.
def es_import(df):
    for index, row in df.iterrows():
        vectors = local_transformer.encode([row["question1"]])
        es.index(index='questions',
                 id=row["qid1"], 
                 body={"question_vector": vectors[0].tolist(), 
                       "question": row["question1"]})
        
es_import(df_questions_imp)
Questions in Elasticsearch have the following structure:
{'question_vector': [-0.06435434520244598, ... ,0.0726890116930008],
'question': 'How hard is it to learn to play piano as an adult?'}
We have embedded questions into fixed-length vectors and indexed all of vectors to Elasticsearch. Let’s create a rest api connects to Elasticsearch and test it out!

Deploying a containerized Flask rest API

We will use sample cloud formation template to create ECS Cluster and service in VPC (templates and bash scripts found here).
We will use EFS volumes with ECS, search flow is 
  1. Flask application loads saved sentence embedder in EFS volume, 
  2. transform input parameter sentence to vectors, 
  3. then query K-Nearest neighbors in Elasticsearch.
import json
import boto3
from flask import Flask
from flask_restful import reqparse, Resource, Api
from elasticsearch import Elasticsearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
api = Api(app)
region = 'ap-southeast-2'
ssm = boto3.client('ssm', region_name=region)
es_parameter = ssm.get_parameter(Name='/KNNSearch/ESUrl')
host = es_parameter['Parameter']['Value']
service = 'es'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, service, session_token=credentials.token)

parser = reqparse.RequestParser()
parser.add_argument('question')
parser.add_argument('size')
parser.add_argument('min_score')

es = Elasticsearch(
    hosts=[{'host': host, 'port': 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection
)

transform_model = SentenceTransformer(
    'model/transformer-v1/')

class SimilarQuestionList(Resource):
    def post(self):
        args = parser.parse_args()
        sentence_embeddings = transform_model.encode([args
                                                      ["question"]])
        res = es.search(index="questions",
                        body={
                            "size": args.get("size", 5),
                            "_source": {
                                "exclude": ["question_vector"]
                            },
                            "min_score": args.get("min_score", 0.3),
                            "query": {
                                "knn": {
                                    "question_vector": {
                                        "vector": sentence_embeddings[0].tolist(),
                                        "k": args.get("size", 5)
                                    }
                                }
                            }
                        })
        return res, 201


api.add_resource(SimilarQuestionList, '/search')

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=8000)
We now have flask api running in ECS container, let’s use the basic search function to find similar questions when it comes to the query: “What is best way to make money online?”:
$curl --data 'question=What is best way to make money online?' --data 'size=5' --data 'min_score=0.3'  -X POST http://knn-s-publi-xxxx-207238135.ap-southeast-2.elb.amazonaws.com/search
Check out the result:
{
    "took": 10,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 15,
            "relation": "eq"
        },
        "max_score": 0.69955945,
        "hits": [
            {
                "_index": "questions",
                "_type": "_doc",
                "_id": "210905",
                "_score": 0.69955945,
                "_source": {
                    "question": "What is an easy way make money online?"
                }
            },
            {
                "_index": "questions",
                "_type": "_doc",
                "_id": "547612",
                "_score": 0.61820024,
                "_source": {
                    "question": "What is the best way to make passive income online?"
                }
            },
            {
                "_index": "questions",
                "_type": "_doc",
                "_id": "1891",
                "_score": 0.5624176,
                "_source": {
                    "question": "What are the easy ways to earn money online?"
                }
            },
            {
                "_index": "questions",
                "_type": "_doc",
                "_id": "197580",
                "_score": 0.46031988,
                "_source": {
                    "question": "What is the best way to download YouTube videos for free?"
                }
            },
            {
                "_index": "questions",
                "_type": "_doc",
                "_id": "359930",
                "_score": 0.45543614,
                "_source": {
                    "question": "What is the best way to get traffic on your website?"
                }
            }
        ]
    }
}
As you can see, the result was pretty amazing, you can also fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings for k-NN search.
Great! We have what we need! I hope you have found this article useful.
The complete scripts can be found in my GitHub repo.

Tags

The Noonification banner

Subscribe to get your daily round-up of top tech stories!