Transform Search in 5 Minutes: AI-Powered Hybrid Cloud Search

by Milavkumar Shah (@techleader), January 3rd, 2025

Too Long; Didn't Read

Hybrid Search combines keyword-based and AI-driven semantic search. It is a way to combine text and vector search with a single search engine. The steps to create and manage indices remain similar to those for traditional search.

Introduction

Intelligent search has become an essential part of information retrieval in the digital world. Classic keyword-based search engines generally do not account for the user's actual intent or context, leading to missed opportunities and lost revenue for online companies. Hybrid Search, which combines keyword-based and AI-driven semantic search, can enhance your existing search capabilities in just a few minutes.


This step-by-step guide shows you how to elevate your search functionality using an open-source project, OpenSearch. By following these steps, you can turn your search into a working AI-powered solution in about 5 minutes.


1. What is Hybrid Search and Why Does It Matter?

Hybrid Search combines:


  1. Keyword-based search: Classic Boolean and text-based approaches to searching (e.g., a search for "bag" returns results that contain the word "bag").


  2. Semantic or vector-based search: AI techniques that comprehend context (e.g., “backpack” also retrieves “knapsack”, “school bag” etc., as the search engine recognizes their similarity).


By combining these two paradigms, you offer users results that are both precise and context-aware. In e-commerce, this means presenting users with the exact items they are looking for even if they phrase the request differently. For content search, it ensures that the most contextually relevant documents appear first.
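
To make the idea concrete, here is a minimal sketch of how a keyword signal and a vector signal can be blended into one ranking. Everything in it is an illustrative assumption: the three-dimensional vectors are made up rather than produced by a real model, and the weighted sum in hybrid_score is a simple stand-in, not the formula OpenSearch uses internally.

```python
import math


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms that literally appear as words in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def hybrid_score(query, query_vec, text, text_vec, weight=0.5) -> float:
    """Blend the keyword and vector signals with a simple weighted sum."""
    return (weight * keyword_score(query, text)
            + (1 - weight) * cosine_similarity(query_vec, text_vec))


# Toy catalog: product name -> made-up 3-dimensional "embedding".
catalog = {
    "school bag": [0.9, 0.1, 0.0],
    "knapsack": [0.8, 0.2, 0.1],
    "leather wallet": [0.1, 0.9, 0.2],
}
query, query_vec = "bag", [1.0, 0.0, 0.0]

ranked = sorted(
    catalog,
    key=lambda name: hybrid_score(query, query_vec, name, catalog[name]),
    reverse=True,
)
print(ranked)
```

In this toy data, "knapsack" never contains the word "bag", so a keyword-only search would miss it; the vector signal is what lifts it above the unrelated wallet.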


2. Prerequisites

  1. Docker (recommended for a quick setup) or an existing environment to run OpenSearch.


  2. Some command-line experience and a client such as curl or Postman to test the requests.


  3. A dataset to index (you can try a simple JSON file or use the sample data provided in this article).


Note: If you prefer a cloud-based solution, you can skip the local Docker instructions and use a managed OpenSearch service from any major cloud provider. The steps to create and manage indices remain similar.

3. Set Up OpenSearch

Run OpenSearch locally using Docker. Copy and paste the following command into your terminal:


docker run -d --name opensearch -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "DISABLE_SECURITY_PLUGIN=true" opensearchproject/opensearch:2.9.0


Here’s what each part does:


-d --name opensearch: Runs the container in the background with the name “opensearch.”
-p 9200:9200 -p 9600:9600: Exposes ports for HTTP requests and performance monitoring.
-e "discovery.type=single-node": Tells OpenSearch to run as a single-node cluster for simplicity.
-e "DISABLE_SECURITY_PLUGIN=true": Turns off the security plugin so that the plain-HTTP requests in this guide work without TLS or credentials. Use this for local testing only.


Verify the container is running by visiting http://localhost:9200. You should see a JSON response with OpenSearch details.
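
If you would rather check from a script than a browser, the following sketch does the same verification with the Python standard library. The helper names describe_cluster and fetch_cluster_info are hypothetical, and the sketch assumes the node answers over plain HTTP on localhost:9200 and returns the usual cluster_name and version fields in its root response.

```python
import json
import urllib.request


def describe_cluster(info: dict) -> str:
    """Summarize the JSON document returned by the root endpoint."""
    version = info.get("version", {})
    return (f"{info.get('cluster_name', 'unknown')} running "
            f"{version.get('distribution', 'unknown')} "
            f"{version.get('number', 'unknown')}")


def fetch_cluster_info(base_url: str = "http://localhost:9200",
                       timeout: float = 5.0) -> dict:
    """GET the root endpoint and parse its JSON body."""
    with urllib.request.urlopen(base_url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        print(describe_cluster(fetch_cluster_info()))
    except (OSError, ValueError) as error:
        # Connection errors are OSError subclasses; bad JSON raises ValueError.
        print(f"OpenSearch is not reachable yet: {error}")
```

The container can take a little while to start, so a "not reachable yet" message right after docker run usually just means you should retry.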

4. Create a Hybrid Index

To enable AI-powered semantic search, you need to store vector embeddings along with traditional keyword fields in your documents. Let’s create an index called products. We’ll set it up with both text and vector fields to showcase a hybrid approach. See the OpenSearch documentation for reference.


Copy and paste this index mapping:


curl -X PUT "http://localhost:9200/products" \
     -H 'Content-Type: application/json' \
     -d '{
        "settings": {
          "index": {
            "number_of_shards": 1,
            "number_of_replicas": 0,
            "knn": true
          }
        },
        "mappings": {
          "properties": {
            "name": {
              "type": "text"
            },
            "description": {
              "type": "text"
            },
            "keywords": {
              "type": "keyword"
            },
            "embedding": {
              "type": "knn_vector",
              "dimension": 768
            }
          }
        }
      }'


Explanation

  • name and description: Standard text fields for keyword-based search.
  • keywords: A keyword field for filtering or exact matches (e.g., category tags).
  • embedding: A knn_vector field for storing AI-generated embeddings (OpenSearch's vector type; dense_vector with dims is the Elasticsearch equivalent and is not accepted here). Set dimension to 768 to match common BERT-style transformer models; it must equal the output size of the embedding model you use.
  • "knn": true: An index setting that enables the k-NN plugin for this index.
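
Because the vector size in the mapping has to agree with the embedding model, it can help to generate the request body from a single parameter. The sketch below is one way to do that; build_products_mapping is a hypothetical helper, and it assumes the k-NN plugin's knn_vector field type with its dimension parameter.

```python
import json


def build_products_mapping(dimension: int) -> dict:
    """Return an index body with text, keyword, and vector fields."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {
            "index": {
                "number_of_shards": 1,
                "number_of_replicas": 0,
                "knn": True,  # enable the k-NN plugin for this index
            }
        },
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "description": {"type": "text"},
                "keywords": {"type": "keyword"},
                # Must equal the output size of the embedding model.
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


body = build_products_mapping(768)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the PUT request to /products, either by pasting it into the curl command or with an HTTP client of your choice.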

5. Generate AI Embeddings and Index Documents

Next, we need to generate an embedding for each document: a numerical representation that reflects the meaning of the text. You can either use an open-source model (for example, one hosted on Hugging Face) or a service such as the OpenAI Embeddings API. Below is an example using the sentence-transformers library with a Hugging Face model. See the OpenSearch documentation for reference.


#!/usr/bin/env python3

import requests
from sentence_transformers import SentenceTransformer

# 1. Load a model from Hugging Face.
# all-mpnet-base-v2 outputs 768-dimensional vectors, matching the index mapping
# (all-MiniLM-L6-v2 outputs 384 dimensions and would not fit a 768-dimension field).

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

# 2. Sample documents to index

documents = [
  { "name": "Blue Backpack",
    "description": "A spacious bag suitable for school and travel",
    "keywords": ["bags", "backpack", "school"],
  },
  { "name": "Leather Handbag",
    "description": "Stylish handbag made from genuine leather",
    "keywords": ["fashion", "handbag", "leather"],
  }
]

# 3. Generate embeddings and index

for i, doc in enumerate(documents):
  # Embed the name and description together so both contribute to the meaning
  text_to_embed = doc["name"] + " " + doc["description"]
  embedding = model.encode(text_to_embed).tolist()  # Convert to list for JSON serialization

  payload = {
    "name": doc["name"],
    "description": doc["description"],
    "keywords": doc["keywords"],
    "embedding": embedding
  }

  # Send the document to OpenSearch, using the loop index as the document ID
  response = requests.post(
    "http://localhost:9200/products/_doc/" + str(i),
    json=payload
  )

  print(f"Indexed document {i}. Status code: {response.status_code}, Response: {response.json()}")


How to run the script:

1. Save the above code in a file named index_documents.py

2. Run pip install sentence-transformers requests if you haven’t already

3. Execute python index_documents.py


This script loads a transformer model that produces 768-dimensional embeddings, iterates over a couple of sample documents, combines the name and description of each into a single string, encodes that string into an embedding, and then sends each document along with its embedding to OpenSearch via HTTP POST.
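
Sending one HTTP request per document is fine for a couple of samples, but larger datasets are usually loaded through the _bulk endpoint, which takes newline-delimited JSON: one action line followed by one document line per item. The sketch below only builds that request body; build_bulk_body and its expected_dim check are hypothetical helpers, and the two-dimensional vectors are placeholders for real embeddings.

```python
import json


def build_bulk_body(documents, index_name="products", expected_dim=768) -> str:
    """Build a newline-delimited _bulk body from already-embedded documents."""
    lines = []
    for doc_id, doc in enumerate(documents):
        # Catch model/mapping mismatches before OpenSearch rejects the document.
        if len(doc["embedding"]) != expected_dim:
            raise ValueError(
                f"document {doc_id} has {len(doc['embedding'])} dimensions, "
                f"expected {expected_dim}"
            )
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


sample = [
    {"name": "Blue Backpack", "embedding": [0.1, 0.2]},
    {"name": "Leather Handbag", "embedding": [0.3, 0.4]},
]
bulk_body = build_bulk_body(sample, expected_dim=2)
print(bulk_body)
```

The resulting string would be posted to http://localhost:9200/_bulk with the Content-Type header set to application/x-ndjson.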


6. Run a Hybrid Search Query

Hybrid search pairs a standard keyword-based match with a vector-based similarity query. OpenSearch provides various query types; below is an example using the _search endpoint. See the OpenSearch documentation for reference.


curl -X POST "http://localhost:9200/products/_search" \
-H 'Content-Type: application/json' \
-d '{
  "size": 5,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name": "bag"
          }
        },
        {
          "match": {
            "description": "bag"
          }
        },
        {
          "script_score": {
            "query": {
              "match_all": {}
            },
            "script": {
              "source": "cosineSimilarity(params.queryVector, doc[params.field]) + 1.0",
              "params": {
                "field": "embedding",
                "queryVector": [ replace this text with the user query embedding array ]
              }
            }
          }
        }
      ]
    }
  }
}'


Considerations:

• script_score: Vector-based scoring using cosine similarity. Add 1.0 to avoid negative scores.

• match: Standard keyword-based queries on name and description

• bool -> should: We use “OR-like” logic such that documents matching any of the clauses are boosted
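
The query embedding has to be computed at search time with the same model used for indexing, so in practice the request body is usually assembled in code. Here is a minimal sketch; build_hybrid_query is a hypothetical helper, the three-dimensional vector is a placeholder, and passing the field name through params.field is an assumption that avoids nesting quotes inside the script string.

```python
import json


def build_hybrid_query(query_text: str, query_vector, size: int = 5) -> dict:
    """Combine two keyword match clauses with a cosine-similarity script score."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {"match": {"name": query_text}},
                    {"match": {"description": query_text}},
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "cosineSimilarity(params.queryVector, "
                                          "doc[params.field]) + 1.0",
                                "params": {
                                    "field": "embedding",
                                    # Plain floats keep the body JSON-serializable.
                                    "queryVector": [float(x) for x in query_vector],
                                },
                            },
                        }
                    },
                ]
            }
        },
    }


# Placeholder vector; a real one would come from the embedding model.
body = build_hybrid_query("bag", [0.1, 0.2, 0.3])
print(json.dumps(body, indent=2))
```

With sentence-transformers, the vector would typically come from model.encode(query_text), and the resulting dictionary can be posted to the _search endpoint as JSON.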


7. Optimizing and Scaling

  • Cloud Solutions: For automatic scaling, durability, and ease of use, you can run OpenSearch on AWS or another cloud provider's managed service, so that you don’t have to maintain your own infrastructure.
  • Monitoring: You can use OpenSearch Dashboards to visualize live analytics on search performance and user behavior.
  • Security: Enable TLS/SSL for secure communication and configure fine-grained access control to protect your data.
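
Once TLS and authentication are switched on, clients have to present credentials and verify the server certificate. The following sketch shows the client side with the Python standard library only; the host name, user name, and password are hypothetical placeholders, and it assumes the cluster accepts HTTP basic authentication.

```python
import base64
import ssl
import urllib.request


def build_secure_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Attach an HTTP basic-auth header to a request for a TLS-protected cluster."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# The default context verifies the server certificate and its host name.
tls_context = ssl.create_default_context()

# Hypothetical endpoint and credentials, for illustration only.
request = build_secure_request("https://search.example.com:9200", "search_user", "change-me")

# To send it against a real cluster:
# with urllib.request.urlopen(request, context=tls_context, timeout=5) as response:
#     print(response.status)
```

Keeping certificate verification enabled on the client is what makes the TLS setup meaningful; disabling it would leave the connection encrypted but unauthenticated.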


8. Conclusion

Creating a robust, smart search experience is easier than ever. With OpenSearch and cloud-based services, you can merge familiar keyword-based matching with AI-driven semantic matching, opening the door to improved relevancy, greater user satisfaction, and potentially higher conversions.


Ready to upgrade your search experience? Start with the simple commands above, then explore more advanced capabilities such as synonyms, re-ranking, and personalized suggestions. The potential is boundless once you unleash the power of hybrid AI search.


Happy Searching!