Vector Databases - Basics of Vector Search and Langchain Package in Python

Written by shyamganesh | Published 2023/09/30
Tech Story Tags: programming | machine-learning | vector-embedding | langchain | faiss | similarity-search | vector-database

TL;DR: In this article, I will walk you through the basics of vector databases, vector search, and the LangChain package in Python for storing and querying similar vectors. To begin our learning journey, we will start with a key concept named "Embeddings." Embeddings are a way to represent data to a machine in its own understandable format.

In this blog post, I will guide you through the fundamentals of vector databases, vector search, and the LangChain package in Python, which facilitates the storage and retrieval of comparable vectors.

To embark on our journey, we'll commence with a pivotal concept known as "Embeddings."

What Are Embeddings?

In this article, we will explore the concept of embeddings – a fundamental aspect of machine learning that allows us to represent textual data in a format that machines can readily comprehend.

Essentially, embeddings serve as vector representations of text, converting the intricate web of words, sentences, and passages into a numerical format that machines can process.

A natural question arises: can humans decipher these embeddings? The answer is no. Embeddings are essentially sequences of numbers that encapsulate textual information.

Consequently, understanding and working with an embedding vector can be a daunting task for humans, while machines handle it with ease.

There exists a myriad of techniques for generating embeddings from text. Notable methods include TF-IDF, Word2Vec, GloVe, and BERT, among others. In the contemporary realm of machine learning, sentence embeddings have gained popularity over word embeddings.

These sentence embeddings, representing the essence of a text, can be obtained from pre-trained sentence transformer models like "all-MiniLM-L6-v2" and "all-mpnet-base-v2."

These models produce embeddings with fixed dimensions for a given text: 384 dimensions for all-MiniLM-L6-v2 and 768 for all-mpnet-base-v2.
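As an illustration, here is a minimal sketch of generating sentence embeddings with the sentence-transformers package (the two example sentences are placeholders):

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model (384-dimensional embeddings)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["Vector databases store embeddings.",
             "Embeddings are numeric representations of text."]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)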

Now that we have explored the fundamentals of embeddings and various techniques for their generation, let's shift our focus to the challenge of storing these high-dimensional embedding vectors in a vector store.

Vector Stores

A vector data store is a specialized database designed to store high-dimensional embedding representations of diverse data types, including audio, video, images, text, and more.

One of its core functionalities is the ability to efficiently search for vectors within the store that closely resemble a given query vector.

Vector stores streamline the process of storing embeddings and conducting similarity searches among these vectors, simplifying the management and retrieval of high-dimensional data representations.

In the realm of vector embeddings, similarity search involves quantifying the proximity or relatedness between two or more embeddings. Typically, this is accomplished by calculating similarity using various distance metrics. Some commonly used and widely recognized distance metrics include the Euclidean distance, Manhattan distance, and cosine distance, among others. These metrics help us gauge the degree of similarity or dissimilarity between vectors, facilitating effective similarity searches within the data.

As an example, when determining the similarity or closeness between two points on a graph, the Euclidean distance is commonly employed. This distance can be computed using the following formula:

Euclidean Distance = [(x2 - x1)^2 + (y2 - y1)^2]^0.5

Here, (x1, y1) represents the coordinates of the first point, and (x2, y2) represents the coordinates of the second point on the graph. The result of this calculation provides a measure of the spatial separation between the two points, indicating their similarity or dissimilarity in terms of their positions within the graph.
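To make this concrete, here is the same calculation in Python with NumPy for two example points (the coordinates are arbitrary):

import numpy as np

p1 = np.array([1.0, 2.0])  # (x1, y1)
p2 = np.array([4.0, 6.0])  # (x2, y2)

# [(x2 - x1)^2 + (y2 - y1)^2]^0.5
euclidean_distance = np.linalg.norm(p2 - p1)
print(euclidean_distance)  # 5.0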

Similarly, in similarity search, embeddings are depicted within a high-dimensional space, with each record serving as a data point.

The line of code below demonstrates how to retrieve the top 5 similar embeddings for a given input data:

distances, indices = vector_store.search(input_embedding, k=5)

This code would return a list of the top 5 embeddings that closely resemble the input data, making it a valuable tool for various applications such as recommendation systems, content retrieval, and more.

The vector_store.search() function output comprises the top-k similar embeddings along with their respective distances from the input_embedding. The parameter "k" is a configurable value that determines the number of closest embeddings to retrieve in the similarity search. You can set "k" to control the desired number of results returned by the search operation.
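Under the hood, a naive version of this search can be written in a few lines of NumPy. The following is a brute-force sketch over toy random data (note that vector_store.search() above is a placeholder API, not a specific library call):

import numpy as np

store = np.random.rand(1000, 384)      # 1,000 stored embeddings
input_embedding = np.random.rand(384)  # the query embedding

# Euclidean distance from the query to every stored embedding
distances = np.linalg.norm(store - input_embedding, axis=1)

# Indices of the k=5 closest embeddings, nearest first
indices = np.argsort(distances)[:5]
print(indices, distances[indices])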

Now that we have established a foundational understanding of vector stores, let's proceed to delve into the LangChain package and FAISS. These tools will further enhance our ability to work with high-dimensional embeddings and perform efficient similarity searches.

LangChain in Python

Put simply, LangChain is a framework designed for building applications that harness the capabilities of Large Language Models (LLMs). For more detailed information and resources, you can refer to the official LangChain documentation.

Conversely, FAISS, which stands for "Facebook AI Similarity Search," is a Python library that provides developers with a rapid and efficient means of searching for similar embeddings, departing from the conventional hash-based approaches.

This library was developed by the Facebook AI team and offers powerful capabilities for similarity search tasks in high-dimensional spaces.

FAISS is an open-source library that can be hosted on your own server.
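Before wrapping FAISS with LangChain, it may help to see the library used directly. Here is a minimal sketch with random data, assuming the faiss-cpu package is installed:

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 search index

vectors = np.random.rand(1000, dim).astype("float32")
index.add(vectors)  # store the embeddings in the index

query = np.random.rand(1, dim).astype("float32")
distances, indices = index.search(query, 5)  # top-5 nearest embeddings
print(indices[0], distances[0])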

In our exploration of LangChain and FAISS, we will now aim to implement FAISS within the LangChain framework. Below are some key APIs from LangChain's FAISS integration that we'll be focusing on in this article:

  1. add_documents(): This function allows us to incorporate additional documents into the vector store.

  2. add_embeddings(): It enables the addition of more embeddings to the vector store.

  3. from_documents(): This API returns a VectorStore based on the provided documents.

  4. from_embeddings(): This function provides a FAISS index generated from the given embeddings.

  5. load_local(): Use this to load the FAISS index from disk.

  6. save_local(): It allows you to save a FAISS index to disk.

  7. similarity_search(): This function retrieves documents that are most similar to a given query.

  8. similarity_search_by_vector(): It retrieves documents that are most similar to a given embedding.

These APIs form the foundation for combining LangChain's capabilities with FAISS, enabling you to work with embeddings and perform efficient similarity searches within your applications.
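As a small taste of these APIs, here is a hedged sketch (the two example documents are made up) of building a store from Document objects and appending to it:

from langchain.docstore.document import Document
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import FAISS

embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Build a vector store from Document objects
docs = [Document(page_content="Vectors encode meaning.", metadata={"id": 1})]
store = FAISS.from_documents(docs, embedding_function)

# Append another document, then query the store
store.add_documents([Document(page_content="FAISS searches fast.", metadata={"id": 2})])
print(store.similarity_search("What do vectors encode?", k=1))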

Meta-filtering is a valuable technique used to refine the results of a search query in LangChain FAISS. There are typically two types of meta-filtering: pre-filtering and post-filtering. LangChain's FAISS integration supports post-filtering for search queries.

It's important to note that the concept of meta-filtering for search results is a feature available in the LangChain version of FAISS and is not present in the original implementation of FAISS. This post-filtering capability enhances the precision and relevance of search results, offering more tailored outcomes to users.

Armed with the knowledge of LangChain FAISS APIs, let's dive into the Python implementation of LangChain FAISS. This implementation will empower you to work with embeddings, perform similarity searches, and apply post-filtering techniques to fine-tune your search results within the LangChain framework.

Feel free to explore and leverage these capabilities to create applications that harness the power of Large Language Models and advanced similarity search with FAISS.

LangChain FAISS Python Demo

Dataset

In this instructional guide, we have chosen the Stanford dataset as the base dataset and added a few columns to demonstrate meta-filtering in LangChain FAISS.

We will start by installing and importing the required packages:

!pip install langchain
!pip install faiss-cpu
# sentence-transformers is required by SentenceTransformerEmbeddings below
!pip install sentence-transformers

from langchain.vectorstores import FAISS
import pandas as pd
import time

Now, we will import the dataset:

passage_data = pd.read_csv("/Users/shyam/Python_Programs/Text Similarity Codes/Standford_Edited.csv")
# Drop the stray index column created when the CSV was saved
passage_data.drop(columns=["Unnamed: 0"], inplace=True)
passage_data

Having the dataset in hand, we need to initialize an embedding function to convert the text data into vectors. We will use "sentence-transformers/all-MiniLM-L6-v2," which generates embeddings in 384 dimensions:

from langchain.embeddings import SentenceTransformerEmbeddings
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
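As a quick sanity check, you can embed a single string and confirm the dimensionality:

# Embed one query string; the result is a plain Python list of floats
vector = embedding_function.embed_query("What is a vector database?")
print(len(vector))  # 384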

Next, we will set the metadata information. In the metadata, we have added the document ID and document language for each embedding (content):

start = time.time()

# Build one metadata dict per row: document ID and document language
metadatas = []
for index, row in passage_data.iterrows():
    doc_meta = {
        "id": row['Id'],
        "language": row['Language']
    }
    metadatas.append(doc_meta)

That’s it… We will build our FAISS index now:

# Named faiss_index to avoid shadowing the faiss library module
faiss_index = FAISS.from_texts(passage_data['Content'].tolist(), embedding_function, metadatas)
print("Time Taken --> ", time.time() - start)

Now, we will try to save the FAISS index to disk and load it back:

faiss_index.save_local("/Users/shyam/Python_Programs/LangChain_FAISS", "Standford")
loaded_faiss = FAISS.load_local("/Users/shyam/Python_Programs/LangChain_FAISS", embedding_function, "Standford")

We are done with building the FAISS index, storing, and loading. It is inference time. We will try to search some queries and test the FAISS index.

First, we will get the top five documents most similar to "atheism."

start = time.time()
results = loaded_faiss.similarity_search_with_score("What is atheism?", k=5)
print("Time Taken --> ", time.time() - start)
results

Well, the results are convincing. But still, we will explore more…

This time, we will try a query in a different language with some meta-filtering. We will ask a query in Russian, with the filtering criterion {"language": "ru_RU"}.

# "Что такое атеизм?" is Russian for "What is atheism?"
loaded_faiss.similarity_search_with_score("Что такое атеизм?", k=5, filter={"language": "ru_RU"}, fetch_k=10)

Here, fetch_k=10 tells the store to fetch the ten most similar documents from the vector store first; the filtering condition is then applied to those ten to return the top five documents for our query.

Upon concluding this article, I trust you have gained a solid comprehension of vector stores and vector search. We've also provided a demonstration of LangChain FAISS to illustrate its functionality.

I strongly encourage you to explore other vector stores such as ChromaDB, Qdrant, Milvus, and more.

Each vector store comes with its unique set of advantages and drawbacks, so make your selection based on the specific requirements of your use case.

Wishing you an enjoyable and fruitful learning journey!


Written by shyamganesh | An IT professional with a keen interest in delving into the realm of Machine Learning.