Simplifying Vector Search: Part 1

Written by anywhichway | Published 2023/06/02
Tech Story Tags: vector-search | large-language-models | cosine-similarity | lmdb | llms | search | vector | tutorial | web-monetization

TL;DR: Part One of a two-part series on vector search and language processing. This first part covers how vector search finds matches for data. Part Two will cover how to code vector search.

Ever wondered how Spotify or any number of dating sites figure out what or who you like? Wonder no more: they use vector-based searching.

With the growing interest in Large Language Models, there has been a corresponding surge in interest in vector databases. However, explanations of how vector databases work often tend to focus on language processing and utilize terms like "embeddings," "n-dimensionality," and other language-specific nuances. This can make understanding these explanations challenging.

If you're looking to understand how both Spotify and ChatGPT work without getting caught up in the complexities of natural language processing, Part One of this series is for you. It provides simpler examples and avoids unnecessary complexities such as n-grams, stop words, pluralization, tensing, and so on. Natural language processing is briefly discussed, but the emphasis is on using simpler examples to facilitate understanding. Please note that Part Two of the series, to be posted soon, may also be of interest to you if you have coding knowledge.

Second, when one incorporates “database” in the mix, things like data validation and indexing tend to obscure some of the key principles, so I focus on just searching, i.e. finding matches for data.

What Are Vectors?

In the context of vector search, vectors are one-dimensional lists of numbers. That’s right, “one-dimensional”. All the talk of “n-dimensions” can be confusing; it really refers to the number of elements in the list. Those elements are the inputs to formulas that calculate similarity based on distance (closer is more similar), and computing distance is something most people have only learned to do in one, two, or three dimensions, e.g. using data of the form [x], [x,y], and [x,y,z].

Let’s use a 1-dimensional distance as an example. If there are three people at these distances from you, which is the most similar in location to you?

Joe: 1 block
Mary: 10 blocks
Bill: 20 blocks

I hope you said “Joe”! You can think of each of these as a vector:

Joe: [1]
Mary: [10]
Bill: [20]

Now what if someone told you how far each person was in terms of distances North, South, East, and West?

Joe: [2N,3W]
Mary: [2.5N,2E]
Bill: [2.5S,2.25E]

Remember the Pythagorean Theorem from high-school math, a² + b² = c²? Drop the directional portions of the numbers, because they are not relevant to calculating distance, and you will see that Mary is now closest as the crow flies, i.e. has the most similar location.

Joe: [2,3] = sqrt(2² + 3²) = 3.6
Mary: [2.5,2] = sqrt(2.5² + 2²) = 3.2
Bill: [2.5,2.25] = sqrt(2.5² + 2.25²) = 3.36

The result is called the Euclidean Distance and it turns out that this formula (and other distance formulas) can be applied regardless of how many numbers are in the array.

Although the generalized formula, distance = sqrt((p₁ - q₁)² + (p₂ - q₂)² + … + (pₙ - qₙ)²), may not look like the Pythagorean Theorem, and the result is not what you and I usually think of as distance, it is mathematically a distance.

You can find more info on this math in the Wikipedia article on Euclidean Distance. There are also other distance formulas, one is even called Manhattan Distance. It is effectively the same as treating each item in a vector as the number of blocks walked in Manhattan (or any other city with a grid of streets :-).
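Both formulas are short enough to sketch in JavaScript (the style used for the data samples later in this article); note that each function accepts vectors of any length, i.e. any "dimensionality":

```javascript
// Euclidean distance: square root of the sum of squared differences.
const euclideanDistance = (a, b) =>
  Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));

// Manhattan distance: sum of absolute differences ("blocks walked").
const manhattanDistance = (a, b) =>
  a.reduce((sum, value, i) => sum + Math.abs(value - b[i]), 0);

// The two-dimensional example above, measured from your location [0, 0]
console.log(euclideanDistance([0, 0], [2.5, 2]).toFixed(2)); // "3.20" (Mary)
```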

Handling Non-Numeric Data

So what about non-numeric data? How can it be converted into numbers for use by distance formulas?

Given these kinds of vehicles: compact, truck, sub-compact, standard, SUV; which is closest to an SUV, a compact or a truck? Most people would say a truck.

The easiest thing to do is just ensure values are in an order that implies a similarity, e.g. [sub-compact, compact, standard, SUV, truck] and use the position differences as distance.

sub-compact: 1
compact: 2
standard: 3
SUV: 4
truck: 5
SUV to compact = |4 - 2| = distance 2
SUV to truck = |4 - 5| = distance 1
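The position-based approach above can be sketched as follows (the list and helper names are mine, for illustration):

```javascript
// Ordinal encoding: each string's similarity is implied by its
// position in an ordered list; position differences act as distances.
const vehicleTypes = ["sub-compact", "compact", "standard", "SUV", "truck"];

// 1-based position, matching the numbering above
const typePosition = (type) => vehicleTypes.indexOf(type) + 1;
const typeDistance = (a, b) => Math.abs(typePosition(a) - typePosition(b));

console.log(typeDistance("SUV", "compact")); // 2
console.log(typeDistance("SUV", "truck"));   // 1
```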

Of course, there are some values that do not have a straightforward relative ordering, e.g. car models like Corvair, Bel Air, and Camry. We will keep our example simple and ignore these; however, there are advanced language processing techniques that can analyze large volumes of data to create an ordered similarity vector for car models.

There are also some other values, like colors, that need special treatment. I will cover those later.

An Example

Let’s continue with cars and use choosing a used car as an example of vector search. Assume these cars are available:

{make:"Chevy",model:"Corvair",type:"compact",year:1964,color:"red",price:3500},
{make:"Chevy",model:"Corvair",type:"compact",year:1965,color:"orange",price:4500},
{make:"Chevy",model:"Bel Air",type:"standard",year:1957,color:"red",price:3500},
{make:"Chevy",model:"Bel Air",type:"standard",year:1958,color:"blue",price:4500},
{make:"Toyota",model:"Camry",type:"compact",year:2018,color:"black",price:25000},
{make:"Toyota",model:"Camry",type:"compact",year:2019,color:"green",price:35000}

If you want a $3,750 car, the 1964 Corvair and the 1957 Bel Air are closest. They have the same difference in price, $250.

If you want a 1959 $3,750 car, the 1957 Bel Air is closest since it is closer in years, 2, than the Corvair, 5.

Disregarding the values for model and color, and allowing only the types compact and standard, the data above reduces to the vectors below. Position one is the type (1 for compact and 2 for standard), position two is the year, and position three is the price.

Corvair 1: [ 1, 1964, 3500 ]
Corvair 2: [ 1, 1965, 4500 ]
Bel Air 1: [ 2, 1957, 3500 ]
Bel Air 2: [ 2, 1958, 4500 ]
Camry 1: [ 1, 2018, 25000 ]
Camry 2: [ 1, 2019, 35000 ]
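A sketch of how each car record could be reduced to such a vector (toVector is a hypothetical helper name):

```javascript
// Reduce a car record to [type position, year, price], ignoring
// model and color as described above. Type positions: compact = 1,
// standard = 2.
const types = ["compact", "standard"];
const toVector = (car) => [types.indexOf(car.type) + 1, car.year, car.price];

console.log(toVector({model: "Corvair", type: "compact", year: 1964, color: "red", price: 3500}));
// [ 1, 1964, 3500 ]
```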

The distance between each item can be represented as a table:

| Car | Corvair 1 | Corvair 2 | Bel Air 1 | Bel Air 2 | Camry 1 | Camry 2 |
| --- | --- | --- | --- | --- | --- | --- |
| Corvair 1 | 0 | 1000 | 7 | 1000 | 21500 | 31500 |
| Corvair 2 | 1000 | 0 | 1000 | 7 | 20500 | 30500 |
| Bel Air 1 | 7 | 1000 | 0 | 1000 | 21500 | 31500 |
| Bel Air 2 | 1000 | 7 | 1000 | 0 | 20500 | 30500 |
| Camry 1 | 21500 | 20500 | 21500 | 20500 | 0 | 10000 |
| Camry 2 | 31500 | 30500 | 31500 | 30500 | 10000 | 0 |

Normalizing Values

It probably seems odd that Corvair 1 is so much closer to Bel Air 1 than to Corvair 2. It is closer because the year is on a very different scale than the price. One dollar does not equal one year. To address this, numbers can be normalized to a range from zero to one.

For regular numbers, we can use the formula:

(number - min possible number)/(max possible number - min possible number)

For ordered lists of strings, we can use the formula:

position of string / number of strings

I will use 1886 (when Benz patented the car) and next year as the minimum and maximum for car year and the minimum and maximum of prices in the dataset itself as the minimum and maximum for the car price. Our vectors now look like this:

Corvair 1: [ 0.5, 0.11290322580645161, 0 ]
Corvair 2: [ 0.5, 0.12903225806451613, 0.031746031746031744 ]
Bel Air 1: [ 1, 0, 0 ]
Bel Air 2: [ 1, 0.016129032258064516, 0.031746031746031744 ]
Camry 1: [ 0.5, 0.9838709677419355, 0.6825396825396826 ]
Camry 2: [ 0.5, 1, 1 ]
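The two normalization formulas can be sketched as follows; the year min and max used here (1957 and 2019, the extremes of the dataset itself) are the values that reproduce the normalized vectors above:

```javascript
// Normalize a number to the 0..1 range given its possible min and max
const normalizeNumber = (value, min, max) => (value - min) / (max - min);

// Normalize a string via its 1-based position in an ordered list
const normalizeString = (value, orderedList) =>
  (orderedList.indexOf(value) + 1) / orderedList.length;

console.log(normalizeNumber(1964, 1957, 2019));  // ≈ 0.1129 (Corvair 1 year)
console.log(normalizeNumber(4500, 3500, 35000)); // ≈ 0.0317 (Corvair 2 price)
console.log(normalizeString("compact", ["compact", "standard"])); // 0.5
```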

And our distance table now looks like this:

| Car | Corvair 1 | Corvair 2 | Bel Air 1 | Bel Air 2 | Camry 1 | Camry 2 |
| --- | --- | --- | --- | --- | --- | --- |
| Corvair 1 | 0 | 0.2 | 0.61 | 0.73 | 1.51 | 1.77 |
| Corvair 2 | 0.2 | 0 | 0.65 | 0.64 | 1.42 | 1.66 |
| Bel Air 1 | 0.61 | 0.65 | 0 | 0.4 | 1.55 | 1.8 |
| Bel Air 2 | 0.73 | 0.64 | 0.4 | 0 | 1.42 | 1.64 |
| Camry 1 | 1.51 | 1.42 | 1.55 | 1.42 | 0 | 0.38 |
| Camry 2 | 1.77 | 1.66 | 1.8 | 1.64 | 0.38 | 0 |

As you can see, the Corvairs are now closer to each other than they are to Bel Airs.

Searching Data

To search data, a vector is created to represent the search criteria, e.g.

{type:"standard",year:1959,price:3750} converts to [ 1, 0.519, 0.008 ]

Now we can calculate the distance from our search vector to the vectors for each car. Below you will see that the 1957 Bel Air is the closest match, but the 1958 is not far behind.

[
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1957,
      color: 'red',
      price: 3500
    },
    distance: 0.01680675150716255
  },
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1958,
      color: 'blue',
      price: 4500
    },
    distance: 0.02493517813322366
  },
  {
    description: {
      make: 'Chevy',
      type: 'compact',
      model: 'Corvair',
      year: 1964,
      color: 'red',
      price: 3500
    },
    distance: 0.5014326777053019
  },
  {
    description: {
      make: 'Chevy',
      type: 'compact',
      model: 'Corvair',
      year: 1965,
      color: 'orange',
      price: 4500
    },
    distance: 0.5025357719267471
  },
  {
    description: {
      make: 'Toyota',
      type: 'compact',
      model: 'Camry',
      year: 2018,
      color: 'black',
      price: 25000
    },
    distance: 0.9466207344690871
  },
  {
    description: {
      make: 'Toyota',
      type: 'compact',
      model: 'Camry',
      year: 2019,
      color: 'green',
      price: 35000
    },
    distance: 1.1965453758561524
  }
]
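The whole search can be sketched end to end. The vectors below are the normalized ones from earlier (rounded), and the query is normalized the same way, so the exact distances differ slightly from the output shown above, but the two Bel Airs still rank closest:

```javascript
// End-to-end sketch: normalize the query the same way as the data,
// compute the distance to every car, and sort ascending (closest first).
const euclidean = (a, b) =>
  Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));

const cars = [
  {name: "Corvair 1", vector: [0.5, 0.1129, 0]},
  {name: "Corvair 2", vector: [0.5, 0.1290, 0.0317]},
  {name: "Bel Air 1", vector: [1, 0, 0]},
  {name: "Bel Air 2", vector: [1, 0.0161, 0.0317]},
  {name: "Camry 1",   vector: [0.5, 0.9839, 0.6825]},
  {name: "Camry 2",   vector: [0.5, 1, 1]}
];

const search = (queryVector) =>
  cars
    .map((car) => ({...car, distance: euclidean(queryVector, car.vector)}))
    .sort((a, b) => a.distance - b.distance);

// {type:"standard", year:1959, price:3750} normalized consistently
const results = search([1, 0.0323, 0.0079]);
console.log(results.map((car) => car.name)); // Bel Airs first, Camrys last
```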

Handling Importance

What if the year is more important than the price to the buyer? We can simply provide an importance factor or weight to the search vector.

{type:"standard",year:{weight:1.5,value:1959},price:3750} 
converts to
[ 1, 0.778, 0.008 ]

Note that 0.778 above is just 1.5 * 0.519 (the original normalized value for year).

Now the 1958 Bel Air comes out on top:

[
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1958,
      color: 'blue',
      price: 4500
    },
    distance: 0.26772748184515416
  },
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1957,
      color: 'red',
      price: 3500
    },
    distance: 0.27418896082407707
  },
...
]
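The weight handling above can be sketched as follows (criterionToWeighted and the stub normalizer are hypothetical names for illustration):

```javascript
// A criterion may be a plain value or {weight, value}; plain values
// get a weight of 1. The weight multiplies the normalized value
// before distances are computed.
const criterionToWeighted = (criterion, normalize) => {
  const {weight = 1, value} =
    typeof criterion === "object" ? criterion : {value: criterion};
  return weight * normalize(value);
};

// Stub returning 0.519, the normalized year value used above
const normalizeYear = () => 0.519;
console.log(criterionToWeighted({weight: 1.5, value: 1959}, normalizeYear)); // ≈ 0.7785
console.log(criterionToWeighted(1959, normalizeYear));                       // 0.519
```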

Handling Special Values

Some values, like color and geographic location, can be represented as vectors themselves. Colors on computers are a combination of red, green, and blue. You have probably seen the notation RGB(0,255,125) before. This can be represented as [0,255,125]. Normalized, it would be [0/256, 255/256, 125/256], which is [0, 0.996, 0.488]. Now, any time we encounter a color, we can insert three normalized numeric values corresponding to the RGB value of the color into our vector.
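A sketch of expanding a color name into three normalized vector slots (the small color table is an assumption for illustration; real code would use a complete named-color lookup):

```javascript
// Map a color name to its RGB channels, then divide by 256 to
// normalize, matching the approach described above.
const namedColors = {
  red:       [255, 0, 0],
  green:     [0, 128, 0],
  blue:      [0, 0, 255],
  turquoise: [64, 224, 208]
};

const colorToVector = (name) => namedColors[name].map((channel) => channel / 256);

console.log(colorToVector("turquoise")); // [ 0.25, 0.875, 0.8125 ]
```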

Searching for a car with {color: "turquoise"} (none of which exist) will return results ordered with blue or green cars at the top.

[
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1958,
      color: 'blue',
      price: 4500
    },
    distance: 0.9319894757612032
  },
  {
    description: {
      make: 'Toyota',
      type: 'compact',
      model: 'Camry',
      year: 2019,
      color: 'green',
      price: 35000
    },
    distance: 0.93277294100822
  },
  {
    description: {
      make: 'Chevy',
      type: 'compact',
      model: 'Corvair',
      year: 1965,
      color: 'orange',
      price: 4500
    },
    distance: 1.1313300702257503
  },
  {
    description: {
      make: 'Toyota',
      type: 'compact',
      model: 'Camry',
      year: 2018,
      color: 'black',
      price: 25000
    },
    distance: 1.2247354538630986
  },
  {
    description: {
      make: 'Chevy',
      type: 'compact',
      model: 'Corvair',
      year: 1964,
      color: 'red',
      price: 3500
    },
    distance: 1.413511990623187
  },
  {
    description: {
      make: 'Chevy',
      type: 'standard',
      model: 'Bel Air',
      year: 1957,
      color: 'red',
      price: 3500
    },
    distance: 1.413511990623187
  }
]

There are also special functions, often called embedding models, for converting sounds, images, and documents into standardized vectors.

Large Language Models

Most large language models process massive databases of text and create a vast number of vectors (billions) that correlate to the count and positioning of word fragments (n-grams, de-tensed words), words, sentence fragments, or even concepts relative to each other (extraction of concepts is another topic). These vectors are then indexed (this gets way more complex than vector search) or even have their distances pre-computed (an expensive combinatoric process).

When you type something into many LLMs, a search vector is created and a list of the most likely next words, sentence fragments, or sentences is retrieved based on distance. An LLM will not always take the first item it finds, and the process is iterative: it randomly selects a result from the top few in the list and then uses it as part of another search to add more text. The random selection gives the output a more humanlike feel.

In Summary

  1. Vectors are just lists of numbers.
  2. Strings can be mapped to numbers.
  3. Some values, like color, can be represented as vectors themselves.
  4. Numbers should typically be normalized to a range of 0 to 1.
  5. Almost any data can be mapped to vectors. There are special functions for converting sound, images, and documents into vectors.
  6. Dimensionality in the context of vector search is just the number of items in the list of numbers representing the underlying data.
  7. Standard distance formulas can be applied to more than 2 or 3 dimensions so that the distance between two vectors can be computed.
  8. Something that is a short distance away is more similar.

I hope you have found this article useful in understanding vector-based search.

If you are a programmer, you can explore the $distance operator in LMDB Index, an extension to the highly performant, ACID-compliant, key-value store LMDB. You can also watch for Part Two, which will include simplified code focused just on vector search.


