Vector search effectively delivers semantic similarity for retrieval augmented generation, but it does poorly with short keyword searches or out-of-domain search terms. Supplementing vector retrieval with keyword searches like BM25 and combining the results with a reranker is becoming the standard way to get the best of both worlds.

Rerankers are ML models that take a set of search results and reorder them to improve relevance. They examine the query paired with each candidate result in detail, which is computationally expensive but produces more accurate results than simple retrieval methods alone. This can be done either as a second stage on top of a single search (pull 100 results out of vector search, then ask the reranker to identify the top 10) or, more often, to combine results from different kinds of search; in this case, vector search and keyword search.

But how good are off-the-shelf rerankers? To find out, I tested six rerankers on the text from the ViDoRe benchmark, using Gemini Flash to extract text from the images. Details on the datasets can be found in section 3.1 of the ColPali paper. Notably, TabFQuAD and Shift Project sources are in French; the rest are in English.

We tested these rerankers:

Reciprocal Rank Fusion (RRF), a formula for combining results from multiple sources without knowing anything about the queries or documents; it depends purely on relative ordering within each source. RRF is used in Elastic and LlamaIndex, among other projects.

Cohere Rerank v3 and Jina Reranker v2, probably the most popular hosted models.

BGE-reranker-v2-m3, the highest-scoring open source model (Apache licensed).

Voyage rerank-2 and rerank-2-lite, freshly released (in September) by a solid company.

The rerankers were fed the top 20 results from both DPR and BM25, and the reranked NDCG@5 was evaluated. In the results, raw vector search (with embeddings from the bge-m3 model) is labeled dpr (dense passage retrieval).
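To make the setup concrete, here is a minimal sketch of the two pieces that need no model: fusing two ranked lists with RRF and scoring the fused list with NDCG@5. It is not the code used for the benchmark. The function names, the toy document ids, the constant k=60 (the value conventionally used with RRF) and the binary relevance judgments are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is the conventional constant and is an
    assumption here, not a value taken from the article.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def ndcg_at_k(ranked_ids, relevant_ids, k=5):
    """NDCG@k, assuming binary relevance (a document is relevant or not)."""
    # Discounted gain of the relevant documents found in the top k.
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Best achievable DCG: all relevant documents ranked first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical results for one query (the benchmark used the top 20 from each).
dpr_top = ["d3", "d1", "d7"]
bm25_top = ["d1", "d9", "d3"]

fused = rrf_fuse([dpr_top, bm25_top])
score = ndcg_at_k(fused, relevant_ids={"d1"}, k=5)
```

In this toy case d1 ends up first because both sources rank it highly, even though neither source agrees on the rest of the order. An ML reranker would replace `rrf_fuse` with a model call that scores the query against each candidate's text, while the evaluation step stays the same.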
BGE-m3 was chosen to compute embeddings because that’s what the ColPali authors used as a baseline.

Here’s the data on relevance (NDCG@5):

And here’s how fast they are at reranking searches in the arxiv dataset; latency is proportional to document length. This is graphing latency, so lower is better. The self-hosted bge model was run on an NVIDIA 3090 using the simplest possible code lifted straight from the Hugging Face model card.

Finally, here’s how much it cost with each model to rerank the almost 3,000 searches from all six datasets. Cohere prices per search (with additional fees for long documents), while the others price per token.

Analysis

All the models do roughly as well on the French datasets as they do on the English ones.

Cohere is significantly more expensive and offers slightly (but consistently) worse relevance than the other ML rerankers, but it’s 3x faster than the next-fastest services. Additionally, Cohere’s standard rate limits are the most generous.

Voyage rerank-2 is the king of reranking relevance in all datasets, for an additional hit to latency. Notably, it’s the only model that does not do worse than DPR alone in the arxiv dataset, which seems to be particularly tricky.

Voyage rerank-2-lite and Jina Reranker v2 are very, very similar: they’re the same speed, hosted at the same price, and close to the same relevance (with a slight edge to Voyage). But Voyage’s standard rate limit is double Jina’s, and with Voyage you get a “real” Python client instead of having to make raw HTTP requests.

BGE-reranker-v2-m3 is such a lightweight model (under 600M parameters) that even on an older consumer GPU it is usably fast.

Conclusion

RRF adds little to no value to hybrid search scenarios; on half of the datasets, it performed worse than either BM25 or DPR alone. In contrast, all ML-based rerankers tested delivered meaningful improvements over pure vector or keyword search, with Voyage rerank-2 setting the bar for relevance.
Tradeoffs are still present: superior accuracy from Voyage rerank-2, faster processing from Cohere, or solid middle-ground performance from Jina or Voyage's lite model. Even the open-source BGE reranker, while trailing commercial options, adds significant value for teams choosing to self-host.

As foundation models continue advancing, we can expect even better performance. But today's ML rerankers are already mature enough to deploy with confidence across multilingual content.

By Jonathan Ellis, DataStax

Vector Search: A Reranker Algorithm Showdown
