From Exact kNN to DiskANN: The Evolution of High-Performance Vector Search

Written by aadityachauhan | Published 2026/03/18
Tech Story Tags: rag | knn | rag-architecture | rag-pipelines | hnsw | vector-index-hnsw | software-architecture | scaling-vector-search

TL;DR: Scaling vector search isn't one problem. It's three. You hit a compute wall first; solve that, and you run into a memory wall; solve that, and there's a cost wall behind it. Each problem demands a completely different architectural response.

Every vector search implementation seems to follow the same trajectory: a team builds a proof of concept, it works great on the test setup, they ship it, and three months later the cluster is having performance issues and the first round of "architectural improvements" begins. Production scale invalidated all the back-of-the-envelope calculations. Here's what nobody realizes: scaling vector search isn't one problem. It's three.

  • You hit a compute wall first.

  • Solve that, and you run into a memory wall.

  • Solve that, and there's a cost wall waiting behind it.

And each problem demands a completely different architectural response. I've spent years building search systems, and this is what I've learned.

Act 1: Exact kNN: Beautifully Simple, Brutally Slow

Assume you have a query vector q ∈ ℝ^d and a dataset X = {x₁, x₂, ..., x_N}, each x_i ∈ ℝ^d. Exact k-Nearest Neighbors checks everything: it computes the distance from q to every single vector and returns the k closest ones.

For cosine similarity: sim(q, x_i) = (q · x_i) / (‖q‖ · ‖x_i‖)

Query complexity is O(N · d): d multiplications and d − 1 additions, N times. Let's do some math. Take N = 100,000 and d = 768 (a common dimension for BERT-family models). That's 76.8 million multiply-add pairs per query. A modern core running AVX-512 at around 2 GHz crunches 16 float32 multiply-adds per cycle, so you're looking at roughly 2.4 milliseconds. That's pretty amazing for most production use cases.

But O(N · d) is linear in N, and this is where the problem lies. At N = 10⁷ you're at 240ms. At 10⁸ you're at 2.4 seconds. You can throw cores at it (16 cores brings that 2.4 seconds back down to 150ms), but the linear growth curve always catches up. Jégou et al. [1] called out this limitation clearly in their paper.

What kNN gives you in return is perfect recall: Recall@k = 1.0. For small, bounded datasets, this is all you need:

import numpy as np

# Normalize once so cosine similarity reduces to a dot product
vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
query = query / np.linalg.norm(query)

# One matrix-vector product scores all N vectors at once
similarities = vectors @ query

# argpartition selects the top k in O(N), without a full sort
top_k = np.argpartition(similarities, -k)[-k:]

If you have a small dataset, just use exact kNN. There's no need to over-engineer.

Two Common "Solutions" That Don't Solve the Compute Problem

Before moving on, I want to address two approaches that come up constantly when teams first hit kNN scaling issues. Both sound reasonable.

"We can page through the data instead of loading everything at once."

This conflates two separate problems: memory and compute. You can absolutely do chunked kNN: load a batch, compute distances, maintain a running top-k heap, discard the batch, load the next one. Peak RAM stays bounded regardless of N. That is a legitimate memory management technique.

But the compute cost is completely unchanged. You are still touching every single vector. The O(N · d) FLOPs happen whether you process the dataset in one shot or in a thousand pages. At N = 10⁸ with d = 768, you're still doing ~77 billion floating point operations per query, spread across pages or not.

There's also a secondary penalty: paging serializes I/O with CPU work, adding latency on top. With everything in RAM, distance computations hit L3 cache and fast memory. With paging, you wait for disk reads between batches. For any production system with real-time latency requirements, chunked kNN makes the latency problem worse, not better.

The valid use case for chunked kNN is narrow: dataset too large to fit in RAM, query volume is low, and latency requirements are loose. For real-time search, it isn't a solution.
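To make the distinction concrete, here's a minimal sketch of chunked kNN in NumPy. The batch layout and `k` are illustrative assumptions: RAM stays bounded at one batch, but every vector is still touched.

```python
import heapq

import numpy as np

def chunked_knn(query, vector_batches, k):
    """Exact kNN over an iterable of (start_index, batch) pairs.
    Peak RAM is one batch; compute is still O(N * d)."""
    qn = query / np.linalg.norm(query)
    heap = []  # min-heap of (similarity, index): worst kept result on top
    for start, batch in vector_batches:
        bn = batch / np.linalg.norm(batch, axis=1, keepdims=True)
        for i, s in enumerate(bn @ qn):
            if len(heap) < k:
                heapq.heappush(heap, (float(s), start + i))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (float(s), start + i))
    return sorted(heap, reverse=True)  # best-first (similarity, index) pairs
```

Each batch is normalized, scored, and discarded; the heap never holds more than k entries. The total distance computations are identical to single-shot kNN, which is the whole point.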

"We shard per customer, so each query only touches one customer's data."

Per-customer sharding reframes the N in O(N · d). If you have 10 million customers but each customer has 50,000 vectors, your effective N is 50,000 not 10 million. Exact kNN at 50K vectors is completely fine, as the numbers above show. You don't need HNSW, quantization, or DiskANN. The complexity doesn't accumulate across shards because queries never cross them.

Where it breaks down is worth being honest about:

  • Customer size distribution is rarely uniform. You'll have whale customers. If your p99 customer has 10M vectors, you're back to the scaling problem for them specifically, even if your median customer is fine.
  • Customers grow. A customer with 100K vectors today might have 5M in two years. The assumption baked into this architecture needs to be revisited periodically.
  • Cross-customer queries add complexity. If you ever need to query across customers, the sharding model adds a lot of complexity.

Act 2: HNSW: The Algorithm Everyone Uses

What if we accept recall@k < 1.0 in exchange for sub-linear search time?

Many algorithms have tried to answer that. LSH, random projection trees, IVF. But the one that dominates production deployments in 2026 is HNSW: Hierarchical Navigable Small World graphs from Malkov and Yashunin's 2018 paper [2]. And it dominates for good reason: it's fast, the recall is excellent, and the open-source implementations (Hnswlib, FAISS, and the ones baked into OpenSearch, Elasticsearch, pgvector) are mature.

What HNSW actually builds

The data structure is a multi-layer graph G = {G₀, G₁, ..., G_L}. Each layer contains a subset of the vectors as nodes, with a strict containment property: V(G_L) ⊂ V(G_{L-1}) ⊂ ... ⊂ V(G₀) = X. The full dataset lives only at layer 0. Each higher layer is exponentially sparser. A node's maximum layer follows a geometric distribution with parameter m_L = 1/ln(M).

The analogy I keep coming back to (because it's the one I find easiest to grasp) is navigating a city. The top layer is highways. Middle layers are local roads. The bottom layer is every single street and alley. A query enters at the highway level, hops toward the target region, drops down a layer, refines, drops again. By the time you reach layer 0, you're already in the right neighborhood.

The expected number of distance computations per query is O(log N). For a billion vectors, log₂(10⁹) ≈ 30: thirty graph hops, each evaluating a few dozen neighbor distances. So a couple thousand distance computations total, versus a billion for brute force. This is what makes real-time vector search possible.
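To make the traversal concrete, here's a toy, pure-Python sketch of the beam search HNSW runs within a single layer. The real implementations (Hnswlib, FAISS) do this in optimized C++; the adjacency dict here is an assumption purely for illustration.

```python
import heapq

import numpy as np

def search_layer(vectors, neighbors, entry, query, ef=10, k=5):
    """Best-first beam search over one proximity-graph layer.
    `neighbors[i]` is node i's adjacency list; `ef` is the beam width."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap of nodes to expand
    best = [(-dist(entry), entry)]      # max-heap of the best ef found so far
    while frontier:
        d, node = heapq.heappop(frontier)
        if d > -best[0][0] and len(best) >= ef:
            break                       # frontier is worse than every kept result
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst
    return sorted((-d, i) for d, i in best)[:k]
```

The full algorithm runs this with ef = 1 on the upper layers to find an entry point, then with the configured ef on layer 0.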

Where the money conversation starts:
HNSW needs everything in RAM. The graph, the vectors, all of it. Let's do the arithmetic that infrastructure teams learn to dread:

  • N = 10⁹ vectors

  • d = 1536 dimensions (OpenAI's text-embedding-3-large)

  • float32 precision: 4 bytes per dimension

    Raw vectors: 10⁹ × 1536 × 4 bytes ≈ 5.7 TB

HNSW's geometric level distribution means the vast majority of nodes (over 90%) exist only at layer 0, where each node has at most 2M connections. Higher layers are exponentially sparser, so the average number of edges per node across the entire graph works out to roughly 2M. With M = 32, the graph overhead is approximately N × 2M × 4 bytes = 10⁹ × 64 × 4 ≈ 256 GB for adjacency lists, plus metadata and alignment overhead. Call it 300-400 GB for the graph structure.

Total: roughly 6-6.5 TB, dominated overwhelmingly by the raw vector data, with the graph adding another 5-7% on top. A staggering amount of RAM. At $5-8/GB, that's $30K-50K in memory alone, per replica.

And then there's the operational part that doesn't show up in any benchmark paper. In OpenSearch and Elasticsearch, every Lucene segment gets its own HNSW graph. Segments merge continuously in the background. During a merge, the JVM holds the source segments AND the new merged segment in memory simultaneously. I have seen (more than once) a merge operation push heap usage just past the tipping point, triggering a GC death spiral. Search latency goes from 5ms to 500ms cluster-wide. Nothing in the application logs explains it. You have to know to look at JVM GC pauses and correlate them with the segment merge schedule.

Act 2.5: Quantization: Buying Time Before the Architectural Leap

There's an intermediate stage between "HNSW is eating all my RAM" and "I need to rethink my entire architecture." Most teams pass through it. It's quantization: compressing vectors so they take up less memory.

Scalar Quantization is the one I recommend first, always. Map each float32 dimension to uint8:

x̂_i = round(255 · (x_i − min_i) / (max_i − min_i))
That's a 4× memory reduction, 32 bits down to 8. The recall hit on modern transformer embeddings is tiny, usually 1-2 points, because the per-dimension value ranges are bounded enough that 256 quantization bins preserve meaningful resolution.
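As a sketch, scalar quantization is a few lines of NumPy. The min/max here are learned per dimension from the data itself; production systems usually fit them on a training sample instead.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 768)).astype(np.float32)

# Per-dimension quantization range, learned from the data
mins = vectors.min(axis=0)
scale = (vectors.max(axis=0) - mins) / 255.0

quantized = np.round((vectors - mins) / scale).astype(np.uint8)  # 1 byte/dim
dequantized = quantized.astype(np.float32) * scale + mins

print(vectors.nbytes // quantized.nbytes)  # 4x smaller
```

The reconstruction error per dimension is bounded by half a quantization step, which is why recall barely moves for well-behaved embedding distributions.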

Product Quantization [1] goes further. Split the d-dimensional space into m subspaces of d/m dimensions each. Train a k-means codebook (K = 256 centroids, typically) independently in each subspace. Encode each sub-vector as its nearest centroid index: 8 bits per sub-vector. A 1536-dimensional float32 vector is 6,144 bytes. With m = 48 sub-quantizers, it becomes 48 bytes. A 128× compression ratio on the vector data itself. Your billion-vector dataset's raw vector footprint goes from 5.7 TB to roughly 45 GB.
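The encode/decode mechanics can be sketched in NumPy. One caveat: real codebooks are trained with k-means per subspace on representative data; the random codebooks below exist only to show the shapes and the compression ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, K = 1536, 48, 256   # dimensions, sub-quantizers, centroids per subspace
sub = d // m              # 32 dimensions per subspace

# Stand-in codebooks; in practice these come from k-means training
codebooks = rng.standard_normal((m, K, sub)).astype(np.float32)

def pq_encode(x):
    """Encode a d-dim vector as m uint8 centroid indices."""
    parts = x.reshape(m, sub)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(axis=2)  # (m, K)
    return dists.argmin(axis=1).astype(np.uint8)

def pq_decode(code):
    """Approximate reconstruction: concatenate the chosen centroids."""
    return codebooks[np.arange(m), code].reshape(d)

x = rng.standard_normal(d).astype(np.float32)
code = pq_encode(x)
print(x.nbytes, "->", code.nbytes)   # 6144 -> 48
```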
However, PQ only compresses the vectors. The HNSW graph adjacency lists are pointers, not vectors, and they don't compress. At M = 32, those adjacency lists still consume roughly 256-300 GB of RAM. So the realistic total for an HNSW index with PQ-compressed vectors at billion scale is more like 300-350 GB, not 45 GB. That's still a massive improvement over 6+ TB.

Binary Quantization is the aggressive end of the spectrum. One bit per dimension: positive or negative. A 1536-d vector becomes 192 bytes. But it's only acceptable if you add a re-ranking stage: oversample by a factor c (usually 5-10×), search with the binary codes for the top c·k candidates, then re-rank with full-precision vectors.
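The oversample-and-re-rank pipeline for binary quantization is short enough to sketch end to end. The dataset, c = 5, and k = 10 below are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k, c = 10_000, 128, 10, 5
vectors = rng.standard_normal((N, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# One bit per dimension (the sign), packed 8 dimensions per byte
codes = np.packbits(vectors > 0, axis=1)   # (N, d/8) uint8
qcode = np.packbits(query > 0)

# Hamming distance: XOR, then count differing bits
hamming = np.unpackbits(np.bitwise_xor(codes, qcode), axis=1).sum(axis=1)

# Oversample: keep the c*k closest binary codes as candidates...
cand = np.argpartition(hamming, c * k)[: c * k]

# ...then re-rank the candidates with full-precision cosine similarity
vn = vectors[cand] / np.linalg.norm(vectors[cand], axis=1, keepdims=True)
sims = vn @ (query / np.linalg.norm(query))
top_k = cand[np.argsort(sims)[::-1][:k]]
```

The binary stage touches 1/32 of the float32 data volume; the expensive full-precision math runs on only c·k vectors.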
Every one of these techniques trades recall for memory along a curve, and at some point the slope gets ugly. You're compressing harder and harder for diminishing memory savings while recall keeps ticking downward. That's when you have to ask: what if I just stopped fighting to keep everything in RAM?

Act 3: DiskANN

DiskANN [4] came out of Microsoft Research in 2019 and it reframes the entire question. Instead of "how do I compress my vectors to fit in memory," it asks "how do I build a graph algorithm that works well with the I/O profile of NVMe SSDs?"

**Vamana is not HNSW on disk.** This distinction matters and people get it wrong constantly. DiskANN does not take HNSW and put it on an SSD. HNSW's multi-layer structure means unpredictable traversal patterns and random reads scattered across layers, which is exactly what you don't want from disk I/O.

Vamana builds a single flat graph G = (V, E) optimized for a completely different property: minimum diameter with bounded out-degree R. The construction starts from a random R-regular graph and iteratively rebuilds each node's neighbor list through robust pruning. For node v with candidate neighbor set S, the algorithm selects neighbors greedily by distance, but skips any candidate p* if there already exists a selected neighbor p such that:

α · d(p*, p) ≤ d(p*, v)
The parameter α ≥ 1 controls how aggressively the pruning favors "directional diversity." At α = 1.0, it's plain greedy. At α = 1.2 (the value Jayaram Subramanya et al. [4] found works consistently well), the algorithm retains neighbors that point in genuinely different directions from v, even if they're slightly farther away. This is what keeps the graph diameter small.
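The pruning rule is easy to express directly. A minimal sketch, with the caveat that the distance function and data layout are assumptions for illustration, and the real construction runs this repeatedly over greedy-search visit lists:

```python
import numpy as np

def robust_prune(v, candidates, vectors, alpha=1.2, R=64):
    """Greedy neighbor selection with the alpha-coverage rule from the text:
    skip p_star if some kept p satisfies alpha*d(p_star, p) <= d(p_star, v)."""
    dist = lambda a, b: float(np.linalg.norm(vectors[a] - vectors[b]))
    kept = []
    for p_star in sorted(candidates, key=lambda p: dist(p, v)):
        if any(alpha * dist(p_star, p) <= dist(p_star, v) for p in kept):
            continue   # covered by an existing neighbor; prune for diversity
        kept.append(p_star)
        if len(kept) == R:
            break
    return kept
```

At α = 1.0 a candidate is skipped whenever it sits closer to a kept neighbor than to v; raising α makes the skip condition harder to satisfy, so more directionally diverse (including longer-range) edges survive.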

How the pieces fit together

Two tiers.

**RAM tier:** PQ-compressed vectors (for making approximate routing decisions during traversal) plus the full graph adjacency lists. For a billion vectors with 48-byte PQ codes and R = 64 neighbors stored as 4-byte IDs:
Memory = (48 + 64 × 4) × 10⁹ = 304 GB
Compare to 6+ TB for in-memory HNSW.
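That comparison, as a quick back-of-the-envelope sanity check using the figures from this section:

```python
N = 10**9  # vectors

# In-memory HNSW: float32 vectors plus adjacency lists (M = 32, ~2M edges/node)
d, M = 1536, 32
hnsw_bytes = N * d * 4 + N * 2 * M * 4

# DiskANN RAM tier: 48-byte PQ codes plus flat-graph adjacency (R = 64, 4-byte IDs)
pq_bytes, R = 48, 64
diskann_ram_bytes = N * (pq_bytes + R * 4)

print(f"HNSW RAM:    {hnsw_bytes / 1e12:.2f} TB")       # 6.40 TB
print(f"DiskANN RAM: {diskann_ram_bytes / 1e9:.0f} GB")  # 304 GB
```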

**SSD tier:** Full-precision vectors, laid out aligned to 4 KB page boundaries. A single 1536-d float32 vector is 6,144 bytes and spans two page reads.

Search does a beam search with width W over the in-memory graph, routing by PQ distances. For each promising candidate, it fires an async I/O to fetch the full vector from SSD. Modern NVMe drives do random 4 KB reads at 500K-1M IOPS, ~100μs each.

Now, a critical detail about the I/O pattern: the hops are sequential, not parallel. You cannot determine which nodes to fetch at hop k+1 until you've evaluated the distances from hop k, a strict data dependency chain. What you can parallelize is the W reads within a single hop. So the latency model is more like:
total I/O latency ≈ num_hops × (per-read latency for W parallel reads)

With ~50-100 hop iterations and 100μs per hop (NVMe random read with sufficient queue depth), you're looking at 5-10 milliseconds of I/O latency. DiskANN mitigates this partly by using PQ distances from the in-memory tier for initial routing: many hops are resolved entirely in RAM, and only the most promising candidates trigger actual disk reads. Jayaram Subramanya et al. [4] report that the effective number of disk reads per query is typically 30-70 for billion-scale datasets, not one per hop. Still, the latency is meaningfully higher than pure in-memory HNSW, and I'd be skeptical of anyone quoting sub-millisecond DiskANN latencies at billion scale.

Where DiskANN will bite you

I need to be honest about the failure modes because I've watched teams adopt this expecting it to be a plug-and-play HNSW replacement. It isn't.

**Writes are painful.** Vamana assumes a static dataset. Adding or removing vectors means restructuring the graph: finding neighbors, updating adjacency lists, maintaining the low-diameter invariant. It's nothing like HNSW's relatively graceful concurrent inserts. Singh et al. [5] proposed FreshDiskANN with a streaming two-tier approach, but "streaming billion-scale graph updates" is operationally complex no matter how you cut it.
**Metadata filters break things.** Your users want "find me similar items, but only in category X." The Vamana graph has no concept of categories. Post-filter and you waste I/O retrieving vectors that don't match. Pre-filter and the graph fragments into disconnected subgraphs with terrible recall. Gollapudi et al. [6] proposed label-partitioned graph variants, but honestly, achieving consistently high recall under arbitrary filter predicates remains one of the hardest unsolved problems in this space.
**Tail latency is a different beast.** Average and p50 latency look great. p99 and p999 depend on things HNSW never makes you think about: SSD queue depth, NVMe thermal throttling, kernel I/O scheduler behavior, whether some other process is also hammering the same drive. For strict tail-latency SLAs (say, sub-10ms p99), you need to over-provision IOPS significantly.

DiskANN isn't the only disk-based approach either. SPANN [7] uses hierarchical clustering to partition the vector space, stores partitions on disk, and routes queries via an in-memory centroid index which works well when your data has natural cluster structure. HM-ANN [8] keeps HNSW's multi-layer design but pushes the fat bottom layers to SSD. GPU-accelerated search (FAISS-GPU, cuVS) is another axis entirely. The field hasn't converged yet. I'd be suspicious of anyone who tells you it has.

The Decision Framework

| Scale | What to Use | Things That Will Bite You If You Ignore Them |
| --- | --- | --- |
| < 10⁵ | Exact kNN. NumPy, scikit-learn, your database's distance functions. | Over-engineering. A vector DB for 50K vectors is a net negative. |
| 10⁵ – 5×10⁷ | HNSW + scalar quantization. FAISS, Hnswlib, OpenSearch, pgvector. | Set ef_construction high at build time. You cannot fix a bad graph later. Enable SQ from day one. |
| 5×10⁷ – 5×10⁸ | HNSW + PQ, or start evaluating DiskANN. Depends on your recall floor and latency budget. | PQ codebook quality degrades when your embedding distribution shifts. Monitor recall continuously. |
| > 5×10⁸ | DiskANN, or a hybrid (HNSW for recent/hot vectors, DiskANN for the tail). | Index rebuild takes hours at this scale. SSD write endurance matters. Plan your deployment pipeline. |

References

[1] H. Jégou, M. Douze, and C. Schmid, "Product Quantization for Nearest Neighbor Search," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117–128, 2011.
[2] Y. A. Malkov and D. A. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 42, no. 4, pp. 824–836, 2020.
[3] R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar, "Accelerating Large-Scale Inference with Anisotropic Vector Quantization," Proc. ICML, 2020.
[4] S. Jayaram Subramanya, F. Devvrit, H. V. Simhadri, R. Krishnaswamy, and R. Kadekodi, "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node," Proc. NeurIPS, 2019.
[5] A. Singh, S. Jayaram Subramanya, R. Krishnaswamy, and H. V. Simhadri, "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search," arXiv:2105.09613, 2021.
[6] S. Gollapudi et al., "Filtered-DiskANN: Graph Algorithms for Approximate Nearest Neighbor Search with Filters," Proc. ACM Web Conference, 2023.
[7] Q. Chen, B. Zhao, H. Wang, M. Li, C. Liu, Z. Li, M. Yang, and J. Wang, "SPANN: Highly-efficient Billion-scale Approximate Nearest Neighborhood Search," Proc. NeurIPS, 2021.
[8] S. Ren, Y. Zhao, and G. H. Chen, "HM-ANN: Efficient Billion-Point Nearest Neighbor Search on Heterogeneous Memory," Proc. NeurIPS, 2020.


Written by aadityachauhan | Senior Software Engineer building enterprise search, RAG pipelines, distributed systems and data infrastructure.
Published by HackerNoon on 2026/03/18