Multi-Vector Embeddings Fixed My Recruitment Search

Written by romiteld | Published 2026/03/13
Tech Story Tags: ai | multi-vector-embeddings | typed-embeddings | candidate-search | recruitment-search | semantic-search | vector-search | azure-ai-search

TL;DR: Why I replaced one pooled embedding with four typed vectors to make recruitment search sharper, cheaper to reindex, and safer to scale.

I didn’t switch to multi‑vector embeddings because it was trendy.

I did it because a single pooled vector kept lying to my search.

When you collapse a candidate into one embedding, you’re asking one point in space to simultaneously represent “career arc”, “licenses”, “skills”, and “general profile vibe”. In a recruitment dataset—where a single designation like CFP/CFA can be the gating factor—that pooling turns into a weird kind of blur: the vector is kind of about everything, which means it’s not sharply about the thing you’re filtering on.

So I built an Embedding Agent that generates four parallel embeddings for multi‑vector search:

  • profile_vector: overall candidate profile embedding
  • experience_vector: work experience and career history
  • skills_vector: skills, designations, and certifications
  • general_vector: general-purpose embedding for broad matching

That design choice immediately cascaded into engineering decisions people don’t talk about enough:

  • cache key schema (you can’t cache “the embedding” anymore—you cache a typed embedding under a model)
  • index layout (Azure AI Search needs to know which vector fields exist)
  • query-time composition (your search stack decides how to use one vs many vectors)
  • reindex/backfill strategy (you need a way to reconcile inconsistent embeddings without redoing everything)

The rest of this post is the under-the-hood view of how I wired those pieces together in embedding_agent.py and the generator job that produces embeddings at scale.

Key insight: typed vectors turn “one expensive truth” into “four cheap, composable facts”

The non-obvious win of multi‑vector isn’t better math. It’s better control surfaces.

With a single pooled vector, every change is global:

  • new model? regenerate everything
  • new field? regenerate everything
  • bad content? it poisons the one representation you have

With typed vectors, I can treat each embedding like a separate instrument channel in a mix. If the “skills” channel is wrong, I don’t have to remix the whole song.

That’s why the Embedding Agent is explicit about its outputs:

  • it generates 4 parallel embeddings
  • it uses OpenAI text-embedding-3-large
  • it fixes the dimensionality at 3072
  • it implements Redis caching with a 24hr TTL

Those details aren’t just trivia—they drive how I designed the cache keys, storage blobs, and reindex loops.
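The contract above can be sketched as a small typed structure. This is an illustrative shape, not the agent's actual class (the real `embedding_agent.py` returns its own structure); the constants match the agent's documented model and dimensionality:

```python
from dataclasses import dataclass
from typing import List

EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072  # native dims for text-embedding-3-large


@dataclass
class CandidateEmbeddings:
    """Typed multi-vector output: four parallel 3072-dim vectors, one model."""
    profile_vector: List[float]
    experience_vector: List[float]
    skills_vector: List[float]
    general_vector: List[float]
    model: str = EMBEDDING_MODEL

    def validate(self) -> None:
        # Dimension mismatch is treated as fatal, never a warning.
        for name in ("profile_vector", "experience_vector",
                     "skills_vector", "general_vector"):
            vec = getattr(self, name)
            if len(vec) != EMBEDDING_DIMENSIONS:
                raise ValueError(
                    f"{name}: expected {EMBEDDING_DIMENSIONS} dims, got {len(vec)}"
                )
```

Because the shape is always the same, downstream systems (cache, blob store, indexer) can validate once and stop guessing.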

(If you want a concrete precedent for decomposing profiles into section-specific vectors, LinkedIn has publicly described encoding structured profile sections separately—Summary, Experience, Education—to enable more granular matching in profile search.) [https://www.linkedin.com/blog/engineering/search/reimagining-linkedins-search-stack]

How it works: producer → cache → persistent blob → index rebuild

Here’s the shape of the pipeline I run:

The generator job is the producer: it feeds records to the agent.

The agent is responsible for generating and caching multi-vector embeddings—and for returning consistent shapes even when things go wrong.

The persistent store holds the multi-vector blob so reindexing doesn’t need to call the model again.

The index rebuild process reads those blobs and pushes them into the search index’s vector fields.

The “hash-first delta sync” loop is how I keep backfills from turning into full regenerations: I compare hashes first, then only regenerate what’s actually different.

The Embedding Agent: four vectors, one contract

The most important thing I did in app/agents/embedding_agent.py was make the contract boring.

It always returns the same typed structure, and it’s explicit about the model and dimensionality:

"""
Embedding Agent - Specialized agent for generating and caching multi-vector embeddings.

Generates 4 parallel embeddings for multi-vector search:
- profile_vector: Overall candidate profile embedding
- experience_vector: Work experience and career history
- skills_vector: Skills, designations, and certifications
- general_vector: General-purpose embedding for broad matching

Uses OpenAI text-embedding-3-large (3072 dimensions) for high-quality embeddings.
Implements Redis caching with 24hr TTL to minimize API costs.
"""

import asyncio
import hashlib
import json
import logging
import os
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

from openai import AsyncOpenAI

from .base import BaseAgent, AgentConfig, AgentResponse, AgentType

logger = logging.getLogger(__name__)

# Constants
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072  # text-embedding-3-large native dimensions
# EMBEDDING_TTL_ ... (defined in the source file)

What surprised me here is how much reliability comes from naming. Once I had stable names like skills_vector, every downstream system stopped guessing.

A note on sourcing: this post is grounded in the actual codebase. Where internals aren’t shown, I describe the contract and design intent rather than inventing implementation details. The contract is clear enough to design everything else around.

Why the naive approach fails

The naive approach is: “generate one embedding for the whole candidate record and call it a day.”

That fails in two ways:

  1. Semantic dilution: credentials and licenses become a small part of a large text, so similarity isn’t sharp where it matters.
  2. Operational coupling: any change forces full regeneration.

Typed vectors fix both: I can query on “skills” when the user cares about designations, and I can regenerate only the vectors whose inputs changed.

The tradeoff

Multi‑vector increases storage and index complexity. You now have four vectors to store, four to index, and decisions to make at query-time.

I accepted that trade because the system already has multiple search modes (vault candidates vs jobs vs notes vs transcripts), and the cost of “wrong matches” is recruiter time.

Cache key design: hash field subsets + model id

Once you stop thinking of “the embedding” as a single thing, you need a cache key schema that makes collisions hard and invalidation obvious.

The Embedding Agent’s imports and docstring tell us it uses:

  • hashlib
  • json
  • Redis caching with a 24hr TTL
  • a model identifier: text-embedding-3-large

So the key design I anchored on is:

  • include the model id in the key
  • include the vector type (profile_vector, experience_vector, skills_vector, general_vector)
  • hash the canonical JSON of the input subset that feeds that vector

The important engineering property is: the hash must be computed from only the fields that matter for that vector.

That’s what makes typed vectors worth it: the skills_vector key doesn’t churn when you edit a candidate’s biography, and the experience_vector key doesn’t churn when you add a new designation.
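A sketch of that key schema, assuming hypothetical field subsets (the real agent defines its own mapping of candidate fields to vector types):

```python
import hashlib
import json

EMBEDDING_MODEL = "text-embedding-3-large"

# Which record fields feed each vector type. Illustrative mapping only:
# the point is that each vector hashes a *disjoint-enough* input subset.
VECTOR_FIELDS = {
    "profile_vector": ["name", "headline", "summary"],
    "experience_vector": ["experience"],
    "skills_vector": ["skills", "designations", "certifications"],
    "general_vector": ["name", "headline", "summary", "experience", "skills"],
}


def cache_key(vector_type: str, record: dict, model: str = EMBEDDING_MODEL) -> str:
    """Key = model id + vector type + hash of only the fields feeding that vector."""
    subset = {f: record.get(f) for f in VECTOR_FIELDS[vector_type]}
    # Canonical JSON (sorted keys, no whitespace) so the hash is stable.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"emb:{model}:{vector_type}:{digest}"
```

With this shape, editing a biography churns the profile key but leaves the skills key untouched, which is the whole point of typing the vectors.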

(Operationally, this matches how other teams have versioned embedding artifacts by embedding model id into keys so cached vectors and indexed vectors don’t collide across model rollouts.) [https://www.uber.com/blog/evolution-and-scale-of-ubers-delivery-search-platform/]

Eviction: TTL

The Embedding Agent’s docstring is explicit: Redis caching with a 24hr TTL.

That TTL-based cache reduces API calls and cost by reusing embeddings for repeat inputs within the 24-hour window, and stale entries age out on their own instead of needing a separate cleanup job.
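The get-or-generate path looks roughly like this. The `InMemoryTTLCache` below stands in for Redis with the same `get`/`setex` shape; it is an illustration of the pattern, not the agent’s actual client code:

```python
import json
import time

EMBEDDING_TTL_SECONDS = 24 * 60 * 60  # 24hr TTL


class InMemoryTTLCache:
    """Stand-in for Redis exposing the same get/setex shape."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        return value if time.monotonic() < expires_at else None

    def setex(self, key, ttl, value):
        self._store[key] = (value, time.monotonic() + ttl)


def get_or_embed(cache, key, embed_fn):
    """Return the cached vector for `key`, or generate and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no API call
    vector = embed_fn()            # cache miss: pay for the model call once
    cache.setex(key, EMBEDDING_TTL_SECONDS, json.dumps(vector))
    return vector
```

Within the 24-hour window, repeat inputs hit the cache and never touch the API.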

The generator job: retries, validation, and DLQ routing

The embedding pipeline isn’t just “call OpenAI four times.” The production problem is: what happens when the input is malformed, too long, or produces the wrong dimensionality?

That’s why the generator job exists.

The indexed commit summary for app/jobs/embedding_generator.py is explicit:

  • “Generate embeddings with retry logic and DLQ routing.”
  • “Handles content length validation and dimension mismatch detection.”

Those are the failure modes that actually hurt you during backfills.

Why naive batch generation fails

If you bulk-generate embeddings without guardrails:

  • one poison record can crash the batch
  • transient API failures stall the pipeline
  • a dimension mismatch can silently corrupt your index

So the generator job is designed to be stubborn: validate early, retry transient failures, and route poison records to a dead-letter path.
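That stubbornness can be sketched as a generic retry wrapper. The exception names and signature here are hypothetical, not the generator’s real API; they illustrate the split between transient failures (retry with backoff and jitter) and poison records (route to the DLQ immediately):

```python
import random
import time


class TransientError(Exception):
    """Retryable failure, e.g. a rate limit or timeout."""


class PoisonRecord(Exception):
    """Unfixable input, e.g. over-length content or a bad schema."""


def run_with_retry(fn, record, dlq, max_attempts=4, base_delay=0.5):
    """Retry transient failures with exponential backoff + jitter;
    route poison records (and exhausted retries) to the DLQ."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(record)
        except PoisonRecord:
            dlq.append(record)      # no number of retries fixes bad input
            return None
        except TransientError:
            if attempt == max_attempts:
                dlq.append(record)  # give up on this record, keep the batch moving
                return None
            delay = base_delay * (2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))
```

The key property: one bad record costs you one DLQ entry, never the batch.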

(For practical guidance on designing retries, exponential backoff, idempotency, and DLQ handling in producer jobs, see this engineering writeup that walks through patterns used to keep large-scale pipelines moving while isolating bad records.) [https://softbuilds.medium.com/how-to-design-a-reliable-retry-system-with-backoff-dlqs-idempotency-82c8c001cfde]

Code shape (grounded, but not invented)

Here’s the interface sketch based on the generator’s documented behavior:

"""app/jobs/embedding_generator.py

Generates embeddings with retry logic and DLQ routing.
Handles content length validation and dimension mismatch detection.
"""

from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EmbeddingJobResult:
    ok: bool
    error: Optional[str] = None
    payload: Optional[Dict[str, Any]] = None


def embedding_generator_job(record: Dict[str, Any]) -> EmbeddingJobResult:
    """Run embedding generation for a single record.

    Real job behavior (from retrieved context):
    - validates content length
    - detects dimension mismatch
    - retries with exponential backoff + jitter
    - routes poison records to a DLQ

    This stub returns a structured result without inventing internals.
    """
    return EmbeddingJobResult(ok=False, error="stub — see full implementation in source")


if __name__ == "__main__":
    # Runnable placeholder demonstrating the contract.
    example = {"id": "record_123", "text": "..."}
    print(embedding_generator_job(example))

The thing I like about this job design—based on the summary—is that it treats “dimension mismatch” as a first-class failure. That’s the sort of bug that doesn’t crash loudly; it just makes search feel haunted.

Idempotent storage: multi-vector blobs, not scattered fields

Once you generate four vectors, you have a choice:

  • store them as separate rows/keys
  • store them as a single blob keyed by record id + model

The system has to support:

  • idempotent storage of multi-vector blobs
  • dimension mismatch detection (called out in the generator’s summary)

So I treat the multi-vector output as an atomic unit for persistence: all four vectors update together or not at all, which prevents partial updates (three vectors fresh, one stale) from quietly slipping into index rebuilds.

The limitation is obvious: blobs are less queryable. But embeddings are rarely queried directly; they’re read for indexing and search.
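The atomic-unit rule can be sketched as a guarded single-key write. The dict-backed store and helper names are illustrative, not the system’s actual persistence layer:

```python
import json

REQUIRED_VECTORS = {"profile_vector", "experience_vector",
                    "skills_vector", "general_vector"}


def blob_key(record_id: str, model: str) -> str:
    # Keyed by record id + model, so a model rollout writes new blobs
    # instead of overwriting old ones.
    return f"{record_id}:{model}"


def store_multivector_blob(store: dict, record_id: str, model: str,
                           vectors: dict) -> None:
    """Write all four vectors as one atomic unit, or nothing at all."""
    if set(vectors) != REQUIRED_VECTORS:
        raise ValueError(f"partial blob rejected: {sorted(vectors)}")
    # Single-key write: a reader never sees three fresh vectors and one stale one.
    store[blob_key(record_id, model)] = json.dumps(vectors)
```

Rejecting partial blobs at write time is what makes the reindex loop safe to rerun: it either reads a complete blob or nothing.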

Query-time composition: why four vectors don’t mean four searches

The platform includes a SearchAgent for semantic search with vector embeddings on Azure AI Search, and an Advanced Matching Engine that uses embeddings as one component of a multi-modal scoring algorithm.

Multi-vector embeddings affect query-time in one practical way: the search layer can choose which vector field to use depending on intent. A designation search hits skills_vector. A “find someone like this person” query hits profile_vector. A recruiter scanning for career trajectory hits experience_vector. The same indexed record serves all three without re-embedding.
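That routing decision can be sketched as a tiny dispatch function. The intent names and the mapping are illustrative, not the actual SearchAgent API:

```python
def vector_field_for_intent(intent: str) -> str:
    """Route a query intent to the single vector field it should search."""
    routing = {
        "designation": "skills_vector",      # e.g. "find CFPs in Texas"
        "similar_person": "profile_vector",  # "find someone like this person"
        "career_arc": "experience_vector",   # trajectory / tenure scans
    }
    # Anything else falls through to the broad-matching vector.
    return routing.get(intent, "general_vector")
```

One indexed record, three sharp query paths, plus a broad fallback; no re-embedding at query time.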

Backfill and reindex: hash-first delta sync

The backfill strategy is a hash-first delta sync: reconcile inconsistent embeddings by comparing content hashes before regenerating anything.

We know the embedding code imports hashlib and json, which are exactly the tools you use to compute stable content hashes.

The operational trick is simple:

  • compute a hash of the canonical input subset for each vector type
  • compare it to what you last stored
  • only regenerate the vectors whose hashes changed

That’s how you get smaller reindex windows without regenerating everything.
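The three steps above can be sketched as follows; the helper names are illustrative, and `record_subsets` is assumed to map each vector type to the canonical input subset that feeds it:

```python
import hashlib
import json


def content_hash(subset: dict) -> str:
    """Stable hash of a canonical JSON encoding of the input subset."""
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def vectors_to_regenerate(record_subsets: dict, stored_hashes: dict) -> list:
    """Compare per-vector input hashes; return only the stale vector types."""
    return [vtype for vtype, subset in record_subsets.items()
            if content_hash(subset) != stored_hashes.get(vtype)]
```

If only the skills fields changed, only `skills_vector` comes back stale, and that is the only model call the backfill pays for.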

Paired with the per-vector contract, that hash-first loop is what keeps backfills incremental instead of stop-the-world.

Model variants: swapping model IDs without wrecking caches

The Embedding Agent defines the model as a constant:

  • EMBEDDING_MODEL = "text-embedding-3-large"

That one string is more important than it looks.

If your cache keys include the model id, then trying a model variant becomes operationally safe:

  • new model id → new keys
  • old cache remains valid for the old model
  • you can A/B by selecting which model id to use

The mechanism—model id in the key—means model rollouts don’t invalidate existing caches or require stop-the-world reindexing.

Practical wins

A few outcomes are real and immediate from the grounded design:

  • Lower API cost due to caching: the Embedding Agent explicitly implements Redis caching with a 24hr TTL.
  • Smaller reindex windows: multi-vector + hash-first delta sync means you can regenerate only what changed.
  • Safer rollouts of model changes: model id is explicit (text-embedding-3-large), so it can be incorporated into keying and storage.

Nuances: where this design bites back

Typed vectors are not free.

  • You now have four ways to be wrong instead of one.
  • If you don’t keep the “input subset per vector” disciplined, you’ll churn caches and lose the whole point.
  • If your generator doesn’t treat dimension mismatch as fatal, you can corrupt your index quietly.

That’s why I like the generator job’s explicit focus on validation and poison routing: it’s admitting that production data is messy, and it’s building a system that keeps moving anyway.

Closing

Once I stopped treating embeddings as a single magical artifact and started treating them as typed, versioned, cacheable facts, the rest of the pipeline stopped fighting me. Cache invalidation became scoped. Reindexing became incremental. Model rollouts became safe. The embedding wasn't the hard part anymore—it was just four facts about a person, each one independently verifiable, each one independently replaceable. That's the design insight that made everything downstream simpler.


Published by HackerNoon on 2026/03/13