I didn’t switch to multi‑vector embeddings because it was trendy. I did it because a single pooled vector kept lying to my search. When you collapse a candidate into one embedding, you’re asking one point in space to simultaneously represent “career arc”, “licenses”, “skills”, and “general profile vibe”. In a recruitment dataset—where a single designation like CFP/CFA can be the gating factor—that pooling turns into a weird kind of blur: the vector is *kind of about everything*, which means it’s *not sharply about the thing you’re filtering on*.

So I built an Embedding Agent that generates **four parallel embeddings for multi‑vector search**:

- `profile_vector`: overall candidate profile embedding
- `experience_vector`: work experience and career history
- `skills_vector`: skills, designations, and certifications
- `general_vector`: general-purpose embedding for broad matching

That design choice immediately cascaded into engineering decisions people don’t talk about enough:

- **Cache key schema**: you can’t cache “the embedding” anymore—you cache *a typed embedding under a model*.
- **Index layout**: Azure AI Search needs to know which vector fields exist.
- **Query-time composition**: your search stack decides how to use one vs. many vectors.
- **Reindex/backfill strategy**: you need a way to reconcile inconsistent embeddings without redoing everything.

The rest of this post is the under-the-hood view of how I wired those pieces together in `embedding_agent.py` and the generator job that produces embeddings at scale.

## Key insight: typed vectors turn “one expensive truth” into “four cheap, composable facts”

The non-obvious win of multi‑vector isn’t better math. It’s better **control surfaces**.

With a single pooled vector, every change is global:

- New model? Regenerate everything.
- New field? Regenerate everything.
- Bad content? It poisons the one representation you have.

With typed vectors, I can treat each embedding like a separate instrument channel in a mix. If the “skills” channel is wrong, I don’t have to remix the whole song.

That’s why the Embedding Agent is explicit about its outputs:

- it generates **4 parallel embeddings**
- it uses **OpenAI `text-embedding-3-large`**
- it fixes the dimensionality at **3072**
- it implements **Redis caching with a 24hr TTL**

Those details aren’t just trivia—they drive how I designed the cache keys, storage blobs, and reindex loops.

(If you want a concrete precedent for decomposing profiles into section-specific vectors, LinkedIn has publicly described encoding structured profile sections separately—Summary, Experience, Education—to enable more granular matching in profile search.)
https://www.linkedin.com/blog/engineering/search/reimagining-linkedins-search-stack

## How it works: producer → cache → persistent blob → index rebuild

Here’s the shape of the pipeline I run:

- The **generator job** is the producer: it feeds records to the agent.
- The **agent** is responsible for *generating and caching multi-vector embeddings*—and for returning consistent shapes even when things go wrong.
- The **persistent store** holds the *multi-vector blob* so reindexing doesn’t need to call the model again.
- The **index rebuild** process reads those blobs and pushes them into the search index’s vector fields.
- The **hash-first delta sync** loop is how I keep backfills from turning into full regenerations: I compare hashes first, then only regenerate what’s actually different.

## The Embedding Agent: four vectors, one contract

The most important thing I did in `app/agents/embedding_agent.py` was make the contract boring.

It always returns the same *typed* structure, and it’s explicit about the model and dimensionality:

```python
"""
Embedding Agent - Specialized agent for generating and caching multi-vector embeddings.

Generates 4 parallel embeddings for multi-vector search:
- profile_vector: Overall candidate profile embedding
- experience_vector: Work experience and career history
- skills_vector: Skills, designations, and certifications
- general_vector: General-purpose embedding for broad matching

Uses OpenAI text-embedding-3-large (3072 dimensions) for high-quality embeddings.
Implements Redis caching with 24hr TTL to minimize API costs.
"""

import asyncio
import hashlib
import json
import logging
import os
import time
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

from openai import AsyncOpenAI

from .base import BaseAgent, AgentConfig, AgentResponse, AgentType

logger = logging.getLogger(__name__)

# Constants
EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_DIMENSIONS = 3072  # text-embedding-3-large native dimensions
# EMBEDDING_TTL_ ... (defined in the source file)
```

What surprised me here is how much reliability comes from naming. Once I had stable names like `skills_vector`, every downstream system stopped guessing.

A note on sourcing: this post is grounded in the indexed codebase and retrieved context. Where internals aren’t visible, I describe the contract and design intent rather than inventing implementation details. The contract here is clear enough to design everything else around it.
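To make that contract concrete, here’s a minimal sketch of what “four parallel embeddings, one typed result” can look like. Only the vector names, model id, and dimensionality come from the source; the `MultiVectorResult` dataclass, the function names, and the injected `embed` callable are my own illustration (the real agent calls `AsyncOpenAI`).

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List

EMBEDDING_MODEL = "text-embedding-3-large"  # from the agent's constants
EMBEDDING_DIMENSIONS = 3072                 # from the agent's constants


@dataclass
class MultiVectorResult:
    """Illustrative typed contract: four named vectors, always present."""
    profile_vector: List[float]
    experience_vector: List[float]
    skills_vector: List[float]
    general_vector: List[float]


async def embed_candidate(
    texts: Dict[str, str],
    embed: Callable[[str], Awaitable[List[float]]],
) -> MultiVectorResult:
    """Generate all four vectors in parallel from per-section input texts.

    `embed` stands in for the real OpenAI call so the shape of the
    contract is visible without network access.
    """
    keys = ["profile_vector", "experience_vector", "skills_vector", "general_vector"]
    vectors = await asyncio.gather(*(embed(texts[k]) for k in keys))
    return MultiVectorResult(**dict(zip(keys, vectors)))


async def fake_embed(text: str) -> List[float]:
    # Deterministic stand-in with the real dimensionality.
    return [float(len(text))] * EMBEDDING_DIMENSIONS


if __name__ == "__main__":
    result = asyncio.run(embed_candidate(
        {
            "profile_vector": "Senior advisor, 12 years in wealth management",
            "experience_vector": "2015-2024: portfolio manager",
            "skills_vector": "CFP, CFA Level II, estate planning",
            "general_vector": "Full candidate record text",
        },
        fake_embed,
    ))
    print(len(result.skills_vector))  # 3072
```

Because every caller receives the same four named fields, downstream code (cache writes, index pushes) never has to guess which vector it’s holding.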
## Why the naive approach fails

The naive approach is: “generate one embedding for the whole candidate record and call it a day.” That fails in two ways:

- **Semantic dilution**: credentials and licenses become a small part of a large text, so similarity isn’t sharp where it matters.
- **Operational coupling**: any change forces full regeneration.

Typed vectors fix both: I can query on “skills” when the user cares about designations, and I can regenerate only the vectors whose inputs changed.

### The tradeoff

Multi‑vector increases storage and index complexity. You now have four vectors to store, four to index, and decisions to make at query time. I accepted that trade because the system already has multiple search modes (vault candidates vs. jobs vs. notes vs. transcripts), and the cost of wrong matches is recruiter time.

## Cache key design: hash field subsets + model id

Once you stop thinking of “the embedding” as a single thing, you need a cache key schema that makes collisions hard and invalidation obvious.
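Here’s a minimal sketch of the key scheme I mean. The `hashlib`/`json` tooling, the model id, and the “hash only the fields that feed this vector” rule are from the source; the `emb:` prefix and delimiter layout are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict

EMBEDDING_MODEL = "text-embedding-3-large"
EMBEDDING_TTL_SECONDS = 24 * 60 * 60  # the agent's documented 24hr TTL


def cache_key(vector_type: str, fields: Dict[str, Any]) -> str:
    """Build a cache key from (model id, vector type, hash of input subset).

    Canonical JSON (sorted keys, tight separators) makes the hash stable
    across dict ordering; only the fields that feed this vector are hashed.
    """
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"emb:{EMBEDDING_MODEL}:{vector_type}:{digest}"


if __name__ == "__main__":
    # Field order doesn't matter; editing the bio never touches this key.
    k1 = cache_key("skills_vector", {"designations": ["CFP"], "skills": ["estate planning"]})
    k2 = cache_key("skills_vector", {"skills": ["estate planning"], "designations": ["CFP"]})
    print(k1 == k2)  # True
```

In the real system the resulting key would be written to Redis with the 24hr TTL (e.g. via `SETEX`), so repeat inputs inside the window never hit the model.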
The retrieved context tells us the Embedding Agent uses:

- `hashlib`
- `json`
- Redis caching with a 24hr TTL
- a model identifier: `text-embedding-3-large`

So the key design I anchored on is:

- include the **model id** in the key
- include the **vector type** (`profile_vector`, `experience_vector`, `skills_vector`, `general_vector`)
- hash the **canonical JSON** of the input subset that feeds that vector

The important engineering property is: the hash must be computed from *only the fields that matter for that vector*.

That’s what makes typed vectors worth it: the `skills_vector` key doesn’t churn when you edit a candidate’s biography, and the `experience_vector` key doesn’t churn when you add a new designation.

(Operationally, this matches how other teams have versioned embedding artifacts by embedding the model id into keys so cached vectors and indexed vectors don’t collide across model rollouts.)

https://www.uber.com/blog/evolution-and-scale-of-ubers-delivery-search-platform/

### Eviction: TTL

The Embedding Agent explicitly states *Redis caching with 24hr TTL*: a TTL-based cache that reduces API calls and cost by reusing embeddings for repeat inputs within the 24-hour window.

## The generator job: retries, validation, and DLQ routing

The embedding pipeline isn’t just “call OpenAI four times.” The production problem is: what happens when the input is malformed, too long, or produces the wrong dimensionality?
That’s why the generator job exists. The indexed commit summary for `app/jobs/embedding_generator.py` is explicit:

- “Generate embeddings with retry logic and DLQ routing.”
- “Handles content length validation and dimension mismatch detection.”

Those are the failure modes that actually hurt you during backfills.

### Why naive batch generation fails

If you bulk-generate embeddings without guardrails:

- one poison record can crash the batch
- transient API failures stall the pipeline
- a dimension mismatch can silently corrupt your index

So the generator job is designed to be stubborn: validate early, retry transient failures, and route poison records to a dead-letter path.

(For practical guidance on designing retries, exponential backoff, idempotency, and DLQ handling in producer jobs, see this engineering writeup that walks through patterns used to keep large-scale pipelines moving while isolating bad records.)

https://softbuilds.medium.com/how-to-design-a-reliable-retry-system-with-backoff-dlqs-idempotency-82c8c001cfde

### Code shape (grounded, but not invented)

Here’s the interface sketch based on the generator’s documented behavior:

```python
"""app/jobs/embedding_generator.py

Generates embeddings with retry logic and DLQ routing.
Handles content length validation and dimension mismatch detection.
"""

from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class EmbeddingJobResult:
    ok: bool
    error: Optional[str] = None
    payload: Optional[Dict[str, Any]] = None


def embedding_generator_job(record: Dict[str, Any]) -> EmbeddingJobResult:
    """Run embedding generation for a single record.

    Real job behavior (from retrieved context):
    - validates content length
    - detects dimension mismatch
    - retries with exponential backoff + jitter
    - routes poison records to a DLQ

    This stub returns a structured result without inventing internals.
    """
    return EmbeddingJobResult(ok=False, error="stub — see full implementation in source")


if __name__ == "__main__":
    # Runnable placeholder demonstrating the contract.
    example = {"id": "record_123", "text": "..."}
    print(embedding_generator_job(example))
```

The thing I like about this job design—based on the summary—is that it treats “dimension mismatch” as a first-class failure. That’s the sort of bug that doesn’t crash loudly; it just makes search feel haunted.
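To show what those documented behaviors amount to in practice, here’s a hedged sketch of the guardrail loop. The validation, dimension check, backoff-with-jitter, and DLQ routing are the behaviors the summary names; every identifier, the `MAX_CONTENT_CHARS` threshold, and the list-backed DLQ are my own stand-ins, not the job’s real internals.

```python
import random
import time
from typing import Any, Callable, Dict, List, Optional

EMBEDDING_DIMENSIONS = 3072  # expected output width, per the agent's constants
MAX_CONTENT_CHARS = 32_000   # assumed threshold; the real limit lives in the job


class TransientError(Exception):
    """Stand-in for retryable failures (timeouts, rate limits)."""


def generate_with_guardrails(
    record: Dict[str, Any],
    embed: Callable[[str], List[float]],
    dlq: List[Dict[str, Any]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
) -> Optional[List[float]]:
    """Validate early, retry transient failures with backoff + jitter,
    and route poison records to a dead-letter list."""
    text = record.get("text", "")
    if not text or len(text) > MAX_CONTENT_CHARS:   # content length validation
        dlq.append({**record, "reason": "content_length"})
        return None
    for attempt in range(max_attempts):
        try:
            vec = embed(text)
        except TransientError:
            # Exponential backoff with jitter: base, 2x, 4x ... plus noise.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
            continue
        if len(vec) != EMBEDDING_DIMENSIONS:        # dimension mismatch is fatal, not silent
            dlq.append({**record, "reason": "dimension_mismatch"})
            return None
        return vec
    dlq.append({**record, "reason": "retries_exhausted"})
    return None


if __name__ == "__main__":
    dlq: List[Dict[str, Any]] = []
    ok = generate_with_guardrails({"id": "a", "text": "CFP advisor"},
                                  lambda t: [0.0] * EMBEDDING_DIMENSIONS, dlq)
    bad = generate_with_guardrails({"id": "b", "text": "truncated model output"},
                                   lambda t: [0.0] * 10, dlq)
    print(ok is not None, bad is None, dlq[0]["reason"])  # True True dimension_mismatch
```

The key property is that a poison record ends up in the DLQ with a reason attached, and the batch keeps moving instead of crashing or silently indexing a malformed vector.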
## Idempotent storage: multi-vector blobs, not scattered fields

Once you generate four vectors, you have a choice:

- store them as separate rows/keys
- store them as a single blob keyed by record id + model

The system has to support:

- “idempotent storage of multi-vector blobs” (Author Direction)
- “dimension mismatch detection” (generator summary)

So the only safe claim I can make is architectural: I treat the multi-vector output as an atomic unit for persistence, because it prevents partial updates (three vectors updated, one stale) from quietly slipping into index rebuilds.

The limitation is obvious: blobs are less queryable. But embeddings are rarely queried directly; they’re read for indexing and search.

## Query-time composition: why four vectors don’t mean four searches

The platform includes a `SearchAgent` for semantic search with vector embeddings on Azure AI Search, and an Advanced Matching Engine that uses embeddings as one component of a multi-modal scoring algorithm.

Multi-vector embeddings affect query time in one practical way: the search layer can choose which vector field to use depending on intent. A designation search hits `skills_vector`. A “find someone like this person” query hits `profile_vector`. A recruiter scanning for career trajectory hits `experience_vector`. The same indexed record serves all three without re-embedding.

## Backfill and reindex: hash-first delta sync

The Author Direction calls out “hash-first delta sync” for reconciling inconsistent embeddings. We know the embedding code imports `hashlib` and `json`, which are exactly the tools you use to compute stable content hashes.
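Under that design, the reconciliation loop can be sketched like this. The hash-first comparison is the documented technique; the function names and the in-memory `stored_hashes` dict are hypothetical stand-ins for the persistent store.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def content_hash(fields: Dict[str, Any]) -> str:
    """Stable hash of the canonical JSON for one vector type's input subset."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def delta_sync(
    inputs: Dict[str, Dict[str, Any]],   # vector_type -> current input subset
    stored_hashes: Dict[str, str],       # vector_type -> hash at last generation
    regenerate: Callable[[str, Dict[str, Any]], List[float]],
) -> Dict[str, List[float]]:
    """Hash-first delta sync: compare hashes, regenerate only what changed."""
    refreshed: Dict[str, List[float]] = {}
    for vector_type, fields in inputs.items():
        h = content_hash(fields)
        if stored_hashes.get(vector_type) != h:
            refreshed[vector_type] = regenerate(vector_type, fields)
            stored_hashes[vector_type] = h   # record the new hash alongside the blob
    return refreshed


if __name__ == "__main__":
    calls: List[str] = []

    def fake_regen(vector_type: str, fields: Dict[str, Any]) -> List[float]:
        calls.append(vector_type)
        return [0.0]

    inputs = {
        "skills_vector": {"designations": ["CFP", "CFA"]},
        "experience_vector": {"history": ["2015-2024 portfolio manager"]},
    }
    stored = {
        "skills_vector": "stale-hash",
        "experience_vector": content_hash(inputs["experience_vector"]),
    }
    delta_sync(inputs, stored, fake_regen)
    print(calls)  # ['skills_vector']
```

Only the vector whose input subset changed gets regenerated; the matching hash on `experience_vector` means the backfill never touches it.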
The operational trick is simple:

- compute a hash of the canonical input subset for each vector type
- compare it to what you last stored
- only regenerate the vectors whose hashes changed

That’s how you get **smaller reindex windows** without regenerating everything. The presence of the `hashlib` and `json` imports, plus the multi-vector contract, is consistent with the system keeping backfills incremental this way.

## Model variants: swapping model IDs without wrecking caches

The Embedding Agent defines the model as a constant:

```python
EMBEDDING_MODEL = "text-embedding-3-large"
```

That one string is more important than it looks. If your cache keys include the model id, then trying a model variant becomes operationally safe:

- new model id → new keys
- old cache remains valid for the old model
- you can A/B by selecting which model id to use

The mechanism—model id in the key—means model rollouts don’t invalidate existing caches or require stop-the-world reindexing.

## Practical wins

A few outcomes are real and immediate from the grounded design:

- **Lower API cost due to caching**: the Embedding Agent explicitly implements Redis caching with a 24hr TTL.
- **Smaller reindex windows**: multi-vector + hash-first delta sync means you can regenerate only what changed.
- **Safer rollouts of model changes**: the model id is explicit (`text-embedding-3-large`), so it can be incorporated into keying and storage.

## Nuances: where this design bites back

Typed vectors are not free:

- You now have four ways to be wrong instead of one.
- If you don’t keep the “input subset per vector” disciplined, you’ll churn caches and lose the whole point.
- If your generator doesn’t treat dimension mismatch as fatal, you can corrupt your index quietly.

That’s why I like the generator job’s explicit focus on validation and poison routing: it admits that production data is messy, and it builds a system that keeps moving anyway.

## Closing

Once I stopped treating embeddings as a single magical artifact and started treating them as **typed, versioned, cacheable facts**, the rest of the pipeline stopped fighting me. Cache invalidation became scoped. Reindexing became incremental. Model rollouts became safe.

The embedding wasn't the hard part anymore—it was just four facts about a person, each one independently verifiable, each one independently replaceable. That's the design insight that made everything downstream simpler.