I built my voice assistant orchestration layer around a stubborn rule: **don't spend an LLM call on a problem that a regex can solve.**

That reads like common sense until you ship a voice-first UX. People tolerate a spinner on a website; they do not tolerate dead air in their ear. In voice, a two-second gap doesn't feel like "loading"---it feels like the system didn't hear you. So I stopped treating routing as an AI problem and started treating it as a latency budget problem.

At the front of the stack I put a `RouterAgent` whose only job is to make a routing decision fast---usually without a model call---by doing deterministic pattern matching and keyword detection. Only when the query is genuinely ambiguous do I let the router fall back to an LLM classifier. From there, an orchestrator fans out into specialized agents (location normalization, search/retrieval, Microsoft 365 operations, CRM operations), and then stitches the results back into a voice-optimized response. The system behaves less like a committee meeting and more like a good emergency dispatcher: triage first, specialists second, narration last.

(One analogy, once: the `RouterAgent` is my 911 operator. It doesn't perform surgery---it decides whether to send an ambulance, a fire truck, or both, and it does that quickly.)

## What went wrong first (the incident that forced the redesign)

My first version of the voice orchestrator did what most "smart" stacks do by default:

1. Send the raw transcript to an LLM to classify intent.
2. Based on the intent, call the relevant tool/agent.
3. Summarize the output for voice.

That sounds clean on paper.
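The model-first loop can be sketched with stub delays to show where the time goes. This is a sketch only: the 900ms/400ms sleeps stand in for network and model time; they are illustrative, not measurements from this system.

```python
import asyncio
import time

# Stub "LLM" and "tool" -- the sleeps are illustrative stand-ins for
# network + model latency, not real measurements.
async def classify_with_llm(transcript: str) -> str:
    await asyncio.sleep(0.9)  # classification tax paid on EVERY turn
    return "location_search"

async def run_tool(intent: str, transcript: str) -> str:
    await asyncio.sleep(0.4)  # the actual retrieval is often the cheap part
    return "There are 42 open tickets."

async def naive_turn(transcript: str) -> tuple[str, float]:
    start = time.perf_counter()
    intent = await classify_with_llm(transcript)  # blocks before any real work starts
    answer = await run_tool(intent, transcript)
    return answer, time.perf_counter() - start

answer, elapsed = asyncio.run(naive_turn("How many open tickets in the Northeast?"))
print(f"{elapsed:.1f}s")  # classification dominates the turn
```

Nothing here is wrong per se; the problem is that the classification step sits on the critical path of every single turn.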
In production, it created a failure mode that was both obvious and embarrassing.

### The symptom

We started seeing voice sessions where the assistant would pause long enough that users repeated themselves, which caused a cascade:

- Duplicate tool calls (same query twice)
- Conflicting state ("No, I meant the first one...")
- The second run often landing on a different interpretation because the context had changed

### The trigger query

The query that made this undeniable was the most boring one imaginable:

> "How many open tickets in the Northeast?"

That's not an "AI" query. It's a search count with a location constraint. But in the model-first router, that query still paid the full classification tax before anything useful happened.

### The numbers (before)

I pulled a sample of traces and bucketed the latency into three segments:

- `T_router`: time spent deciding what to do
- `T_tools`: time spent doing the work (search, M365, etc.)
- `T_voice`: formatting + response packing for the voice layer

On the model-first design, for "obvious intent" queries like the one above, `T_router` dominated. In a benchmark run against a representative sample of voice traffic, the median transcript length was 7--12 words and the "obvious" bucket (location + search/count keywords) made up the majority of voice queries. For that bucket:

- p50 end-to-end (router + tool + formatting): ~1.4s
- p95 end-to-end: ~2.7s
- The router/classification step alone was frequently 600--1200ms, and spiky under load.
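Percentiles like these don't need fancy tooling. A recorder along these lines is enough for per-segment buckets; this is a sketch, not the actual trace tooling, and the class name is illustrative.

```python
import bisect

class LatencyRecorder:
    """Keeps latency samples sorted and reports nearest-rank percentiles.
    Minimal sketch -- fine for thousands of samples per bucket."""

    def __init__(self) -> None:
        self.samples: list[float] = []

    def record(self, ms: float) -> None:
        bisect.insort(self.samples, ms)  # keep sorted on insert

    def percentile(self, p: float) -> float:
        if not self.samples:
            return 0.0
        # Nearest-rank (floor) index into the sorted samples
        idx = int(p / 100 * (len(self.samples) - 1))
        return self.samples[idx]

# Usage: one recorder per segment (T_router, T_tools, T_voice)
router_times = LatencyRecorder()
for ms in range(1, 101):  # fake samples: 1..100 ms
    router_times.record(float(ms))
print(router_times.percentile(50), router_times.percentile(95))
```

One recorder per segment makes the "which segment dominates" question a one-liner instead of a log-spelunking exercise.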
The user impact was clear in the session logs: when silence exceeded ~1.8s, the "repeat rate" (user re-asks within 4 seconds) jumped sharply. Once repeats start, everything downstream gets noisier.

### The wrong assumption

My wrong assumption was subtle: "Classification is cheap compared to the real work."

That's often true in text chat, because the "real work" might be multi-step tool calls. In voice, the "real work" for a big portion of queries is a single fast retrieval---meaning classification becomes the biggest line item.

### The fix

I inverted the cost structure:

- Route deterministically in the common case
- Only call a model when the router can't decide

And I made the router fast enough that it can run on every turn without being part of the problem.

### The numbers (after)

After the pattern-first router shipped, the same "obvious intent" bucket looked like this:

- p50 end-to-end: ~620ms
- p95 end-to-end: ~1.1s
- `T_router` p95: < 8ms for deterministic routes
- LLM fallback triggered on a small minority of queries, and those remained slower---which is fine, because that's exactly where I want to spend model latency.

That was the turning point: the system stopped feeling like it was "thinking" about everything and started feeling responsive.
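The inverted structure looks like this as a sketch. The regexes and the 400ms tool delay are illustrative, not the production rule set; the point is what is on the critical path.

```python
import asyncio
import re
import time

# Coarse illustrative patterns -- not the production rule set.
SEARCH = re.compile(r"\b(?:how many|tickets|orders|count)\b", re.IGNORECASE)
LOCATION = re.compile(r"\b(?:northeast|southeast|midwest|texas|new york)\b", re.IGNORECASE)

async def run_tool(transcript: str) -> str:
    await asyncio.sleep(0.4)  # illustrative retrieval latency
    return "There are 42 open tickets."

async def pattern_first_turn(transcript: str) -> tuple[str, float]:
    start = time.perf_counter()
    # Deterministic routing: no network call, microseconds of regex work.
    if LOCATION.search(transcript) and SEARCH.search(transcript):
        answer = await run_tool(transcript)
    else:
        answer = "ambiguous"  # only this branch would pay for a model call
    return answer, time.perf_counter() - start

answer, elapsed = asyncio.run(pattern_first_turn("How many open tickets in the Northeast?"))
print(f"{elapsed:.2f}s")  # the tool, not the router, now dominates the turn
```

The latency budget is now spent almost entirely on work the user asked for.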
## Key insight: routing is a latency budget problem

In my codebase, the RouterAgent is explicitly described as:

> Lightweight intent classifier that routes to specialized agents. Designed to be fast to minimize voice latency. Uses pattern matching and keyword detection, with LLM fallback for ambiguous queries.

The routing rules are intentionally boring:

- Location keywords (state names, cities, "in the Northeast") -> `LocationAgent` + `SearchAgent`
- Search keywords (tickets, orders, inventory) -> `SearchAgent`
- Email/calendar keywords -> `M365Agent`
- CRM keywords -> `CrmAgent`
- Ambiguous -> LLM classification

The naive design is "send everything to the model, ask what the user meant, then route." It's conceptually clean---and operationally expensive. The design that works in voice is:

- deterministic routing for the 80%
- a controlled escape hatch for the remaining ambiguity

## Architecture (as built)

The diagram is simple, but the important constraint is where complexity is allowed to live:

- The router stays cheap.
- The orchestrator coordinates.
- Specialized agents do the expensive work.

## The RouterAgent: deterministic first, model only when cornered

The `RouterAgent` prioritizes predictable routing over "understanding."
That's not an ideology; it's how you hit latency targets.

### What the latency numbers mean (and how I measured them)

When I say "fast," I'm not quoting a vibe. I'm talking about a specific measurement:

- **Metric:** routing decision latency (start of router -> intent decision produced)
- **Environment:** AWS Graviton instance (4 vCPU), Python 3.11, warm process, no network calls on deterministic paths
- **Workload:** 10,000 synthetic transcripts modeled on real traffic patterns (median 9 words, 90th percentile 18 words)
- **Reporting:** p50 / p95 / p99 using a simple histogram recorder

On deterministic routes, the router is basically:

- lowercase regex + keyword checks
- lightweight scoring

That's why the p95 stays in single-digit milliseconds. When the LLM fallback triggers, routing latency obviously increases because it becomes "network + model." That's fine; the fallback is not the critical path for the majority of turns.

### Why patterns don't collapse into a brittle rules engine

Pattern routing fails when you do either of these:

1. You pretend patterns can cover everything.
2. You keep piling on rules without a governance mechanism.
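A governance mechanism can be as small as counting routing outcomes. Here's a sketch; the class and method names are illustrative, not from the real codebase.

```python
from collections import Counter

class RoutingStats:
    """Counts routing outcomes so pattern tuning is data-driven.
    Sketch -- the class and method names are illustrative."""

    def __init__(self) -> None:
        self.outcomes: Counter = Counter()

    def observe(self, matched_rules: list[str]) -> str:
        if not matched_rules:
            outcome = "no-match"      # fell through to the LLM fallback
        elif len(matched_rules) > 1:
            outcome = "multi-match"   # precedence had to break a tie
        else:
            outcome = "match"
        self.outcomes[outcome] += 1
        return outcome

    def fallback_rate(self) -> float:
        total = sum(self.outcomes.values())
        return self.outcomes["no-match"] / total if total else 0.0

stats = RoutingStats()
stats.observe(["search"])
stats.observe(["search", "m365"])
stats.observe([])
print(stats.outcomes, stats.fallback_rate())
```

A rising fallback rate or a cluster of multi-matches is the signal that a pattern is worth adding or tightening.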
I avoided both by:

- Keeping the deterministic rules intentionally coarse (high precision, decent recall)
- Treating the fallback as a normal path for ambiguous queries
- Logging "no-match" and "multi-match" cases so I can tighten patterns when it's worth it

### A minimal, runnable RouterAgent implementation

```python
import asyncio
import re
import time
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(str, Enum):
    SEARCH = "search"
    LOCATION_SEARCH = "location_search"
    M365 = "m365"
    CRM = "crm"
    AMBIGUOUS = "ambiguous"


@dataclass
class RoutingDecision:
    intent: Intent
    agents: list[str]
    confidence: float
    reason: str
    used_fallback: bool = False
    latency_ms: float = 0.0


# --- Pattern definitions (coarse on purpose) ---
LOCATION_PATTERNS = [
    re.compile(r"\b(?:in|from|near|around)\s+[A-Z][a-z]+", re.IGNORECASE),
    re.compile(
        r"\b(?:northeast|southeast|midwest|west coast|east coast"
        r"|california|texas|new york|florida|illinois)\b",
        re.IGNORECASE,
    ),
]

SEARCH_KEYWORDS = re.compile(
    r"\b(?:how many|find|search|list|show|count|look up"
    r"|tickets|orders|inventory|items|records)\b",
    re.IGNORECASE,
)

M365_KEYWORDS = re.compile(
    r"\b(?:email|calendar|schedule|meeting|outlook|teams"
    r"|send mail|book a meeting|invite)\b",
    re.IGNORECASE,
)

CRM_KEYWORDS = re.compile(
    r"\b(?:crm|call log|deal|pipeline|contact list"
    r"|sales call|account notes)\b",
    re.IGNORECASE,
)


class RouterAgent:
    """Deterministic-first intent router with LLM fallback."""

    def __init__(self, llm_client=None, llm_timeout: float = 2.0):
        self.llm_client = llm_client
        self.llm_timeout = llm_timeout

    async def route(self, transcript: str) -> RoutingDecision:
        start = time.perf_counter()
        text = transcript.strip().lower()

        # --- deterministic pass ---
        has_location = any(p.search(transcript) for p in LOCATION_PATTERNS)
        has_search = bool(SEARCH_KEYWORDS.search(text))
        has_m365 = bool(M365_KEYWORDS.search(text))
        has_crm = bool(CRM_KEYWORDS.search(text))

        decision: Optional[RoutingDecision] = None
        if has_location and has_search:
            decision = RoutingDecision(
                intent=Intent.LOCATION_SEARCH,
                agents=["LocationAgent", "SearchAgent"],
                confidence=0.95,
                reason="location + search keywords detected",
            )
        elif has_search and not has_m365 and not has_crm:
            decision = RoutingDecision(
                intent=Intent.SEARCH,
                agents=["SearchAgent"],
                confidence=0.90,
                reason="search keywords detected",
            )
        elif has_m365 and not has_crm:
            decision = RoutingDecision(
                intent=Intent.M365,
                agents=["M365Agent"],
                confidence=0.90,
                reason="M365 keywords detected",
            )
        elif has_crm and not has_m365:
            decision = RoutingDecision(
                intent=Intent.CRM,
                agents=["CrmAgent"],
                confidence=0.90,
                reason="CRM keywords detected",
            )

        if decision is not None:
            decision.latency_ms = (time.perf_counter() - start) * 1000
            return decision

        # --- LLM fallback for genuinely ambiguous queries ---
        decision = await self._llm_classify(transcript)
        decision.latency_ms = (time.perf_counter() - start) * 1000
        return decision

    async def _llm_classify(self, transcript: str) -> RoutingDecision:
        """Call the LLM classifier. Returns a decision with used_fallback=True."""
        if self.llm_client is None:
            return RoutingDecision(
                intent=Intent.AMBIGUOUS,
                agents=["SearchAgent"],  # safe default
                confidence=0.30,
                reason="no LLM client configured; defaulting to search",
                used_fallback=True,
            )
        try:
            resp = await asyncio.wait_for(
                self.llm_client.classify(transcript),
                timeout=self.llm_timeout,
            )
            return RoutingDecision(
                intent=Intent(resp["intent"]),
                agents=resp["agents"],
                confidence=resp.get("confidence", 0.70),
                reason=f"LLM classification: {resp.get('reason', 'n/a')}",
                used_fallback=True,
            )
        except Exception as exc:  # includes asyncio.TimeoutError
            return RoutingDecision(
                intent=Intent.AMBIGUOUS,
                agents=["SearchAgent"],
                confidence=0.20,
                reason=f"LLM fallback failed ({type(exc).__name__}); defaulting",
                used_fallback=True,
            )
```

This snippet does three things I care about in production:

- makes deterministic decisions without network calls
- returns a structured decision (intent, agents, confidence, reason)
- clearly marks whether the fallback was used

That last point matters when you're tuning spend and latency: you can't control what you don't measure.

## The orchestrator: coordination, not intelligence

The orchestrator coordinates multiple agents; it is responsible for sequencing and merging---nothing more. In my implementation, that means:

- ask the router for a decision
- run one agent or a short chain
- normalize output into a voice-friendly response object
- enforce time budgets and failure behavior

A common anti-pattern is letting the orchestrator "think" (LLM plan generation, free-form tool selection) and then calling the tools.
For voice, that usually turns into extra steps, extra variance, and harder debugging.

### The orchestration pattern that shows up constantly: normalize -> retrieve

The most common multi-agent chain in my system is:

1. `LocationAgent` extracts and normalizes a constraint (state/city/region)
2. `SearchAgent` executes the search using the normalized constraint

This is not philosophical; it's practical:

- Location extraction is messy (synonyms, partials, ambiguous place names).
- Search should receive a clean constraint object, not raw transcript text.

When you keep these separate, each piece stays testable.

### A minimal, runnable orchestrator

```python
import asyncio
import logging
import time
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger(__name__)


@dataclass
class AgentResult:
    agent: str
    success: bool
    data: Any = None
    error: Optional[str] = None
    latency_ms: float = 0.0


@dataclass
class OrchestrationResult:
    results: list[AgentResult]
    text: str  # voice-ready summary
    total_ms: float = 0.0


class AgentOrchestrator:
    """Coordinates agents based on RouterAgent decisions.

    Runs chains (e.g. LocationAgent -> SearchAgent) sequentially,
    feeding each agent's output forward. Enforces per-agent timeouts;
    the overall voice-turn budget is enforced by the caller.
    """

    def __init__(
        self,
        agents: dict,  # name -> agent instance (each has async .run(query, context))
        router: "RouterAgent",
        agent_timeout: float = 3.0,
        total_timeout: float = 5.0,  # whole-turn budget (enforced by the voice layer)
    ):
        self.agents = agents
        self.router = router
        self.agent_timeout = agent_timeout
        self.total_timeout = total_timeout

    async def handle(self, transcript: str, context: dict | None = None) -> OrchestrationResult:
        wall_start = time.perf_counter()
        context = context or {}

        # 1. Route
        decision = await self.router.route(transcript)
        logger.info(
            "routed intent=%s agents=%s fallback=%s latency=%.1fms",
            decision.intent,
            decision.agents,
            decision.used_fallback,
            decision.latency_ms,
        )

        # 2. Execute agent chain
        agent_results: list[AgentResult] = []
        chain_context = {**context, "transcript": transcript}
        for agent_name in decision.agents:
            agent = self.agents.get(agent_name)
            if agent is None:
                agent_results.append(
                    AgentResult(agent=agent_name, success=False, error="agent not registered")
                )
                continue
            result = await self._run_agent(agent_name, agent, transcript, chain_context)
            agent_results.append(result)
            # Feed successful output forward so the next agent in the chain can use it
            if result.success and result.data is not None:
                chain_context[agent_name] = result.data

        # 3. Build voice-ready text
        text = self._build_response_text(agent_results, decision)
        total_ms = (time.perf_counter() - wall_start) * 1000
        return OrchestrationResult(results=agent_results, text=text, total_ms=total_ms)

    async def _run_agent(
        self, name: str, agent: Any, transcript: str, context: dict
    ) -> AgentResult:
        start = time.perf_counter()
        try:
            data = await asyncio.wait_for(
                agent.run(transcript, context),
                timeout=self.agent_timeout,
            )
            return AgentResult(
                agent=name,
                success=True,
                data=data,
                latency_ms=(time.perf_counter() - start) * 1000,
            )
        except asyncio.TimeoutError:
            logger.warning("%s timed out after %.1fs", name, self.agent_timeout)
            return AgentResult(
                agent=name,
                success=False,
                error="timeout",
                latency_ms=(time.perf_counter() - start) * 1000,
            )
        except Exception as exc:
            logger.exception("%s failed: %s", name, exc)
            return AgentResult(
                agent=name,
                success=False,
                error=str(exc),
                latency_ms=(time.perf_counter() - start) * 1000,
            )

    @staticmethod
    def _build_response_text(results: list[AgentResult], decision) -> str:
        """Merge agent outputs into a single voice-friendly string."""
        parts: list[str] = []
        for r in results:
            if r.success and r.data:
                # Each agent is expected to return a dict with a "summary" key
                summary = r.data.get("summary") if isinstance(r.data, dict) else str(r.data)
                if summary:
                    parts.append(summary)
            elif not r.success:
                parts.append(f"I wasn't able to complete the {r.agent} step.")
        return " ".join(parts) if parts else "I'm sorry, I couldn't find an answer for that."
```
In my real codebase, the orchestrator also enforces:

- timeouts per agent
- cancellation (if the voice turn times out, stop wasting tool calls)
- structured error mapping (so voice responses fail gracefully)

Those are not cosmetic details. They're how you stop a slow dependency from turning into a user-facing stall.

## VoiceAgentProcessor: the integration layer that makes voice feel stable

`VoiceAgentProcessor` is the bridge layer that serves the voice stack, so that voice routes don't have to know anything about routing rules, agent selection, or tool wiring. This layer exists because voice has non-negotiable requirements:

- A stable interface (`handle_voice_query`) regardless of how orchestration evolves
- Consistent formatting (short sentences, low ambiguity, no huge JSON dumps)
- Defensive handling for partial failures

Here's a complete, runnable skeleton that matches that contract.

```python
import asyncio
import hashlib
import logging
import time
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)

MAX_VOICE_CHARS = 300  # TTS engines get sluggish with long strings
VOICE_TIMEOUT = 4.5    # seconds -- hard ceiling for a single voice turn


@dataclass
class VoiceResponse:
    text: str
    success: bool
    latency_ms: float = 0.0
    truncated: bool = False


class VoiceAgentProcessor:
    """Bridge between the voice transport layer and the agent orchestrator.

    Callers use `handle_voice_query` and get back a VoiceResponse.
    Everything about routing, agent selection, and timeout enforcement
    is hidden behind this interface.
    """

    def __init__(self, orchestrator: "AgentOrchestrator", timeout: float = VOICE_TIMEOUT):
        self.orchestrator = orchestrator
        self.timeout = timeout

    async def handle_voice_query(
        self,
        transcript: str,
        user_id: Optional[str] = None,
        session_id: Optional[str] = None,
    ) -> VoiceResponse:
        """Main entry point for voice routes.

        Args:
            transcript: Raw speech-to-text output.
            user_id: Opaque, redacted user identifier (never an email).
            session_id: Voice session ID for log correlation.
        """
        start = time.perf_counter()
        context = {
            "user_id": self._redact_id(user_id) if user_id else "anon",
            "session_id": session_id or "unknown",
            "channel": "voice",
        }
        try:
            result = await asyncio.wait_for(
                self.orchestrator.handle(transcript, context),
                timeout=self.timeout,
            )
            text = self._format_for_voice(result.text)
            latency = (time.perf_counter() - start) * 1000
            return VoiceResponse(
                text=text,
                success=True,
                latency_ms=latency,
                truncated=len(result.text) > MAX_VOICE_CHARS,
            )
        except asyncio.TimeoutError:
            latency = (time.perf_counter() - start) * 1000
            logger.warning(
                "voice turn timed out after %.0fms session=%s", latency, session_id,
            )
            return VoiceResponse(
                text="I'm still working on that. Let me get back to you in a moment.",
                success=False,
                latency_ms=latency,
            )
        except Exception as exc:
            latency = (time.perf_counter() - start) * 1000
            logger.exception("voice query failed session=%s: %s", session_id, exc)
            return VoiceResponse(
                text="Something went wrong. Could you try asking again?",
                success=False,
                latency_ms=latency,
            )

    @staticmethod
    def _format_for_voice(text: str) -> str:
        """Trim and clean text for TTS output."""
        text = text.strip()
        if len(text) <= MAX_VOICE_CHARS:
            return text
        # Cut at the last sentence boundary within the limit
        truncated = text[:MAX_VOICE_CHARS]
        last_period = truncated.rfind(".")
        if last_period > MAX_VOICE_CHARS // 2:
            return truncated[: last_period + 1]
        return truncated.rstrip() + "..."

    @staticmethod
    def _redact_id(user_id: str) -> str:
        """One-way hash so logs never contain raw identifiers."""
        return hashlib.sha256(user_id.encode()).hexdigest()[:12]
```

A few details here are deliberate:

- the context carries a redacted identifier, not an email.
- `asyncio.wait_for` enforces a real voice turn budget.
- formatting trims long responses before they hit TTS.
That's the difference between "we have agents" and "this feels like a voice product."

## Why model-first routing fails in voice

If you skip the deterministic router and always ask a model "what should I do?", you pay for:

- a model call even when the intent is obvious
- unpredictable latency spikes (network, load, queueing)
- less determinism (harder to debug why a tool was selected)

And you also create a nasty operational blind spot: when users complain, you end up with hand-wavy explanations ("the model misrouted") instead of actionable ones ("pattern X didn't match because of phrasing Y"). My RouterAgent keeps the common cases boring. Boring is good.

## Nuances that matter once this is real

### 1) Deterministic routing needs observability, not just patterns

Patterns don't improve by hope. I log three routing outcomes:

- **match:** which rule fired, how long it took
- **no-match:** fell to LLM fallback (record the query shape, not PII)
- **multi-match:** two rules could apply; record and resolve with precedence

That gives me a backlog of "worth adding a pattern" candidates.

### 2) Keep the router's job small

The fastest router is the one that doesn't do extra work. I explicitly avoid:

- entity extraction beyond what's needed for routing
- summarization
- query rewriting

If the router starts doing "smart" stuff, it becomes the most-called agent and therefore the largest risk.

### 3) The fallback must be controlled

LLM fallback is a pressure valve, not a blank check.
I treat fallback like a real dependency:

- it has a timeout
- it has a default path on failure
- it's measured separately

That's how you prevent a rare ambiguous query from degrading the entire voice experience.

## Closing

The big change wasn't adding more intelligence. It was moving intelligence out of the critical path. Once I treated routing as a latency budget problem, the architecture snapped into something that actually works for voice: a deterministic RouterAgent that's fast enough to run every turn, an orchestrator that coordinates without improvising, and a VoiceAgentProcessor that enforces time budgets and response shape so the user hears something crisp instead of waiting through silence.