## Yet another “control plane”! Why?

Anyone who has built or managed platforms at scale is probably familiar with control planes, though maybe not in the context of AI. Typical control planes you might know:

- The Kubernetes control plane reconciles desired state and enforces policies.
- The service mesh control plane distributes traffic and telemetry configuration.
- The API gateway control plane manages auth, quotas, routing, and rate limits.

LLM/agentic/AI applications need the same idea because the “request” isn’t only an HTTP call. It’s a prompt, often with retrieved context, a chain of tool calls, and non-deterministic outputs, plus security risks like prompt injection and sensitive data exfiltration. So treat prompts, model calls, retrieval, and tools as first-class production components that are instrumented and governed like microservices and data.

## What is the AI control plane?

The AI control plane is a shared platform layer that sits on the execution path of your LLM workloads to provide:

- Observability: traces, metrics, and logs for model and tool calls
- Quality controls: evals, canaries, regression tests, drift detection
- Policy enforcement: data/PII rules, tool permissions, safety filters, schema checks
- Cost management: budgets, attribution, rate limits, token quotas, showback/chargeback
- Operational safety: circuit breakers for agent loops, caching, fallbacks

What it is not: your app logic. If your “control plane” contains app decision logic, it becomes a bottleneck. Keep it thin, consistent, and ubiquitous.

## Start with a request “envelope” (the minimum contract)

Similar to a standard header set for your microservices, define the metadata that every LLM interaction must carry.
Example header fields:

- trace_id / span_id (distributed tracing correlation)
- tenant_id and user_id (ideally hashed)
- prompt_id + prompt_version
- policy_profile (which rule set applies)
- model_route (provider + model + fallback strategy)
- data_classification (public/internal/confidential/restricted)
- budget_key (cost attribution label, e.g., team=payments, feature=claims_assistant)
- environment (dev/stage/prod) + release_version

Example header:

```json
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "acme-co",
  "user_id_hash": "u_8f3b…",
  "conversation_id": "conv_2026_01_18_001",
  "prompt_id": "support_reply",
  "prompt_version": "v17",
  "policy_profile": "support-prod",
  "model_route": "primary:gpt-4.x fallback:gpt-4o-mini",
  "data_classification": "confidential",
  "budget_key": "team:support feature:assistant",
  "env": "prod",
  "release_version": "2026.01.18"
}
```

This envelope becomes the shared key across traces, evals, incidents, and costs.

## Observability: instrument prompts like microservices

### Distributed tracing: every model call is a span, every tool call is a child span

An LLM app is a distributed system:

App > orchestrator/agent > retrieval > model > tool(s) > model > response

So instrument it with the same discipline:

- one root span for the user request
- nested spans for:
  - retrieval (vector DB/search/MCP)
  - model inference
  - tool invocations (CRM lookup, ticket creation, payments, etc.)
  - safety classifiers and validators

Use standard semantics where possible (e.g., the OpenTelemetry GenAI conventions), and propagate trace context headers for cross-service correlation.
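To make the correlation concrete, here is a minimal sketch of propagating the W3C trace context together with the envelope between services; the `X-AI-Envelope` header name and the attribute choices are illustrative assumptions, not a standard.

```python
# Sketch: carry traceparent/tracestate plus the envelope on every hop.
# The "X-AI-Envelope" header name is an illustrative choice, not a standard.
import json

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("ai.app")


def outbound_headers(envelope: dict) -> dict:
    """Build headers for a downstream call (orchestrator -> retrieval/tool service)."""
    headers = {"X-AI-Envelope": json.dumps(envelope)}
    inject(headers)  # adds traceparent/tracestate from the current span context
    return headers


def handle_inbound(headers: dict) -> None:
    """Continue the same trace inside the downstream service."""
    ctx = extract(headers)
    envelope = json.loads(headers.get("X-AI-Envelope", "{}"))
    with tracer.start_as_current_span("tool.handle_request", context=ctx) as span:
        span.set_attribute("tenant.id", envelope.get("tenant_id", "unknown"))
        span.set_attribute("prompt.id", envelope.get("prompt_id", "unknown"))
        # ...do the actual retrieval/tool work here...
```

Whether the envelope rides in one JSON header or in individual headers is a team choice; what matters is that every hop carries both the trace context and the envelope keys.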
### 1) Practical span attributes you’ll actually use

- gen_ai.operation.name (e.g., chat, embeddings)
- gen_ai.provider.name
- gen_ai.request.model
- gen_ai.conversation.id
- gen_ai.usage.input_tokens, gen_ai.usage.output_tokens
- gen_ai.request.temperature, gen_ai.request.max_tokens (when applicable)
- tool.name, tool.result_status, tool.latency_ms
- prompt.id, prompt.version
- policy.decision (allow/deny/redact)
- budget.key, budget.remaining

Note: content capture (full prompts/responses) is high risk; leave it off by default.

### Code example: wrapping an LLM call with OpenTelemetry (Python pseudocode)

```python
import time

from opentelemetry import trace, metrics

tracer = trace.get_tracer("ai.app")
meter = metrics.get_meter("ai.app")

token_usage = meter.create_histogram("gen_ai.client.token.usage", unit="{token}")
op_duration = meter.create_histogram("gen_ai.client.operation.duration", unit="s")


def call_llm(envelope, model, messages, temperature=0.2, max_tokens=600):
    start = time.perf_counter()
    provider = "openai"  # in real code, derive this from your routing config
    span_name = f"chat {model}"
    with tracer.start_as_current_span(span_name) as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", provider)
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.conversation.id", envelope["conversation_id"])
        span.set_attribute("prompt.id", envelope["prompt_id"])
        span.set_attribute("prompt.version", envelope["prompt_version"])
        span.set_attribute("budget.key", envelope["budget_key"])
        # DO NOT store raw messages/content by default.

        # llm_client is a placeholder for your provider SDK wrapper.
        resp = llm_client.chat(messages=messages, temperature=temperature, max_tokens=max_tokens)
        dur = time.perf_counter() - start

        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)

        # Metrics aligned with the GenAI conventions
        attrs = {
            "gen_ai.operation.name": "chat",
            "gen_ai.provider.name": provider,
            "gen_ai.request.model": model,
        }
        op_duration.record(dur, attributes=attrs)
        token_usage.record(resp.usage.input_tokens, attributes={**attrs, "gen_ai.token.type": "input"})
        token_usage.record(resp.usage.output_tokens, attributes={**attrs, "gen_ai.token.type": "output"})

        return resp.output_text
```
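A minimal usage sketch, reusing the example envelope from earlier; `llm_client` and the exact model string remain placeholders for whatever provider wrapper and routing you actually use.

```python
# Hypothetical wiring for the call_llm helper above; values mirror the example envelope.
envelope = {
    "conversation_id": "conv_2026_01_18_001",
    "prompt_id": "support_reply",
    "prompt_version": "v17",
    "model_route": "primary:gpt-4.x fallback:gpt-4o-mini",
    "budget_key": "team:support feature:assistant",
}

messages = [
    {"role": "system", "content": "You are a support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
]

answer = call_llm(envelope, model="gpt-4.x", messages=messages)
```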
### 2) Logging: avoid the raw text but keep the “why”

Traditional logs are usually safe because requests are mostly structured and expected. LLM logs are different. Prompts can contain:

- customer messages
- internal docs
- credentials pasted by users
- PII

So the control plane should support tiered logging:

**Tier A (always on, safe metadata)**

- prompt hash, prompt version
- model & parameters
- token counts, latency
- tool list & status
- policy decisions
- evaluation scores
- error codes

**Tier B (redacted samples, controlled by an incident workflow)**

- redacted prompt & response snippets
- tool arguments with sensitive fields masked
- captured only for selected tenants/users, time windows, or debugging sessions

**Tier C (never in central logs)**

- raw, unredacted content, unless you have explicit legal/infosec approval and strong access controls

### PII redaction: build a “PII firewall” (and don’t trust it blindly)

PII redaction should happen in two places:

1. Before the model call (don’t send what you shouldn’t)
2. Before persistence (don’t store what you shouldn’t)

Some tools (e.g., Microsoft Presidio) can detect and anonymize PII, but there is no guarantee they catch everything, so you still need layered protections.

A good PII redaction pipeline looks like this:

1. classify content (public/internal/confidential)
2. detect PII (patterns + NLP + allowlist/denylist)
3. apply a transformation:
   - redact ([redacted_email])
   - mask (j***@example.com)
   - tokenize/pseudonymize (stable per tenant if needed)
4. emit a redaction report into the trace:
   - which entities were found
   - which policy was applied
   - confidence scores
5. store only the redacted form in logs

Conceptual example:

```python
# Pseudocode only
entities = pii_detector.detect(text)
if policy.requires_redaction(entities):
    text = pii_detector.redact(text)
    span.set_attribute("policy.pii.redacted", True)
    span.set_attribute("policy.pii.types", [e.type for e in entities])
```
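As a more concrete sketch, here is one way to back that pseudocode with Microsoft Presidio (mentioned above); the entity list, thresholds, and `redact()` wiring are illustrative assumptions rather than a recommended configuration.

```python
# Sketch: Presidio-backed detection/anonymization. Entity list and policy wiring are illustrative.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


def redact(text: str, span=None) -> str:
    # Detect a subset of entity types; your policy profile decides the full list.
    results = analyzer.analyze(
        text=text,
        entities=["EMAIL_ADDRESS", "PHONE_NUMBER", "CREDIT_CARD", "PERSON"],
        language="en",
    )
    if not results:
        return text

    redacted = anonymizer.anonymize(text=text, analyzer_results=results).text

    # Emit the redaction report as trace attributes, never the raw values.
    if span is not None:
        span.set_attribute("policy.pii.redacted", True)
        span.set_attribute("policy.pii.types", sorted({r.entity_type for r in results}))
        span.set_attribute("policy.pii.min_confidence", min(r.score for r in results))
    return redacted
```

Run the same step at both points named above: before the model call and before persistence.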
## Policy enforcement: guardrails must be coded, not “a better prompt”

Prompts can help, but prompts are not enforcement. LLMs are inherently probabilistic, and prompts are suggestions to a probabilistic system.

A control plane enforces policy at policy decision points (PDPs) such as:

- Ingress: before the prompt enters the orchestrator
- Pre-model: before calling the LLM provider
- Tool execution: before invoking a sensitive integration
- Egress: before returning output to the user
- Persistence: before saving transcripts, embeddings, or traces

### An enterprise-grade set of policies worth enforcing

**Data policies**

- PII/PCI/PHI detection and transformation
- “no secrets to the model” (API keys, credentials)
- retrieval access control (RBAC/ABAC on documents)

**Tool policies**

- tool allowlists per role/tenant
- argument validation (schemas & allowlists)
- read vs. write separation (e.g., ticket “create” requires higher privilege than “search”)

**Output policies**

- JSON schema validation
- citation/grounding requirements for high-stakes answers
- safety filters and disallowed-content checks

## Rate limiting: tokens are the new performance currency

In microservices, we rate-limit requests. LLM apps must rate-limit:

- requests
- tokens
- tool calls
- agent steps
- wall-clock time per conversation

If you only rate-limit requests, an attacker (or a buggy agent) can still burn budget by generating huge outputs or looping tool calls.

You should enforce the following (a minimal limiter sketch follows this list):

- max tokens per minute per tenant
- max concurrent in-flight LLM calls per tenant
- max tool calls per request
- max agent steps
- max total tokens per conversation
- a circuit breaker on repeated failure modes (timeouts, tool errors, policy denials)
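A minimal sketch of a token-aware limiter, under simplifying assumptions: per-tenant counters live in process memory and the limits are constants, whereas in production the counters would sit in shared storage (e.g., Redis) and the limits would come from the policy profile.

```python
# Sketch: per-tenant token-per-minute limiting plus per-request hard caps.
import time
from collections import defaultdict, deque

TOKENS_PER_MINUTE = 50_000       # illustrative limits; load them from the policy profile
MAX_TOOL_CALLS_PER_REQUEST = 8
MAX_AGENT_STEPS = 12

_token_events = defaultdict(deque)  # tenant_id -> deque of (timestamp, tokens)


def allow_tokens(tenant_id: str, tokens: int) -> bool:
    """Sliding 60-second window over token usage for one tenant."""
    now = time.monotonic()
    window = _token_events[tenant_id]
    while window and now - window[0][0] > 60:
        window.popleft()
    if sum(t for _, t in window) + tokens > TOKENS_PER_MINUTE:
        return False
    window.append((now, tokens))
    return True


def check_step_budget(state: dict) -> None:
    """Hard caps that stop agent runaway inside a single request."""
    if state.get("tool_calls", 0) >= MAX_TOOL_CALLS_PER_REQUEST:
        raise RuntimeError("tool call budget exhausted")
    if state.get("agent_steps", 0) >= MAX_AGENT_STEPS:
        raise RuntimeError("agent step budget exhausted")
```

Deny decisions should show up in the same traces (policy.decision = deny) so cost and reliability reviews see them.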
## FinOps for LLM apps: LLM calls are expensive, so cost management isn’t optional

FinOps is an operating model that creates financial accountability through collaboration between engineering, finance, and business teams, and it is widely used in the cloud context. That definition applies just as well to LLM applications, because costs scale with:

- traffic volume
- prompt size (context window)
- tool loops
- model choice
- retries and fallbacks

### The control plane makes cost a first-class signal

If your telemetry captures budget_key, prompt_version, model, and token usage, you can do what mature cloud FinOps teams do (a cost-attribution sketch follows this list):

- showback: “who spent what?”
- unit cost: cost per ticket resolved, cost per claim processed, cost per onboarding
- budget guardrails: block or degrade when budgets hit thresholds
- cost anomaly alerts: “token usage per session is up 70% vs. baseline”
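A sketch of turning that telemetry into showback rows; the per-1K-token prices and the record shape below are placeholders, not real rates.

```python
# Sketch: cost attribution from token telemetry. Prices are placeholders, not real rates.
from collections import defaultdict

# USD per 1K tokens as (input, output); illustrative numbers only.
PRICES = {
    "gpt-4.x": (0.0025, 0.0100),
    "gpt-4o-mini": (0.0002, 0.0008),
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price


def showback(records: list[dict]) -> dict[str, float]:
    """Aggregate cost per budget_key from records like
    {"budget_key": ..., "model": ..., "input_tokens": ..., "output_tokens": ...}."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["budget_key"]] += call_cost(r["model"], r["input_tokens"], r["output_tokens"])
    return dict(totals)
```

Unit cost then falls out by dividing a budget key’s total by its business metric (tickets resolved, claims processed, and so on).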
### Low-effort cost wins that don’t compromise quality

- Prompt trimming: remove redundant system text; compress long instructions into stable IDs resolved server-side.
- Context discipline: tune retrieval top-k per query type; don’t fetch 20 docs “just in case.”
- Model routing: cheap model for classification/extraction; expensive model for synthesis.
- Caching: a semantic cache for repeated Q&A patterns, and a tool result cache (e.g., “current plan benefits”) with a TTL.
- Stop agent runaway: hard caps on steps/tool calls/tokens.

## Evals: the must-have quality gate for prompt & tool changes

Microservices use unit tests, contract tests, canaries, and SLOs. LLM apps need the same, plus content evaluation. Evaluations (evals) are explicit tests of LLM output against your expectations; they are essential for reliability, especially when you change models or prompt versions.

### A practical eval stack that works

**Level 1: prompt unit tests (fast, deterministic-ish)** (a minimal example follows this section)

- input -> expected format constraints (JSON schema, required keys)
- basic refusal behavior on unacceptable inputs
- tool selection rules (“should call get_policy_docs when the user asks about policy”)

**Level 2: golden-set regression suite (CI gate)**

- representative prompts and retrieved context
- rubric-based scoring (helpfulness, correctness, groundedness, policy compliance)
- pass/fail thresholds by slice (region, tenant, language, scenario)

**Level 3: shadow evaluation in production**

- sample live traffic (with privacy controls)
- evaluate outputs asynchronously
- detect drift and regressions without blocking requests

**Level 4: human-in-the-loop for high-stakes domains**

- annotation workflows for disputes
- periodic calibration of automated judges

Pro tip: treat eval datasets like code. Version them, review changes, and track coverage by scenario.
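A minimal Level 1 sketch: a pytest-style unit test that enforces format constraints on the support_reply output; the required keys, the canned response, and run_support_reply are illustrative stand-ins for a real call through call_llm.

```python
# Sketch: a Level 1 prompt unit test. Schema, canned output, and helper are illustrative.
import json

REQUIRED_KEYS = {"answer", "citations", "confidence"}


def run_support_reply(question: str) -> str:
    """Stand-in for a real call (e.g., call_llm with the support_reply prompt)."""
    # Canned output so the test structure runs as-is; replace with the real call in CI.
    return json.dumps({"answer": "To reset your password, ...", "citations": ["kb-123"], "confidence": 0.9})


def test_support_reply_returns_valid_json():
    raw = run_support_reply("How do I reset my password?")
    payload = json.loads(raw)                      # output must be parseable JSON
    assert REQUIRED_KEYS <= payload.keys()         # required keys are present
    assert isinstance(payload["citations"], list)  # citations must be a list
    assert 0.0 <= payload["confidence"] <= 1.0     # confidence is a bounded score
```

Level 2 and above reuse the same harness but swap the canned helper for real model calls, scored against the golden set.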
## Drift detection: what changes even when you don’t deploy

LLM systems drift for reasons that don’t show up in Git history:

- the provider silently changes model behavior
- your retrieval set changes
- tool APIs change shape
- user behavior changes (new season, new product, new fraud patterns)

Use standards such as NIST’s AI RMF (and its generative AI profile) as a reference point for thinking about operational risk over time, not just “does it work today?”

### Drift signals worth monitoring

- spike in tokens per response (often indicates prompt bloat or retrieval noise)
- increase in tool calls per request (agent loops or new ambiguity)
- rising refusal rate or policy blocks
- drop in groundedness/citation rate
- increase in format violations (JSON parsing errors)
- semantic shift in queries (topic embedding distribution drift)
- new error clusters (timeouts, rate limits, tool failures)

A good control plane turns drift into dashboards and alerts, not a vague “users say it feels worse.”

## Reference architecture: how the AI control plane fits

Here’s a vendor-neutral conceptual architecture that can be adopted: if you’re already managing microservices with OpenTelemetry, the AI control plane is a natural extension, just with additional guardrails. LangFuse is a good open-source framework to get started with.

## Conclusion: you can’t scale what you can’t explain

Most “LLM incidents” aren’t model bugs. They are LLM app bugs: missing trace context, unknown prompt versions, unmetered tool loops, or absent policy enforcement boundaries. An AI control plane doesn’t make a model more intelligent. What it does is make the overall system manageable and reliable, which is what lets teams run and scale LLM applications safely.