Every enterprise AI integration starts the same way. A developer writes `client.chat("gpt-4", ...)` somewhere in the codebase, ships it, and moves on.
Six months later, the organization is running 14 different AI-powered features across 6 services, each with hardcoded model references, scattered API keys, and zero visibility into which model is handling what data. The finance team wants cost controls.
Legal wants to know if customer PII is hitting external APIs. Platform engineering wants to consolidate providers. And every single change requires a code deploy.
This is the model selection problem, and most teams don't realize they have it until it's already expensive.
The solution isn't another wrapper library. It's a policy layer — a configuration-driven system that decouples what model to use from where the call happens, governed by rules that non-engineers can reason about.
## Why Hardcoded Model Selection Breaks at Scale
Before diving into architecture, it's worth understanding why direct model calls fail in enterprise environments. The failure modes are predictable.
**Provider lock-in accelerates silently.** When model identifiers are scattered across services, switching from OpenAI to Anthropic or a self-hosted model becomes a multi-sprint migration. Each integration point is a separate refactor. Teams that planned for "easy swapping" discover that prompt formats, token limits, and response structures differ enough to make swaps non-trivial even with an abstraction layer.

**Cost visibility disappears.** Without centralized routing, there's no single point where you can answer "how much are we spending on GPT-4 vs. Claude Sonnet vs. Llama for classification tasks?" Chargebacks to product teams become guesswork. Optimization is impossible without measurement.

**Compliance becomes a game of whack-a-mole.** GDPR, HIPAA, SOC 2, and internal data classification policies all impose constraints on where data can go. If model selection lives in application code, enforcing data routing policies means auditing every service. One missed integration and sensitive data flows to an external provider it shouldn't.

**Experimentation requires deployment.** Want to A/B test whether Claude Sonnet handles customer support summarization better than GPT-4o-mini? With hardcoded calls, that's a code change, a PR, a review, a deploy, and monitoring — for each experiment.
These aren't theoretical problems. They're the reality at any organization running AI at scale.
## The Architecture: A Policy Engine for Model Selection
The core idea is straightforward: introduce a layer between your application code and model providers that selects models based on declarative policies rather than imperative code.
Here's the high-level architecture:
```
┌─────────────────────────────────────────────────────┐
│                  Application Code                   │
│                                                     │
│  ai.complete(task="summarize", input=doc, meta={})  │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│               Policy Resolution Engine              │
│                                                     │
│  1. Match request to policy rules                   │
│  2. Evaluate constraints (cost, latency, compliance)│
│  3. Select model + provider                         │
│  4. Apply transformations (prompt template, params) │
└──────────────────────────┬──────────────────────────┘
                           │
             ┌─────────────┼─────────────┐
             ▼             ▼             ▼
        ┌─────────┐   ┌─────────┐  ┌───────────┐
        │ OpenAI  │   │Anthropic│  │ Self-Host │
        │ Adapter │   │ Adapter │  │  Adapter  │
        └─────────┘   └─────────┘  └───────────┘
```
The application code never references a specific model. It declares intent — what task it's performing, what data classification it's working with, what quality tier it needs. The policy engine resolves that intent to a concrete model, provider, and configuration.
## Designing the Policy Schema
The policy schema is the heart of the system. It needs to be expressive enough to capture real enterprise constraints while staying readable by non-engineers. YAML is the pragmatic choice — it's version-controllable, diff-friendly, and most ops teams already know it.
Here's a schema that covers the critical dimensions:
```yaml
# model-policies.yaml
version: "2.0"

defaults:
  max_latency_ms: 5000
  max_cost_per_request: 0.05
  fallback_strategy: "cascade"
  retry:
    max_attempts: 3
    backoff_multiplier: 1.5

data_classifications:
  public:
    allowed_providers: ["openai", "anthropic", "google", "self-hosted"]
  internal:
    allowed_providers: ["anthropic", "openai", "self-hosted"]
    require_encryption_at_rest: true
  confidential:
    allowed_providers: ["self-hosted", "anthropic"]
    require_encryption_at_rest: true
    require_dpa: true
  restricted:
    allowed_providers: ["self-hosted"]
    require_encryption_at_rest: true
    require_audit_log: true
    require_vpc: true

policies:
  - name: "customer-support-summarization"
    match:
      task: "summarize"
      domain: "customer-support"
      data_classification: "confidential"
    routing:
      primary:
        provider: "anthropic"
        model: "claude-sonnet-4-20250514"
        max_tokens: 1024
        temperature: 0.3
      fallback:
        - provider: "self-hosted"
          model: "llama-3.1-70b"
          max_tokens: 1024
          temperature: 0.3
    constraints:
      max_latency_ms: 3000
      max_input_tokens: 8000
      cost_tier: "standard"

  - name: "internal-code-review"
    match:
      task: "code-review"
      domain: "engineering"
      data_classification: "internal"
    routing:
      primary:
        provider: "anthropic"
        model: "claude-sonnet-4-20250514"
        temperature: 0.2
      fallback:
        - provider: "openai"
          model: "gpt-4o"
          temperature: 0.2
    constraints:
      max_latency_ms: 10000
      cost_tier: "premium"

  - name: "bulk-classification"
    match:
      task: "classify"
      priority: "batch"
      data_classification: "internal"
    routing:
      primary:
        provider: "openai"
        model: "gpt-4o-mini"
        temperature: 0.0
      fallback:
        - provider: "self-hosted"
          model: "llama-3.1-8b"
    constraints:
      max_cost_per_request: 0.002
      cost_tier: "economy"

  - name: "default-catch-all"
    match:
      task: "*"
      data_classification: "public"
    routing:
      primary:
        provider: "openai"
        model: "gpt-4o-mini"
    constraints:
      cost_tier: "economy"
```
Several design decisions here are worth calling out.
**Match rules use task + domain, not service names.** This is intentional. If you route based on which microservice is calling, you couple policies to your service topology. When services get refactored or merged, policies break. Routing based on task semantics ("summarize," "classify," "generate") and business domain ("customer-support," "engineering") is stable across architectural changes.

**Data classification drives provider constraints.** The `data_classifications` block acts as a hard gate. Even if a policy says to use OpenAI, if the request is tagged as `restricted`, the classification rules override it. This creates a layered enforcement model where compliance rules are structurally impossible to bypass through policy misconfiguration.

**Fallback chains are explicit.** Rather than retry the same model, the cascade strategy tries different providers. If Anthropic is down, the system falls back to the self-hosted Llama instance. This is both a reliability pattern and a cost optimization — fallback models can be cheaper alternatives that are acceptable for degraded operation.
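The classification hard gate described above is small enough to sketch directly. In this illustrative Python snippet (the function name and rule table are my own, mirroring the `data_classifications` block), the gate filters the resolved routing chain, so even a misconfigured policy cannot reach a forbidden provider:

```python
# Sketch of the classification hard gate (names are assumptions): even a
# matched policy cannot route to a provider its data class forbids.
CLASSIFICATION_RULES = {
    "public": {"openai", "anthropic", "google", "self-hosted"},
    "internal": {"anthropic", "openai", "self-hosted"},
    "confidential": {"self-hosted", "anthropic"},
    "restricted": {"self-hosted"},
}

def gate_routing(classification: str, routing_chain: list[dict]) -> list[dict]:
    """Drop any routing option whose provider the classification forbids."""
    allowed = CLASSIFICATION_RULES[classification]
    return [r for r in routing_chain if r["provider"] in allowed]

chain = [
    {"provider": "openai", "model": "gpt-4o-mini"},
    {"provider": "self-hosted", "model": "llama-3.1-8b"},
]
# A "restricted" request loses the external option before any call is made:
gate_routing("restricted", chain)  # → only the self-hosted entry survives
```

Because the gate operates on the routing chain itself, the enforcement happens after policy matching but before any provider is contacted.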
## The Policy Resolution Algorithm
The resolution engine needs to be fast (it's in the hot path of every AI call) and deterministic (same inputs always produce the same routing decision). Here's the logic:
```
RESOLVE(request):
  1. Extract metadata: task, domain, priority, data_classification
  2. Find matching policies (ordered by specificity)
  3. For each matched policy:
     a. Check data_classification constraints
     b. Evaluate cost/latency constraints against current metrics
     c. Check provider health status
     d. If all pass → return routing config
  4. If primary fails constraints → walk fallback chain
  5. If all fail → return default policy or reject
```
Specificity ordering matters. A request with task=summarize, domain=customer-support should match the customer-support-summarization policy over the default-catch-all. The resolution engine scores matches by the number of matched fields, with exact matches ranking above wildcards.
The critical implementation detail is step 2b: evaluating constraints against current metrics. This means the policy engine needs access to real-time (or near-real-time) data about provider latency, costs, and availability. This creates a feedback loop:
```
Request → Policy Engine → Provider → Response
                ↑                        │
                └── Metrics Collector ◄──┘
```
The metrics collector feeds back p50/p95 latency, error rates, and per-request costs. The policy engine uses these to make dynamic routing decisions. If Anthropic's p95 latency spikes above the policy's max_latency_ms, the engine automatically routes to the fallback without any human intervention.
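The dynamic-routing decision reduces to walking the primary-then-fallback chain against observed metrics. A minimal sketch (the structures and function name are assumptions) takes the first option whose current p95 latency is within the policy's budget:

```python
# Metrics-aware selection sketch (structures assumed): take the first
# routing option whose observed p95 latency is within the policy limit.
def pick_route(chain: list[dict], p95_latency_ms: dict, max_latency_ms: int):
    """chain: options in priority order; p95_latency_ms: provider -> ms."""
    for option in chain:
        observed = p95_latency_ms.get(option["provider"], float("inf"))
        if observed <= max_latency_ms:
            return option
    return None  # every option is over budget: reject or queue the request

chain = [
    {"provider": "anthropic", "model": "claude-sonnet-4-20250514"},
    {"provider": "self-hosted", "model": "llama-3.1-70b"},
]
# Anthropic's p95 has spiked past the 3000 ms policy limit, so the
# engine routes to the self-hosted fallback without human intervention:
pick_route(chain, {"anthropic": 4800, "self-hosted": 2100}, 3000)
```

In a real engine the metrics lookup would hit a rolling window rather than a static dict, but the decision logic is the same.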
## Implementing the Abstraction Layer
The application-facing API should be minimal. Developers shouldn't need to understand the policy system to use it. Here's what the interface looks like:
```python
from ai_gateway import complete, classify, summarize

# Option 1: Task-specific functions
result = summarize(
    input=document_text,
    domain="customer-support",
    data_classification="confidential",
    metadata={"ticket_id": "TICK-4521"},
)

# Option 2: Generic interface with task declaration
result = complete(
    task="summarize",
    input=document_text,
    context={
        "domain": "customer-support",
        "data_classification": "confidential",
    },
)

# The developer never specifies a model.
# The policy engine handles selection, fallback, retries, and logging.
```
Under the hood, the client library:

- Packages the request with its metadata
- Sends it to the policy resolution engine (either in-process or via a gateway service)
- Lets the engine resolve the policy, select the model, and format the provider-specific request
- Executes the call with retry/fallback logic
- Returns a normalized response regardless of which provider handled it
The response normalization is subtle but critical. Each provider returns slightly different structures. The gateway abstracts this so downstream code doesn't need provider-specific parsing.
```python
from dataclasses import dataclass

@dataclass
class AIResponse:
    content: str
    model_used: str      # Which model actually handled this
    provider: str        # Which provider
    latency_ms: int      # End-to-end latency
    input_tokens: int    # Normalized token counts
    output_tokens: int
    cost_usd: float      # Estimated cost
    policy_name: str     # Which policy was applied
    fallback_used: bool  # Whether primary failed
    request_id: str      # For tracing
```
This response structure gives observability teams everything they need. You can now answer questions like "what percentage of customer-support requests fell back to the self-hosted model last week?" directly from your metrics pipeline.
## Handling the Hard Parts
### Prompt Compatibility Across Models
Models aren't interchangeable. A prompt optimized for Claude might perform differently on GPT-4o or Llama. The policy system needs to handle this.
The cleanest approach is prompt templates keyed by model family:
```yaml
prompt_templates:
  summarize:
    anthropic:
      system: "You are a precise summarizer for customer support tickets. Extract the core issue, customer sentiment, and any action items."
      template: "Summarize the following support ticket:\n\n{input}\n\nProvide: 1) Core issue 2) Sentiment 3) Action items"
    openai:
      system: "Summarize customer support tickets concisely."
      template: "Ticket content:\n{input}\n\nSummarize with: core issue, sentiment, action items."
    self-hosted:
      template: "[INST] Summarize this support ticket. Include the core issue, customer sentiment, and action items.\n\n{input} [/INST]"
```
This adds maintenance overhead, but it's the honest solution. The alternative — pretending all models respond identically to the same prompt — leads to subtle quality degradation that's hard to debug.
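The rendering step itself is straightforward. A hypothetical sketch (the template table and function name are my own, using the template text from above): once the engine has picked a provider, it looks up the template for that model family and substitutes the input.

```python
# Hypothetical template rendering: look up the (task, provider) template
# and substitute the input text. Table contents mirror the YAML above.
PROMPT_TEMPLATES = {
    ("summarize", "openai"): {
        "system": "Summarize customer support tickets concisely.",
        "template": "Ticket content:\n{input}\n\nSummarize with: core issue, sentiment, action items.",
    },
    ("summarize", "self-hosted"): {
        "template": "[INST] Summarize this support ticket. Include the core "
                    "issue, customer sentiment, and action items.\n\n{input} [/INST]",
    },
}

def render_prompt(task: str, provider: str, input_text: str) -> dict:
    """Return the provider-specific prompt payload for this task."""
    entry = PROMPT_TEMPLATES[(task, provider)]
    rendered = {"prompt": entry["template"].format(input=input_text)}
    if "system" in entry:           # not every model family uses a system prompt
        rendered["system"] = entry["system"]
    return rendered

render_prompt("summarize", "self-hosted", "Order #88 never arrived.")
# The Llama-style template wraps the ticket in [INST] ... [/INST] tags.
```

The important property is that the call site passes only `task` and `input`; the provider-specific formatting stays inside the gateway.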
### Cost Budgets and Rate Limiting
Enterprise systems need spending guardrails. The policy layer is the natural place for this:
```yaml
budgets:
  global:
    daily_limit_usd: 5000
    alert_threshold: 0.8  # Alert at 80%
  per_team:
    engineering:
      monthly_limit_usd: 15000
    customer-support:
      monthly_limit_usd: 8000
    marketing:
      monthly_limit_usd: 3000
  per_policy:
    customer-support-summarization:
      daily_limit_usd: 500
      rate_limit: 1000/hour
```
When a budget is exhausted, the policy engine can either reject requests, downgrade to a cheaper model automatically, or queue requests for batch processing. The behavior is — you guessed it — configurable via policy.
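The downgrade behavior can be sketched as a check in front of the resolved route (all names here are assumptions): consult accumulated spend before routing, and fall back to an economy route instead of rejecting when the daily limit is exhausted.

```python
# Budget guardrail sketch (names assumed): downgrade to a cheaper route
# instead of rejecting once the daily spend limit is hit.
def apply_budget(route: dict, spend_today_usd: float,
                 daily_limit_usd: float, economy_route: dict) -> dict:
    if spend_today_usd < daily_limit_usd:
        return route          # within budget: keep the resolved route
    return economy_route      # over budget: policy says downgrade, not reject

primary = {"provider": "anthropic", "model": "claude-sonnet-4-20250514"}
economy = {"provider": "self-hosted", "model": "llama-3.1-8b"}
apply_budget(primary, 512.0, 500.0, economy)  # → the economy route
```

Rejecting or queueing would be alternative branches of the same check, selected by the policy's configured exhaustion behavior.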
### Model Versioning and Canary Deployments
When OpenAI releases a new model or Anthropic updates Claude, you don't want to switch all traffic instantly. The policy layer supports canary-style rollouts:
```yaml
policies:
  - name: "customer-support-summarization"
    match:
      task: "summarize"
      domain: "customer-support"
    routing:
      primary:
        provider: "anthropic"
        model: "claude-sonnet-4-20250514"
        weight: 90
      canary:
        provider: "anthropic"
        model: "claude-sonnet-4-20250514"  # newer version
        weight: 10
    evaluation:
      metrics: ["quality_score", "latency", "cost"]
      min_sample_size: 500
      auto_promote_threshold: 0.95
```
Ten percent of traffic goes to the canary. Once the system collects 500 samples and the quality score exceeds the threshold, it can auto-promote the canary to primary. This is the same pattern used for service deployments, applied to model versions.
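One way to implement the weighted split, not necessarily the only one, is a deterministic hash bucket (function name is an assumption): hashing the request id gives stable assignment, so retries of the same request always land in the same arm.

```python
# Deterministic canary split sketch: hash the request id into one of 100
# buckets, so the same request always lands in the same arm.
import hashlib

def pick_arm(request_id: str, primary_weight: int = 90) -> str:
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "primary" if bucket < primary_weight else "canary"

# Over many requests, the canary share converges toward the 10% weight:
arms = [pick_arm(f"req_{i}") for i in range(10_000)]
canary_share = arms.count("canary") / len(arms)
```

A plain weighted random choice works too; the hash split trades a little uniformity for repeatability, which makes canary results easier to debug.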
## The Governance Layer
For larger organizations, the policy files themselves need governance. Who can change routing for PII-handling tasks? Who approves adding a new provider?
This is where GitOps shines. Policy files live in a dedicated repository with branch protection:
```
ai-policies/
├── policies/
│   ├── customer-support.yaml
│   ├── engineering.yaml
│   ├── marketing.yaml
│   └── defaults.yaml
├── classifications/
│   └── data-classes.yaml
├── budgets/
│   └── team-budgets.yaml
├── tests/
│   ├── test_policy_resolution.py
│   └── test_classification_enforcement.py
└── CODEOWNERS
```
The CODEOWNERS file enforces review requirements:
```
# Data classification changes require security team approval
classifications/                @security-team

# Budget changes require finance approval
budgets/                        @finance-team @platform-team

# Policy changes require domain team + platform review
policies/customer-support.yaml  @cs-team @platform-team
policies/engineering.yaml       @eng-leads @platform-team
```
Changes are validated by CI before merge. Tests verify that no policy routes restricted data to external providers, that budget limits are within organizational bounds, and that every task has at least one valid routing path.
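The first of those CI checks can be sketched as a plain function over the parsed policy list (structures assumed to mirror the schema above): collect every restricted-data policy whose routing chain includes a non-self-hosted provider.

```python
# CI check sketch (structures assumed): no policy may route restricted
# data to anything but the self-hosted provider.
RESTRICTED_PROVIDERS = {"self-hosted"}

def violations(policies: list[dict]) -> list[str]:
    """Return 'policy-name: provider' for every forbidden routing option."""
    bad = []
    for policy in policies:
        if policy["match"].get("data_classification") != "restricted":
            continue
        routes = [policy["routing"]["primary"],
                  *policy["routing"].get("fallback", [])]
        for route in routes:
            if route["provider"] not in RESTRICTED_PROVIDERS:
                bad.append(f"{policy['name']}: {route['provider']}")
    return bad

policies = [{
    "name": "hr-doc-summarize",
    "match": {"task": "summarize", "data_classification": "restricted"},
    "routing": {"primary": {"provider": "openai", "model": "gpt-4o"}},
}]
violations(policies)  # → ['hr-doc-summarize: openai'], so CI fails the merge
```

Running this against every proposed policy change means a compliance violation is caught at review time, not in production.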
## Observability: The Payoff
Once all AI traffic flows through the policy layer, you get centralized observability for free. Every request generates a structured event:
```json
{
  "timestamp": "2026-02-16T14:30:00Z",
  "request_id": "req_abc123",
  "task": "summarize",
  "domain": "customer-support",
  "data_classification": "confidential",
  "policy_applied": "customer-support-summarization",
  "provider": "anthropic",
  "model": "claude-sonnet-4-20250514",
  "fallback_used": false,
  "input_tokens": 2847,
  "output_tokens": 312,
  "latency_ms": 1243,
  "estimated_cost_usd": 0.0089,
  "team": "cs-operations",
  "quality_score": null
}
```
From this, you can build dashboards that answer the questions executives actually ask: "How much are we spending on AI per team?" "Which tasks have the highest error rates?" "Are we compliant with our data handling policies?" "What would happen to our costs if we moved all classification tasks to the self-hosted model?"
That last question is particularly powerful. Because the policy layer is declarative, you can run simulations against historical traffic to model the impact of policy changes before deploying them.
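Such a simulation is just a replay of logged events under different pricing. In this sketch, field names come from the event schema above, while the function name and the self-hosted per-token rate are hypothetical:

```python
# Policy simulation sketch: replay logged events and re-price 'classify'
# traffic under a hypothetical self-hosted per-token rate.
def simulate_cost(events: list[dict],
                  self_hosted_rate_per_1k_tokens: float = 0.0004):
    """Return (actual_spend, spend_if_classify_went_self_hosted)."""
    actual = hypothetical = 0.0
    for e in events:
        actual += e["estimated_cost_usd"]
        if e["task"] == "classify":
            tokens = e["input_tokens"] + e["output_tokens"]
            hypothetical += tokens / 1000 * self_hosted_rate_per_1k_tokens
        else:
            hypothetical += e["estimated_cost_usd"]
    return actual, hypothetical

events = [
    {"task": "classify", "input_tokens": 900, "output_tokens": 100,
     "estimated_cost_usd": 0.0020},
    {"task": "summarize", "input_tokens": 2847, "output_tokens": 312,
     "estimated_cost_usd": 0.0089},
]
simulate_cost(events)  # compares actual spend against the what-if scenario
```

Because the routing decision is declarative, the same resolver code can be pointed at historical traffic instead of live requests, so the what-if numbers come from the exact logic that would run in production.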
## What This Is Not
This architecture is not a universal AI gateway product. It's a pattern for teams building AI into enterprise systems that need control and visibility without sacrificing development velocity.
It doesn't solve prompt engineering. It doesn't solve evaluation. It doesn't replace your ML platform. What it does is give you a single, auditable, policy-driven control plane for the question of "which model handles this request and why."
For teams running fewer than five AI-powered features on a single provider, this is overkill. For everyone else — especially in regulated industries, multi-provider environments, or organizations where AI costs are becoming a line item worth managing — this pattern pays for itself the first time someone asks, "Can we switch providers for this workload?" and the answer is a YAML change instead of a sprint.
## Getting Started
If you're building this from scratch, here's the pragmatic sequence:
**Start with the abstraction layer.** Get your application code calling `complete(task=..., domain=...)` instead of provider-specific clients, even if the "policy engine" is initially just a hardcoded mapping. This decoupling is the most valuable change and costs the least.

**Add the policy file next.** Replace the hardcoded mapping with YAML-driven configuration. Keep it simple — task-to-model mappings with a default fallback.

**Layer in data classification enforcement.** This is where you get compliance value. Hard-gate provider access based on data sensitivity.

**Build the metrics feedback loop.** Collect latency, cost, and error data per policy. Use it for reporting first, then for dynamic routing decisions.

**Add cost controls and canary support as you need them.** These are refinements, not foundations.
The goal isn't to build the full system on day one. It's to establish the pattern early so that every new AI integration automatically inherits routing policies, cost controls, compliance enforcement, and observability without any additional work from the feature team.
