LLMs are the first dependency most product teams ship where every request has a visible marginal cost. That changes the rules. A feature can be “working” and still be failing in production because it is quietly burning budget, retried into a spend spike, or expanding prompts until latency and cost both drift upward.
This post is a practical blueprint for keeping LLM costs predictable without turning the product into a stingy, low-quality experience.
Treat every LLM call as a billable dependency
In most services, dependencies cost latency and reliability. LLMs also cost money per request, which means you need per-route accounting from day one.
Attach each LLM request to a route such as support_reply, doc_search, meeting_summary, or policy_answer. Log the model, tokens in/out, latency, retries, and estimated cost. If you cannot answer “which route spent the money,” you cannot control it.
Cost is not finance work. It is an operational metric like error rate.
Put token budgets into the architecture (not into a spreadsheet)
Budgets need to exist where they can stop runaway behavior.
Define, per route:
- maximum input tokens (how much context you allow)
- maximum output tokens (how long the model can answer)
- maximum retries (especially on timeouts and 429s)
- maximum tool calls (if you use tools/function calling)
Budgets protect quality too. Unbounded prompts often reduce relevance by burying the signal under too much “maybe useful” context.
A helpful mindset is that a prompt is a packed suitcase. Every item needs a reason to be there.
Control context growth by retrieving smarter, not stuffing more
A common anti-pattern is “answers got worse, so we added more context.” That can increase cost and still fail, because more evidence is not the same as the right evidence.
A better pattern is:
- retrieve a slightly larger candidate set
- rerank to pick the best few chunks
- pass only top evidence into the generation step
This keeps token usage stable and improves answer relevance. It also prevents a subtle cost drift where context slowly expands over weeks as more documents enter the index.
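A minimal sketch of the retrieve-then-rerank pattern, assuming `search` and `score_relevance` are placeholder hooks you would wire to your own vector store and reranker:

```python
def select_evidence(query, search, score_relevance, candidates=20, keep=4):
    """Retrieve a wide candidate set, rerank it, and keep only the top few chunks."""
    chunks = search(query, limit=candidates)  # wide recall first
    ranked = sorted(chunks, key=lambda c: score_relevance(query, c), reverse=True)
    return ranked[:keep]  # token usage stays bounded by `keep`, not by index size
```

Because the generation step only ever sees `keep` chunks, the prompt stays the same size even as the index grows.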
Cache strategically and measure cache hit rate
Caching is one of the biggest cost levers that doesn’t require sacrificing quality, but only if you cache the right layer and key it correctly.
Three practical caches:
- retrieval results for repeated queries (short TTL)
- embeddings for repeated texts (long TTL)
- final answers for deterministic routes (short TTL, strict keys)
Key design matters. Include the prompt template version, model name, and any doc index version in cache keys. Otherwise you will serve an answer that was generated for yesterday’s prompt or last week’s index.
And make cache hit rate visible by route. If hit rate is near zero, you are paying complexity tax for no benefit.
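One way to build such a key, sketched with the standard library (the field names here are illustrative, not a fixed schema):

```python
import hashlib
import json

def cache_key(route, query, prompt_version, model, index_version):
    """Key the cache on everything that can invalidate an answer:
    prompt template version, model name, and doc index version."""
    payload = json.dumps(
        {"route": route, "q": query, "prompt_v": prompt_version,
         "model": model, "index_v": index_version},
        sort_keys=True,  # canonical ordering so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Bumping any of the versioned inputs produces a new key, so stale answers age out naturally instead of being served against a newer prompt or index.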
Route requests: not every task deserves the same model
Using one expensive model for every route is one of the fastest ways to lose cost control.
Route by task difficulty and risk:
- classification/extraction often works on smaller models
- summarization can often be cheaper than open-ended reasoning
- high-risk routes can justify stronger models plus stricter guardrails
You can also route by evidence strength. If retrieval confidence is low, a better user experience is often asking a clarifying question, not upgrading to a more expensive model to hallucinate more eloquently.
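The routing decision can be sketched in a few lines; the confidence threshold here is an assumption you would tune per route:

```python
def pick_route_action(risk, retrieval_confidence, threshold=0.4):
    """Route by risk and evidence strength. Weak evidence triggers a
    clarifying question instead of an upgrade to a bigger model."""
    if retrieval_confidence < threshold:
        return {"action": "clarify"}  # don't pay more to hallucinate better
    tier = "large" if risk == "high" else "small"
    return {"action": "generate", "tier": tier}
```

The key design choice is that low confidence changes the *action*, not the model tier.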
Design fallbacks that preserve trust (not just budgets)
Cost controls that degrade experience get bypassed. The goal is to preserve trust while keeping spend predictable.
Good fallbacks include:
- ask a clarifying question when evidence is weak
- produce a short, cited answer instead of a long narrative
- return extracted passages rather than a generated explanation
- degrade gracefully under rate limits (queue, partial response, or “try again with a narrower query”)
The best fallback is not an error. It is safe partial usefulness.
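The fallback list above can be expressed as a ladder that degrades usefulness rather than trust. This is a sketch; `generate` is a placeholder for your model call, and the messages are examples, not fixed copy:

```python
def answer_with_fallbacks(query, evidence, generate, rate_limited=False):
    """Walk a fallback ladder: generated answer -> extracted passages -> clarify -> retry."""
    if rate_limited:
        return {"mode": "retry",
                "text": "High load right now. Try again with a narrower query."}
    if not evidence:
        return {"mode": "clarify",
                "text": "Can you narrow the request so I can find the right documents?"}
    try:
        return {"mode": "generated", "text": generate(query, evidence)}
    except Exception:
        # generation failed: return the extracted passages verbatim
        return {"mode": "extractive", "text": "\n\n".join(evidence)}
```

Every branch returns something a user can act on, which is what keeps cost controls from being bypassed.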
Enforce per-route budgets and log cost
Here’s a small pattern you can adapt to most stacks: enforce budgets at runtime and emit structured usage logs for dashboards and alerts.
```python
PRICING = {
    "small": {"in_per_1k": 0.0002, "out_per_1k": 0.0008},
    "large": {"in_per_1k": 0.0025, "out_per_1k": 0.0100},
}

ROUTE_BUDGETS = {
    "support_reply":   {"tier": "small", "max_in": 2000, "max_out": 250, "max_retries": 1},
    "policy_answer":   {"tier": "large", "max_in": 3500, "max_out": 350, "max_retries": 2},
    "meeting_summary": {"tier": "small", "max_in": 4000, "max_out": 300, "max_retries": 1},
}

def est_cost(tier, tokens_in, tokens_out):
    p = PRICING[tier]
    return (tokens_in / 1000) * p["in_per_1k"] + (tokens_out / 1000) * p["out_per_1k"]

def run_llm(route, llm_call, tokens_in_estimate):
    b = ROUTE_BUDGETS[route]
    if tokens_in_estimate > b["max_in"]:
        return {"text": "Too much context. Please narrow the request.", "blocked": True}
    # your llm_call should return tokens_in/tokens_out from provider usage
    resp = llm_call(tier=b["tier"], max_tokens=b["max_out"], max_retries=b["max_retries"])
    cost = est_cost(b["tier"], resp["tokens_in"], resp["tokens_out"])
    print({
        "event": "llm_usage",
        "route": route,
        "tier": b["tier"],
        "tokens_in": resp["tokens_in"],
        "tokens_out": resp["tokens_out"],
        "cost_est_usd": round(cost, 6),
    })
    return {"text": resp["text"], "blocked": False}
```
Even if your initial pricing estimates are rough, you get two things immediately: you prevent obvious runaway prompts and you create a cost history you can analyze.
Closing: budgets are a product feature
If you ship LLM features without budgets, you are not shipping a feature. You are shipping an unbounded cost center.
Budgets force clarity: what evidence matters, what answer format is acceptable, and how the system behaves under uncertainty. The payoff is not only lower spend. It is a system that is easier to operate, easier to debug, and easier to trust.
