There's a growing industry selling you specialized LLM monitoring tools. Fancy dashboards showing prompt histories, evaluation scores, token counts with nice graphs.
I tried three of them. They all had the same problem: they existed in isolation from our actual infrastructure. When something went wrong, I'd be jumping between tabs, trying to correlate a spike in one system with logs in another, never quite sure if the timeline even matched up.
Then I stopped fighting it and just instrumented our LLM calls with OpenTelemetry, same as everything else. Turns out that was the answer all along.
Why Your Current Monitoring Stack Is Blind to LLM Failures
Here's the uncomfortable truth: traditional APM was designed for deterministic systems. You send a request, you get a predictable response. If the database query returns the wrong data, something is broken and you'll see it in your error rates.
LLMs don't work that way. The same prompt can generate different outputs every time. "Quality" isn't binary—it's a spectrum. And the failure modes are completely different from what our monitoring tools expect.
Think about it. When your LLM starts giving worse answers, what metric captures that? Latency? Still fine. Error rate? Zero—technically every request succeeded. Throughput? Unchanged. The Golden Signals that have guided site reliability for a decade are suddenly useless.
The knee-jerk reaction is to bolt on a separate "LLM monitoring" tool: Langfuse, LangSmith, whatever. And look, those tools have their place. But now you've got two disconnected systems: your infrastructure observability over here, your LLM evaluation metrics over there, and no clean way to correlate them when something goes wrong at 2am.
What we actually need is to extend our existing observability stack to understand the unique characteristics of AI workloads. That's where OpenTelemetry's GenAI semantic conventions come in.
OpenTelemetry Finally Has a Standard for AI Telemetry
If you've worked with OpenTelemetry before, you know about semantic conventions—the standardized attribute names that make telemetry portable across vendors. There are conventions for HTTP, databases, messaging systems.
Now there's one for generative AI. It's still marked "experimental," but it's already supported by most of the major instrumentation libraries, and it solves the interoperability problem that's been plaguing LLM observability.
The key attributes you need to know:
gen_ai.system — The provider (openai, anthropic, bedrock)
gen_ai.request.model — What you asked for (gpt-4o)
gen_ai.response.model — What actually responded (might be different due to aliasing)
gen_ai.usage.input_tokens — Prompt tokens consumed
gen_ai.usage.output_tokens — Completion tokens generated
gen_ai.response.finish_reasons — Why the model stopped, recorded as an array (this one's more important than you think)
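Pulled together, here's roughly the shape of a single chat span's attributes. The values are illustrative, not pulled from a real trace:

# Rough shape of an auto-instrumented chat span's attributes (illustrative values)
{
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.response.model": "gpt-4o-2024-08-06",
    "gen_ai.usage.input_tokens": 1243,
    "gen_ai.usage.output_tokens": 187,
    "gen_ai.response.finish_reasons": ["stop"],
}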
I'll come back to finish_reason later—it turned out to be the most useful thing on that list.
Getting Started: The Five-Minute Setup
Let's get something working. If you're using Python with the OpenAI SDK, instrumentation takes about thirty seconds:
pip install opentelemetry-instrumentation-openai
Then in your code:
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
OpenAIInstrumentor().instrument()
That's it. Every OpenAI call now generates spans with all the GenAI attributes. They'll flow through your existing collector to whatever backend you're using—Jaeger, Grafana Tempo, Honeycomb, doesn't matter.
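The one thing the instrumentor won't do for you is configure an exporter. Here's a minimal sketch of wiring the SDK to a collector on the default OTLP gRPC port; the service name is a placeholder, and it assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc are installed:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the SDK at your existing collector before calling instrument()
provider = TracerProvider(resource=Resource.create({"service.name": "chat-api"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317")))
trace.set_tracer_provider(provider)

If you run your service under opentelemetry-instrument, the OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT environment variables take care of this instead.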
Using LangChain? There's an instrumentor for that too:
pip install opentelemetry-instrumentation-langchain
from opentelemetry.instrumentation.langchain import LangchainInstrumentor
LangchainInstrumentor().instrument()
Now your entire chain—retrieval, prompt formatting, model calls, output parsing—shows up as a connected trace. When something's slow, you can see exactly which step is the bottleneck.
Latency Isn't a Single Number Anymore
Here's something that took me embarrassingly long to understand: for streaming LLM responses, traditional latency metrics are almost meaningless.
When a user sends a message to your chatbot, they don't care how long until the full response arrives. They care how long until they see the first word appear. That's "time to first token" (TTFT), and it's the metric that actually correlates with perceived responsiveness.
A 10-second response time is totally acceptable if the first token arrives in 200ms and words stream smoothly after that. But if nothing appears for 3 seconds? Your users will assume it's broken and start mashing the retry button.
The GenAI metrics spec defines this properly. You want to track:
Time to First Token (TTFT) — How long before the first chunk arrives
Time per Output Token (TPOT) — Average generation speed after the first token
End-to-End Duration — Total time, mostly useful for non-streaming calls
For our chatbot, I set SLOs of TTFT < 500ms at p95 and TPOT < 50ms. When TTFT degrades, it usually means the model provider is overloaded or we're sending too much context. When TPOT degrades, it's often a sign we need to look at our max_tokens settings.
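If your instrumentation library doesn't emit these metrics yet, you can measure them at the streaming loop yourself. A rough sketch against the OpenAI streaming API; the metric names are mine (loosely modeled on the spec's server-side metrics), and chunk count stands in for token count:

import time
from opentelemetry import metrics

meter = metrics.get_meter("llm.latency")
# Metric names here are illustrative, not taken from the semantic conventions
ttft = meter.create_histogram("gen_ai.client.time_to_first_token", unit="s")
tpot = meter.create_histogram("gen_ai.client.time_per_output_token", unit="s")

def stream_chat(client, **kwargs):
    # Wraps a streaming chat completion and records TTFT/TPOT while yielding chunks
    start = time.monotonic()
    first = None
    chunks = 0
    for chunk in client.chat.completions.create(stream=True, **kwargs):
        if first is None:
            first = time.monotonic()
            ttft.record(first - start)
        chunks += 1
        yield chunk
    if first is not None and chunks > 1:
        # Approximation: one streamed chunk is roughly one output token
        tpot.record((time.monotonic() - first) / (chunks - 1))

Use it wherever you currently iterate over the stream: for chunk in stream_chat(client, model="gpt-4o", messages=messages).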
The Quality Signal Hiding in Plain Sight
Remember that finish_reason attribute I mentioned? Let me tell you why I'm obsessed with it.
Every LLM response includes the reason generation stopped. The main values are:
stop — The model finished its thought naturally. This is what you want.
length — The model hit the token limit and got cut off mid-sentence. This is almost always bad.
content_filter — A safety filter triggered. Either your users are doing something sketchy or your prompts need work.
We built a dashboard that tracks the distribution of finish_reason over time. Our baseline was roughly 95% stop, 4% length, 1% content_filter.
One morning, the "length" percentage started climbing. Within a few hours it was at 15%. Turns out someone had updated a system prompt that added way more context, burning through our token budget before the model could finish responding. Users were getting truncated, useless answers—but every single metric except this one showed green.
We now alert when "length" exceeds 8% for more than 10 minutes. It's caught three issues since then.
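If you want the same signal without writing trace queries, a plain counter keyed by finish reason is easy to alert on. A minimal sketch assuming the OpenAI Python client; the metric and attribute names are made up:

from opentelemetry import metrics

meter = metrics.get_meter("llm.quality")
# Metric and attribute names are ours, not from the semantic conventions
finish_counter = meter.create_counter("llm.responses.by_finish_reason")

def record_finish_reasons(response):
    # response is a ChatCompletion; each choice reports why generation stopped
    for choice in response.choices:
        finish_counter.add(1, {"finish_reason": choice.finish_reason})

The alert is then a ratio query in whatever backend you already use: the share of "length" over all responses in a rolling window.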
Tokens Are the New Compute—Track Them Like It
Here's a conversation I've had with finance more times than I'd like: "Why did our AI spend go up 40% last month?"
If you can't answer that question quickly and accurately, you're going to have a bad time. Token costs are variable in a way that traditional compute isn't. One chatty user can burn through more budget than a thousand normal ones.
The solution is cost attribution at the observability layer. We add custom attributes to every span:
span.set_attribute("customer.id", customer_id)
span.set_attribute("feature.name", "support_chatbot")
span.set_attribute("customer.tier", "enterprise")
Combined with the token usage attributes, we can now slice cost by customer, by feature, by model—whatever dimension matters. When finance asks about that 40% spike, I can tell them: "The summarization feature for enterprise customers using GPT-4 doubled in usage because we onboarded Acme Corp."
Much better answer than "I don't know, AI is expensive."
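One way to make that slicing cheap at query time is to stamp an estimated cost onto the span itself. A sketch with placeholder prices; check your provider's current rates, and note the attribute name is made up for this example:

# Hypothetical per-1K-token prices in USD; substitute your provider's current rates
PRICES = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
}

def record_cost(span, model, input_tokens, output_tokens):
    price = PRICES.get(model)
    if price is None:
        return
    cost = (input_tokens / 1000) * price["input"] + (output_tokens / 1000) * price["output"]
    # "llm.cost.estimate_usd" is a custom attribute, not part of the GenAI conventions
    span.set_attribute("llm.cost.estimate_usd", cost)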
Tracing RAG Pipelines: Finding Where Quality Dies
If you're building anything serious with LLMs, you're probably doing retrieval-augmented generation. Documents go into a vector database, relevant chunks get retrieved, and those chunks become context for the model.
RAG pipelines fail in interesting ways. Sometimes the retrieval is fast but returns garbage. Sometimes it returns great documents but the model ignores them. Auto-instrumentation will show you the timing, but not the causal relationships.
This is where manual instrumentation pays off. Wrap your logical operations in spans:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("rag_workflow"):
    with tracer.start_as_current_span("retrieval") as retrieval_span:
        docs = vector_db.query(question)
        retrieval_span.set_attribute("docs.count", len(docs))
    # LLM call auto-instrumented as child span
    response = llm.generate(question, docs)
Now when you look at a trace, you see the whole picture: retrieval took 50ms and returned 3 documents, generation took 2 seconds. If quality is bad and retrieval returned zero documents, that's your culprit. If it returned 10 documents but the answer still sucks, maybe you're retrieving the wrong things.
A Quick Word on Privacy
Prompts often contain sensitive data. Before you enable full content capture, think about where that data is going.
The GenAI instrumentation libraries support capturing prompts and completions as span events, but it's opt-in for a reason. If you're shipping telemetry to a third-party backend, you probably don't want customer PII in there.
Our approach: enable capture in the SDK, then use the OTel Collector as a privacy firewall. A transform processor strips or hashes sensitive patterns before data leaves our network. This gives us debugging capability in development while maintaining compliance in production.
The Dashboard That Actually Matters
After months of iteration, here's what we actually look at daily:
Panel 1: TTFT p50/p95/p99 over time, overlaid with request rate. This tells you if latency changes are load-related.
Panel 2: Finish reason distribution. Pie chart updated hourly. Any shift from baseline triggers investigation.
Panel 3: Token cost by feature. Stacked bar chart showing where the money goes.
Panel 4: Error rate by model/provider. Helps identify when issues are on OpenAI's side vs ours.
That's it. Four panels. Everything else we tried ended up being noise.
Stop Buying Tools. Start Instrumenting.
The LLM observability market wants to sell you another dashboard. Another vendor. Another tab to keep open.
But the actual problem isn't a lack of tooling; it's a lack of integration. Your AI services need to be observable in the same way, using the same standards, flowing to the same backends as everything else you run.
OpenTelemetry gives you that. The semantic conventions are defined. The instrumentation libraries exist for every major LLM client and framework. The data flows through your existing collectors to whatever backend you already use.
Start small. Install the instrumentor for whatever LLM client you're using. Get the traces flowing. Add one dashboard with TTFT and finish_reason distribution. That alone will put you ahead of 90% of teams running AI in production.
Then iterate. Add cost attribution. Instrument your RAG pipeline. Set up alerts on quality signals.
You don't need another dashboard. You need observability that actually works.
