Introduction
LLM observability is harder than traditional APM because the unit of work isn't a request—it's a multi-turn conversation with non-deterministic outputs, retrieval steps, tool calls, and token-based pricing. Standard metrics (request rate, error rate, latency) are necessary but insufficient. This guide covers the instrumentation, metrics, and debugging workflows needed to run LLM apps in production.
The Three Layers of LLM Observability
LLM apps require instrumentation at three distinct layers:
Layer 1: Request Layer (Standard APM)
Traditional observability: HTTP requests, status codes, latency percentiles.
- Request rate (requests/sec)
- Error rate (4xx, 5xx)
- Latency (p50, p95, p99)
- Throughput (bytes/sec)
Tools: Datadog, New Relic, Prometheus
Layer 2: LLM Call Layer (Model Metrics)
Tracks individual LLM API calls with token-level granularity:
- Prompt tokens, completion tokens, total tokens
- Model name and version
- API latency (time to first token, total generation time)
- Cost per call (tokens × price/token)
- Error type (rate limit, timeout, invalid request)
Tools: LangSmith, Helicone, Braintrust, OpenLLMetry
Layer 3: Semantic Layer (LLM-Specific)
Tracks the quality and behavior of LLM outputs:
- Prompt template used
- Actual prompt sent (with variables filled)
- Full response text
- Tool calls made (function name, args, result)
- Retrieval context (chunks retrieved, scores)
- User feedback (thumbs up/down, corrections)
Tools: LangSmith, Arize, WhyLabs, custom logging
The Minimal Instrumentation Setup
Every LLM call should emit a structured log with these fields:
import time
import json
import logging
logger = logging.getLogger("llm_observability")
def log_llm_call(
    prompt: str,
    response: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    cost_usd: float,
    metadata: dict | None = None
):
    log_entry = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "metadata": metadata or {}
    }
    logger.info(json.dumps(log_entry))
Ship these logs to a data warehouse (BigQuery, Snowflake) or observability platform (Datadog, Splunk) for analysis.
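A minimal way to make these logs ingestible is to emit one JSON object per line to a file that a log shipper can tail. This sketch assumes a hypothetical `llm_calls.jsonl` path; the shipper (Fluent Bit, Vector, the Datadog agent, etc.) and destination are up to you:

```python
import json
import logging

# One JSON object per line ("JSON lines") so shippers can parse each
# record without any multi-line handling.
logger = logging.getLogger("llm_observability")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("llm_calls.jsonl")
handler.setFormatter(logging.Formatter("%(message)s"))  # message is already JSON
logger.addHandler(handler)

logger.info(json.dumps({"model": "gpt-4o", "total_tokens": 1234, "cost_usd": 0.01}))
```

From there, a warehouse loader or the observability platform's agent picks the file up unchanged.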
Cost Tracking and Forecasting
LLM costs are variable and can spike unexpectedly. Track at three granularities:
Per-Call Cost
PRICING = {
    "claude-3-5-sonnet-20250219": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000}
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    pricing = PRICING[model]
    return (prompt_tokens * pricing["input"]) + (completion_tokens * pricing["output"])
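As a sanity check on the arithmetic, a call with 1,000 prompt tokens and 500 completion tokens at the gpt-4o-mini rates above works out to:

```python
# gpt-4o-mini rates from the PRICING table: $0.15 / 1M input, $0.60 / 1M output
input_rate = 0.15 / 1_000_000
output_rate = 0.60 / 1_000_000

cost = 1_000 * input_rate + 500 * output_rate
print(f"${cost:.5f}")  # → $0.00045
```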
Per-User Cost
Track cumulative cost per user to identify abusers and set rate limits:
SELECT user_id, SUM(cost_usd) as total_cost
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY user_id
ORDER BY total_cost DESC
LIMIT 100
Set per-user budgets: free tier ($1/month), pro tier ($50/month).
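One way to enforce those budgets is a pre-call check against cumulative spend. A minimal in-memory sketch (a production version would read spend from the warehouse query above or a cache; the function names here are illustrative):

```python
from collections import defaultdict

# Monthly budget per tier, in USD (the illustrative numbers from above)
BUDGETS = {"free": 1.00, "pro": 50.00}

spend = defaultdict(float)  # user_id -> cumulative cost this month

def check_budget(user_id: str, tier: str, estimated_cost: float) -> bool:
    """Return True if the call fits within the user's remaining budget."""
    return spend[user_id] + estimated_cost <= BUDGETS[tier]

def record_spend(user_id: str, cost: float) -> None:
    spend[user_id] += cost

record_spend("u1", 0.95)
print(check_budget("u1", "free", 0.10))  # would exceed the $1 free tier → False
print(check_budget("u1", "free", 0.04))  # still within budget → True
```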
Per-Feature Cost
Attribute costs to product features to inform pricing:
SELECT feature, SUM(cost_usd) as total_cost, COUNT(*) as calls
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 7 DAY
GROUP BY feature
ORDER BY total_cost DESC
If "code generation" costs 10× more than "text summarization", price them differently.
Latency Optimization
LLM latency has two components: time to first token (TTFT) and tokens per second (TPS).
Time to First Token (TTFT)
Measures how long before the model starts generating. Dominated by:
- Prompt length (longer = slower)
- Model size (larger = slower)
- Server load (queuing time)
Optimization:
- Cache common prompts (exact match or semantic cache)
- Use smaller models for simple tasks (GPT-4o-mini vs GPT-4o)
- Parallelize independent LLM calls
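The exact-match variant of prompt caching is straightforward to sketch: hash the (model, prompt) pair and skip the API call on a hit. (Semantic caching, which matches on embedding similarity, is more involved.) The `generate` callable here stands in for whatever client you use:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, generate) -> str:
    """Return a cached response for an exact (model, prompt) match,
    calling generate(model, prompt) only on a miss."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(model, prompt)
    return _cache[key]

calls = 0
def fake_generate(model, prompt):
    global calls
    calls += 1
    return f"response to: {prompt}"

cached_completion("gpt-4o-mini", "hello", fake_generate)
cached_completion("gpt-4o-mini", "hello", fake_generate)  # served from cache
print(calls)  # → 1
```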
Tokens Per Second (TPS)
Measures generation speed. Determined by model architecture and hardware.
Benchmarks (approximate):
- GPT-4o: ~40 tokens/sec
- Claude 3.5 Sonnet: ~60 tokens/sec
- GPT-4o-mini: ~80 tokens/sec
- Llama 3 8B (self-hosted on A100): ~120 tokens/sec
Optimization:
- Stream responses to user (don't wait for full completion)
- Set max_tokens limits (stop generating at 500 tokens if that's enough)
- Use speculative decoding or batch inference for high-throughput scenarios
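Both TTFT and TPS can be derived from the same streamed response: TTFT is the gap before the first chunk arrives, and TPS is the remaining tokens over the remaining time. A sketch assuming a generic chunk iterator (real clients expose per-chunk token counts; here each chunk counts as one token):

```python
import time

def measure_stream(chunks) -> dict:
    """Consume a token-chunk iterator; return TTFT in ms and tokens/sec."""
    start = time.monotonic()
    first_token_at = None
    n_tokens = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first chunk arrived
        n_tokens += 1
    gen_time = max(time.monotonic() - first_token_at, 1e-9)
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tokens_per_sec": n_tokens / gen_time,
    }

# Simulated stream: a short delay to first token, then steady generation
def fake_stream():
    time.sleep(0.01)
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

stats = measure_stream(fake_stream())
```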
Latency Monitoring Query
SELECT
    model,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) as p50_latency,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95_latency,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99_latency
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
GROUP BY model
Prompt Monitoring
Prompts drift. Variables change. Templates get updated. Monitor:
Template Version Tracking
Tag each LLM call with the template version used:
log_llm_call(
    prompt=filled_prompt,
    response=response,
    metadata={
        "template_id": "code_review_v3",
        "template_version": "2026-03-15",
        "variables": {"language": "python", "file": "main.py"}
    }
)
If you roll out a new prompt template, compare metrics before/after:
SELECT template_version, AVG(user_satisfaction) as avg_satisfaction
FROM llm_logs
WHERE template_id = 'code_review'
GROUP BY template_version
Prompt Length Distribution
Plot the distribution of prompt lengths over time:
SELECT DATE(timestamp) as date, AVG(prompt_tokens) as avg_tokens
FROM llm_logs
GROUP BY date
ORDER BY date
If average prompt length increases 50% overnight, someone broke prompt caching or is leaking context.
Outlier Detection
Alert on prompts that are abnormally long or expensive:
SELECT prompt, prompt_tokens, cost_usd
FROM llm_logs
WHERE prompt_tokens > 10000 OR cost_usd > 1.00
ORDER BY cost_usd DESC
LIMIT 50
Investigate why a single call cost $5.
Tool Call Observability
When LLMs call tools (functions), log the tool invocation separately:
from typing import Any

def log_tool_call(
    tool_name: str,
    args: dict,
    result: Any,
    latency_ms: float,
    success: bool,
    error: str | None = None
):
    log_entry = {
        "timestamp": time.time(),
        "tool_name": tool_name,
        "args": args,
        "result": result,
        "latency_ms": latency_ms,
        "success": success,
        "error": error
    }
    logger.info(json.dumps(log_entry))
Tool Success Rate
SELECT tool_name, COUNT(*) as calls, AVG(CAST(success AS INT)) as success_rate
FROM tool_logs
GROUP BY tool_name
ORDER BY success_rate ASC
If "search_web" has a 40% success rate, the tool is broken or the LLM is calling it incorrectly.
Tool Latency
SELECT tool_name, AVG(latency_ms) as avg_latency
FROM tool_logs
GROUP BY tool_name
ORDER BY avg_latency DESC
If "fetch_user_data" takes 3 seconds on average, optimize the database query.
User Feedback Loop
Capture explicit feedback (thumbs up/down) and implicit signals (retry, edit, abandon):
from dataclasses import dataclass, asdict

@dataclass
class FeedbackEvent:
    user_id: str
    session_id: str
    message_id: str
    feedback_type: str  # "thumbs_up", "thumbs_down", "retry", "edit", "abandon"
    timestamp: float
    feedback_text: str | None = None  # optional user comment

def log_feedback(event: FeedbackEvent):
    logger.info(json.dumps(asdict(event)))
Satisfaction Rate by Feature
SELECT feature, AVG(CASE WHEN feedback_type = 'thumbs_up' THEN 1.0 ELSE 0.0 END) as satisfaction
FROM feedback_events
GROUP BY feature
ORDER BY satisfaction ASC
Correlate Feedback with Latency
Do slow responses get more thumbs down?
SELECT
    CASE WHEN latency_ms < 1000 THEN 'fast' WHEN latency_ms < 3000 THEN 'medium' ELSE 'slow' END as speed,
    AVG(CASE WHEN feedback_type = 'thumbs_up' THEN 1.0 ELSE 0.0 END) as satisfaction
FROM llm_logs JOIN feedback_events ON llm_logs.message_id = feedback_events.message_id
GROUP BY speed
Debugging Workflow
When users report "bad output", here's the investigation flow:
1. Find the Exact LLM Call
SELECT * FROM llm_logs WHERE session_id = 'abc123' ORDER BY timestamp
Get the full prompt, response, model, and metadata.
2. Replay the Prompt
Re-run the exact prompt against the same model:
response = llm.invoke(original_prompt, model=original_model)
Does it reproduce the issue? LLMs are stochastic, so set temperature=0 to minimize run-to-run variance (note that even temperature=0 does not guarantee bit-identical outputs).
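Because even temperature=0 replays can differ slightly, compare the replayed output against the logged one with a similarity ratio rather than strict equality. A sketch using only the standard library (the example strings are illustrative):

```python
import difflib

def response_similarity(original: str, replayed: str) -> float:
    """Return a 0-1 similarity ratio between logged and replayed responses."""
    return difflib.SequenceMatcher(None, original, replayed).ratio()

logged = "The function has an off-by-one error in the loop bound."
replay = "The function has an off-by-one error in its loop bound."

print(response_similarity(logged, logged))  # identical strings → 1.0
print(round(response_similarity(logged, replay), 2))  # high, but below 1.0
```

A ratio well below ~0.9 on a temperature=0 replay is a signal that the prompt, model version, or context changed between the original call and the replay.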
3. Check Retrieved Context (for RAG)
If this was a RAG call, inspect what was retrieved:
SELECT * FROM retrieval_logs WHERE session_id = 'abc123'
Were the retrieved chunks relevant? See the RAG evaluation section for diagnosis.
4. Check Tool Calls
SELECT * FROM tool_logs WHERE session_id = 'abc123' ORDER BY timestamp
Did any tools fail? Did the LLM pass malformed args?
5. Compare to Baselines
Run the same query against different models or prompt templates:
results = []
for model in ["gpt-4o", "claude-3-5-sonnet-20250219", "gpt-4o-mini"]:
    response = llm.invoke(prompt, model=model)
    results.append({"model": model, "response": response})
If GPT-4o works but Claude fails, it's a model-specific issue (prompt not robust).
Alerting Strategy
Set up these alerts:
Cost Spike
Alert if hourly LLM spend exceeds 3× the 7-day average:
IF current_hour_spend > 3 × avg_hourly_spend_7d THEN alert
Error Rate Spike
Alert if LLM error rate exceeds 5%:
IF (errors / total_calls) > 0.05 THEN alert
Latency Degradation
Alert if p95 latency exceeds SLA (e.g., 5 seconds):
IF p95_latency > 5000ms THEN alert
Prompt Drift
Alert if average prompt length changes >30%:
IF abs(current_avg_prompt_tokens - baseline_avg_prompt_tokens) / baseline_avg_prompt_tokens > 0.3 THEN alert
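The four rules above fit naturally into one periodic check over aggregated metrics. A minimal sketch, assuming the snapshot values come from queries like those shown earlier (the field names are illustrative):

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Apply the four alert rules to a snapshot of aggregated metrics."""
    alerts = []
    if metrics["hour_spend"] > 3 * metrics["avg_hourly_spend_7d"]:
        alerts.append("cost_spike")
    if metrics["errors"] / metrics["total_calls"] > 0.05:
        alerts.append("error_rate")
    if metrics["p95_latency_ms"] > 5000:
        alerts.append("latency_sla")
    baseline = metrics["baseline_avg_prompt_tokens"]
    drift = abs(metrics["avg_prompt_tokens"] - baseline) / baseline
    if drift > 0.3:
        alerts.append("prompt_drift")
    return alerts

snapshot = {
    "hour_spend": 42.0, "avg_hourly_spend_7d": 10.0,   # 4.2x the 7-day average
    "errors": 12, "total_calls": 1000,                 # 1.2% error rate
    "p95_latency_ms": 6200,                            # over the 5s SLA
    "avg_prompt_tokens": 900, "baseline_avg_prompt_tokens": 700,  # ~29% drift
}
print(evaluate_alerts(snapshot))  # → ['cost_spike', 'latency_sla']
```

Run it on a schedule (cron, or your observability platform's monitor evaluator) and route the returned alert names to your paging system.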
The Observability Stack
Recommended tools by use case:
| Need | Tool | Notes |
|---|---|---|
| LLM call tracing | LangSmith | Best for LangChain apps |
| Cost tracking | Helicone | Proxy-based, works with any provider |
| Model evaluation | Braintrust | A/B testing, prompt versioning |
| OpenTelemetry-native tracing | OpenLLMetry | Open-source, vendor-neutral |
| Production monitoring | Arize / WhyLabs | Drift detection, anomaly alerts |
Conclusion
LLM observability requires instrumenting three layers: standard request metrics, token-level LLM metrics, and semantic quality metrics. Log every LLM call with full context. Track costs per-user and per-feature. Monitor latency and set SLA alerts. Build feedback loops to correlate user satisfaction with technical metrics. And when debugging, always start with the exact prompt and response—everything else is derivative.