Introduction
LLM observability is harder than traditional APM because the unit of work isn't a request—it's a multi-turn conversation with non-deterministic outputs, retrieval steps, tool calls, and token-based pricing. Standard metrics (request rate, error rate, latency) are necessary but insufficient. This guide covers the instrumentation, metrics, and debugging workflows needed to run LLM apps in production.
The Three Layers of LLM Observability
LLM apps require instrumentation at three distinct layers:
Layer 1: Request Layer (Standard APM)
Traditional observability: HTTP requests, status codes, latency percentiles.
- Request rate (requests/sec)
- Error rate (4xx, 5xx)
- Latency (p50, p95, p99)
- Throughput (bytes/sec)
Tools: Datadog, New Relic, Prometheus
Layer 2: LLM Call Layer (Model Metrics)
Tracks individual LLM API calls with token-level granularity:
- Prompt tokens, completion tokens, total tokens
- Model name and version
- API latency (time to first token, total generation time)
- Cost per call (tokens × price/token)
- Error type (rate limit, timeout, invalid request)
Tools: LangSmith, Helicone, Braintrust, OpenLLMetry
Layer 3: Semantic Layer (LLM-Specific)
Tracks the quality and behavior of LLM outputs:
- Prompt template used
- Actual prompt sent (with variables filled)
- Full response text
- Tool calls made (function name, args, result)
- Retrieval context (chunks retrieved, scores)
- User feedback (thumbs up/down, corrections)
Tools: LangSmith, Arize, WhyLabs, custom logging
The Minimal Instrumentation Setup
Every LLM call should emit a structured log with these fields:
import time
import json
import logging
logger = logging.getLogger("llm_observability")
def log_llm_call(
    prompt: str,
    response: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    cost_usd: float,
    metadata: dict | None = None
):
    log_entry = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "response": response,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
        "metadata": metadata or {}
    }
    logger.info(json.dumps(log_entry))
Ship these logs to a data warehouse (BigQuery, Snowflake) or observability platform (Datadog, Splunk) for analysis.
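A minimal way to make these logs ingestible is to emit one JSON object per line to a file that a log shipper can tail. This sketch assumes a hypothetical `llm_calls.jsonl` path; the shipper (Fluent Bit, Vector, the Datadog agent, etc.) and destination are up to you:

```python
import json
import logging

# One JSON object per line ("JSON lines") so shippers can parse each
# record without any multi-line handling.
logger = logging.getLogger("llm_observability")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("llm_calls.jsonl")
handler.setFormatter(logging.Formatter("%(message)s"))  # message is already JSON
logger.addHandler(handler)

logger.info(json.dumps({"model": "gpt-4o", "total_tokens": 1234, "cost_usd": 0.01}))
```

From there, a warehouse loader or the observability platform's agent picks the file up unchanged.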
Cost Tracking and Forecasting
LLM costs are variable and can spike unexpectedly. Track at three granularities:
Per-Call Cost
PRICING = {
    "claude-3-5-sonnet-20250219": {"input": 3.00 / 1_000_000, "output": 15.00 / 1_000_000},
    "gpt-4o": {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000}
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    pricing = PRICING[model]
    return (prompt_tokens * pricing["input"]) + (completion_tokens * pricing["output"])
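As a sanity check on the arithmetic, a call with 1,000 prompt tokens and 500 completion tokens at the gpt-4o-mini rates above works out to:

```python
# gpt-4o-mini rates from the PRICING table: $0.15 / 1M input, $0.60 / 1M output
input_rate = 0.15 / 1_000_000
output_rate = 0.60 / 1_000_000

cost = 1_000 * input_rate + 500 * output_rate
print(f"${cost:.5f}")  # → $0.00045
```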
Per-User Cost
Track cumulative cost per user to identify abusers and set rate limits:
SELECT user_id, SUM(cost_usd) as total_cost
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY user_id
ORDER BY total_cost DESC
LIMIT 100
Set per-user budgets: free tier ($1/month), pro tier ($50/month).
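One way to enforce those budgets is a pre-call check against cumulative spend. A minimal in-memory sketch (a production version would read spend from the warehouse query above or a cache; the function names here are illustrative):

```python
from collections import defaultdict

# Monthly budget per tier, in USD (the illustrative numbers from above)
BUDGETS = {"free": 1.00, "pro": 50.00}

spend = defaultdict(float)  # user_id -> cumulative cost this month

def check_budget(user_id: str, tier: str, estimated_cost: float) -> bool:
    """Return True if the call fits within the user's remaining budget."""
    return spend[user_id] + estimated_cost <= BUDGETS[tier]

def record_spend(user_id: str, cost: float) -> None:
    spend[user_id] += cost

record_spend("u1", 0.95)
print(check_budget("u1", "free", 0.10))  # would exceed the $1 free tier → False
print(check_budget("u1", "free", 0.04))  # still within budget → True
```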
Per-Feature Cost
Attribute costs to product features to inform pricing:
SELECT feature, SUM(cost_usd) as total_cost, COUNT(*) as calls
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 7 DAY
GROUP BY feature
ORDER BY total_cost DESC
If "code generation" costs 10× more than "text summarization", price them differently.
Latency Optimization
LLM latency has two components: time to first token (TTFT) and tokens per second (TPS).
Time to First Token (TTFT)
Measures how long before the model starts generating. Dominated by:
- Prompt length (longer = slower)
- Model size (larger = slower)
- Server load (queuing time)
Optimization:
- Cache common prompts (exact match or semantic cache)
- Use smaller models for simple tasks (GPT-4o-mini vs GPT-4o)
- Parallelize independent LLM calls
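The exact-match variant of prompt caching is straightforward to sketch: hash the (model, prompt) pair and skip the API call on a hit. (Semantic caching, which matches on embedding similarity, is more involved.) The `generate` callable here stands in for whatever client you use:

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, generate) -> str:
    """Return a cached response for an exact (model, prompt) match,
    calling generate(model, prompt) only on a miss."""
    key = hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = generate(model, prompt)
    return _cache[key]

calls = 0
def fake_generate(model, prompt):
    global calls
    calls += 1
    return f"response to: {prompt}"

cached_completion("gpt-4o-mini", "hello", fake_generate)
cached_completion("gpt-4o-mini", "hello", fake_generate)  # served from cache
print(calls)  # → 1
```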
Tokens Per Second (TPS)
Measures generation speed. Determined by model architecture and hardware.
Benchmarks (approximate):
- GPT-4o: ~40 tokens/sec
- Claude 3.5 Sonnet: ~60 tokens/sec
- GPT-4o-mini: ~80 tokens/sec
- Llama 3 8B (self-hosted on A100): ~120 tokens/sec
Optimization:
- Stream responses to user (don't wait for full completion)
- Set max_tokens limits (stop generating at 500 tokens if that's enough)
- Use speculative decoding or batch inference for high-throughput scenarios
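Both TTFT and TPS can be derived from the same streamed response: TTFT is the gap before the first chunk arrives, and TPS is the remaining tokens over the remaining time. A sketch assuming a generic chunk iterator (real clients expose per-chunk token counts; here each chunk counts as one token):

```python
import time

def measure_stream(chunks) -> dict:
    """Consume a token-chunk iterator; return TTFT in ms and tokens/sec."""
    start = time.monotonic()
    first_token_at = None
    n_tokens = 0
    for _ in chunks:
        if first_token_at is None:
            first_token_at = time.monotonic()  # first chunk arrived
        n_tokens += 1
    gen_time = max(time.monotonic() - first_token_at, 1e-9)
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tokens_per_sec": n_tokens / gen_time,
    }

# Simulated stream: a short delay to first token, then steady generation
def fake_stream():
    time.sleep(0.01)
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"

stats = measure_stream(fake_stream())
```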
Latency Monitoring Query
SELECT
    model,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY latency_ms) as p50_latency,
    percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95_latency,
    percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99_latency
FROM llm_logs
WHERE timestamp > NOW() - INTERVAL 1 DAY
GROUP BY model
Prompt Monitoring
Prompts drift. Variables change. Templates get updated. Monitor:
Template Version Tracking
Tag each LLM call with the template version used:
log_llm_call(
    prompt=filled_prompt,
    response=response,
    metadata={
        "template_id": "code_review_v3",
        "template_version": "2026-03-15",
        "variables": {"language": "python", "file": "main.py"}
    }
)
If you roll out a new prompt template, compare metrics before/after:
SELECT template_version, AVG(user_satisfaction) as avg_satisfaction
FROM llm_logs
WHERE template_id = 'code_review'
GROUP BY template_version
Prompt Length Distribution
Plot the distribution of prompt lengths over time:
SELECT DATE(timestamp) as date, AVG(prompt_tokens) as avg_tokens
FROM llm_logs
GROUP BY date
ORDER BY date
If average prompt length increases 50% overnight, someone broke prompt caching or is leaking context.
Outlier Detection
Alert on prompts that are abnormally long or expensive:
SELECT prompt, prompt_tokens, cost_usd
FROM llm_logs
WHERE prompt_tokens > 10000 OR cost_usd > 1.00
ORDER BY cost_usd DESC
LIMIT 50
Investigate why a single call cost $5.
Tool Call Observability
When LLMs call tools (functions), log the tool invocation separately:
from typing import Any

def log_tool_call(
    tool_name: str,
    args: dict,
    result: Any,
    latency_ms: float,
    success: bool,
    error: str | None = None
):
    log_entry = {
        "timestamp": time.time(),
        "tool_name": tool_name,
        "args": args,
        "result": result,
        "latency_ms": latency_ms,
        "success": success,
        "error": error
    }
    logger.info(json.dumps(log_entry))
Tool Success Rate
SELECT tool_name, COUNT(*) as calls, AVG(CAST(success AS INT)) as success_rate
FROM tool_logs
GROUP BY tool_name
ORDER BY success_rate ASC
If "search_web" has a 40% success rate, the tool is broken or the LLM is calling it incorrectly.
Tool Latency
SELECT tool_name, AVG(latency_ms) as avg_latency
FROM tool_logs
GROUP BY tool_name
ORDER BY avg_latency DESC
If "fetch_user_data" takes 3 seconds on average, optimize the database query.
User Feedback Loop
Capture explicit feedback (thumbs up/down) and implicit signals (retry, edit, abandon):
from dataclasses import dataclass, asdict

@dataclass
class FeedbackEvent:
    user_id: str
    session_id: str
    message_id: str
    feedback_type: str  # "thumbs_up", "thumbs_down", "retry", "edit", "abandon"
    timestamp: float
    feedback_text: str | None = None  # optional user comment

def log_feedback(event: FeedbackEvent):
    logger.info(json.dumps(asdict(event)))
Satisfaction Rate by Feature
SELECT feature, AVG(CASE WHEN feedback_type = 'thumbs_up' THEN 1.0 ELSE 0.0 END) as satisfaction
FROM feedback_events
GROUP BY feature
ORDER BY satisfaction ASC
Correlate Feedback with Latency
Do slow responses get more thumbs down?
SELECT
    CASE WHEN latency_ms < 1000 THEN 'fast' WHEN latency_ms < 3000 THEN 'medium' ELSE 'slow' END as speed,
    AVG(CASE WHEN feedback_type = 'thumbs_up' THEN 1.0 ELSE 0.0 END) as satisfaction
FROM llm_logs JOIN feedback_events ON llm_logs.message_id = feedback_events.message_id
GROUP BY speed
Debugging Workflow
When users report "bad output", here's the investigation flow:
1. Find the Exact LLM Call
SELECT * FROM llm_logs WHERE session_id = 'abc123' ORDER BY timestamp
Get the full prompt, response, model, and metadata.
2. Replay the Prompt
Re-run the exact prompt against the same model:
response = llm.invoke(original_prompt, model=original_model)
Does it reproduce the issue? LLMs are stochastic, so set temperature=0 to minimize run-to-run variance (note that even temperature=0 does not guarantee bit-identical outputs).
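Because even temperature=0 replays can differ slightly, compare the replayed output against the logged one with a similarity ratio rather than strict equality. A sketch using only the standard library (the example strings are illustrative):

```python
import difflib

def response_similarity(original: str, replayed: str) -> float:
    """Return a 0-1 similarity ratio between logged and replayed responses."""
    return difflib.SequenceMatcher(None, original, replayed).ratio()

logged = "The function has an off-by-one error in the loop bound."
replay = "The function has an off-by-one error in its loop bound."

print(response_similarity(logged, logged))  # identical strings → 1.0
print(round(response_similarity(logged, replay), 2))  # high, but below 1.0
```

A ratio well below ~0.9 on a temperature=0 replay is a signal that the prompt, model version, or context changed between the original call and the replay.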
3. Check Retrieved Context (for RAG)
If this was a RAG call, inspect what was retrieved:
SELECT * FROM retrieval_logs WHERE session_id = 'abc123'
Were the retrieved chunks relevant? See the RAG evaluation section for diagnosis.
4. Check Tool Calls
SELECT * FROM tool_logs WHERE session_id = 'abc123' ORDER BY timestamp
Did any tools fail? Did the LLM pass malformed args?
5. Compare to Baselines
Run the same query against different models or prompt templates:
results = []
for model in ["gpt-4o", "claude-3-5-sonnet-20250219", "gpt-4o-mini"]:
    response = llm.invoke(prompt, model=model)
    results.append({"model": model, "response": response})
If GPT-4o works but Claude fails, it's a model-specific issue (prompt not robust).
Alerting Strategy
Set up these alerts:
Cost Spike
Alert if hourly LLM spend exceeds 3× the 7-day average:
IF current_hour_spend > 3 × avg_hourly_spend_7d THEN alert
Error Rate Spike
Alert if LLM error rate exceeds 5%:
IF (errors / total_calls) > 0.05 THEN alert
Latency Degradation
Alert if p95 latency exceeds SLA (e.g., 5 seconds):
IF p95_latency > 5000ms THEN alert
Prompt Drift
Alert if average prompt length changes >30%:
IF abs(current_avg_prompt_tokens - baseline_avg_prompt_tokens) / baseline_avg_prompt_tokens > 0.3 THEN alert
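The four rules above fit naturally into one periodic check over aggregated metrics. A minimal sketch, assuming the snapshot values come from queries like those shown earlier (the field names are illustrative):

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Apply the four alert rules to a snapshot of aggregated metrics."""
    alerts = []
    if metrics["hour_spend"] > 3 * metrics["avg_hourly_spend_7d"]:
        alerts.append("cost_spike")
    if metrics["errors"] / metrics["total_calls"] > 0.05:
        alerts.append("error_rate")
    if metrics["p95_latency_ms"] > 5000:
        alerts.append("latency_sla")
    baseline = metrics["baseline_avg_prompt_tokens"]
    drift = abs(metrics["avg_prompt_tokens"] - baseline) / baseline
    if drift > 0.3:
        alerts.append("prompt_drift")
    return alerts

snapshot = {
    "hour_spend": 42.0, "avg_hourly_spend_7d": 10.0,   # 4.2x the 7-day average
    "errors": 12, "total_calls": 1000,                 # 1.2% error rate
    "p95_latency_ms": 6200,                            # over the 5s SLA
    "avg_prompt_tokens": 900, "baseline_avg_prompt_tokens": 700,  # ~29% drift
}
print(evaluate_alerts(snapshot))  # → ['cost_spike', 'latency_sla']
```

Run it on a schedule (cron, or your observability platform's monitor evaluator) and route the returned alert names to your paging system.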
The Observability Stack
Recommended tools by use case:
| Need | Tool | Notes |
|---|---|---|
| LLM call tracing | LangSmith | Best for LangChain apps |
| Cost tracking | Helicone | Proxy-based, works with any provider |
| Model evaluation | Braintrust | A/B testing, prompt versioning |
| OpenTelemetry-native tracing | OpenLLMetry | Open-source, vendor-neutral |
| Production monitoring | Arize / WhyLabs | Drift detection, anomaly alerts |
Conclusion
LLM observability requires instrumenting three layers: standard request metrics, token-level LLM metrics, and semantic quality metrics. Log every LLM call with full context. Track costs per-user and per-feature. Monitor latency and set SLA alerts. Build feedback loops to correlate user satisfaction with technical metrics. And when debugging, always start with the exact prompt and response—everything else is derivative.