Your RAG pipeline is returning wrong answers. But where exactly is it breaking?
Is the query embedding wrong? Did retrieval miss relevant docs? Did the LLM ignore the context? Without a systematic debugging approach, you're guessing in the dark.
This guide provides a 4-stage diagnostic workflow you can apply to any RAG pipeline — LangChain, LlamaIndex, or custom implementations.
🔧 Quick Start: Visual Debugging
Use rag-debugger.pages.dev to visualize your entire RAG pipeline. Paste your query, retrieved chunks, and LLM response to see exactly where failures occur. Free: 10 debug sessions/month.
The 4-Stage RAG Debugging Workflow
Every RAG pipeline has 4 stages. Debug them in order:
- Query → Embedding (Is the query represented correctly?)
- Embedding → Retrieval (Are relevant docs found?)
- Retrieval → Context Assembly (Is context formatted correctly?)
- Context → LLM Response (Does the model use context properly?)
Let's walk through each stage with concrete diagnostics.
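Before diving in, here is a minimal, dependency-free sketch of how the four stages fit together. The bag-of-words "embeddings" and the stubbed LLM call are placeholders so the skeleton runs anywhere; each function stands in for your real embedding model, vector store, and LLM.

```python
def embed(text: str) -> dict:
    # Stage 1: Query → Embedding (toy bag-of-words stand-in for a real model)
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def retrieve(query_vec: dict, docs: list[str], k: int = 2) -> list[str]:
    # Stage 2: Embedding → Retrieval (score docs by term overlap)
    def score(doc: str) -> int:
        doc_vec = embed(doc)
        return sum(query_vec.get(w, 0) * doc_vec.get(w, 0) for w in doc_vec)
    return sorted(docs, key=score, reverse=True)[:k]

def assemble_context(chunks: list[str]) -> str:
    # Stage 3: Retrieval → Context Assembly
    return "\n\n".join(chunks)

def answer(query: str, docs: list[str]) -> str:
    # Stage 4: Context → LLM Response (the LLM call is stubbed out;
    # this returns the prompt that would be sent)
    context = assemble_context(retrieve(embed(query), docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "User authentication with OAuth 2.0",
    "Database connection pooling",
]
print(answer("How do I set up user authentication", docs))
```

A failure at any stage corrupts everything downstream, which is why the sections below debug them in order.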
Stage 1: Query → Embedding Debugging
Goal: Verify the query embedding captures semantic meaning.
Diagnostic Checklist
- Does the query score high (>0.7) against documents you know are relevant?
- Do unrelated documents correctly score low?
- Are queries and documents preprocessed identically before embedding?
Debug Code
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('BAAI/bge-m3')

# Test query
query = "How do I authenticate users?"
query_embedding = model.encode([query])[0]

# Check against sample docs
docs = [
    "User authentication with OAuth 2.0 and JWT tokens",
    "Database connection pooling configuration",
    "Setting up login forms with React"
]
doc_embeddings = model.encode(docs)

# Calculate similarities
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
for doc, sim in zip(docs, similarities):
    print(f"{sim:.3f} - {doc}")

# Expected: high similarity (>0.7) for auth-related docs
# Red flag: all similarities < 0.5 → query embedding issue
```
Common Issues & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| All similarities < 0.5 | Query too short or wrong model | Expand query + switch to domain model |
| Unrelated docs score high | Preprocessing mismatch | Align query/doc preprocessing |
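The preprocessing-mismatch fix in the table is worth making concrete. One pattern (a sketch; `normalize` is a hypothetical helper) is to route queries and documents through a single cleanup function at both index time and query time, so the two can never drift apart:

```python
import re

def normalize(text: str) -> str:
    """Apply the SAME cleanup to queries and documents before embedding.
    A mismatch here (e.g. docs lowercased at index time, queries left raw)
    silently degrades similarity scores."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)        # collapse whitespace
    text = re.sub(r"[^\w\s./-]", "", text)  # strip stray punctuation
    return text.strip()

# Index time and query time must go through the same function:
print(normalize("User Authentication  with OAuth 2.0!"))  # user authentication with oauth 2.0
print(normalize("How do I AUTHENTICATE users?"))          # how do i authenticate users
```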
Stage 2: Embedding → Retrieval Debugging
Goal: Verify retrieval returns relevant documents.
Diagnostic Checklist
- Do the documents you expect appear in the top-k results?
- Is recall against a ground-truth set at least 0.8?
- Does the score distribution show clear separation, rather than uniformly low scores?
Debug Code
```python
import numpy as np

def debug_retrieval(query: str, retriever, expected_docs: list = None):
    """Debug retrieval step with optional ground truth"""
    # Get results
    results = retriever.search(query, k=10)
    print(f"Query: {query}\n")
    print(f"{'Score':<8} {'Document':<60}")
    print("-" * 70)
    for r in results:
        print(f"{r.score:<8.3f} {r.content[:60]}...")

    # If we have ground truth, check recall
    if expected_docs:
        retrieved_ids = {r.id for r in results}
        expected_ids = set(expected_docs)
        recall = len(retrieved_ids & expected_ids) / len(expected_ids)
        print(f"\nRecall@{len(results)}: {recall:.2%}")
        if recall < 0.8:
            missing = expected_ids - retrieved_ids
            print(f"Missing docs: {missing}")

    # Check score distribution
    scores = [r.score for r in results]
    print(f"\nScore stats: min={min(scores):.3f}, max={max(scores):.3f}, avg={np.mean(scores):.3f}")
    return results

# Usage
debug_retrieval(query, vectorstore.as_retriever(), expected_docs=["doc_42", "doc_156"])
```
Stage 3: Retrieval → Context Assembly Debugging
Goal: Verify context is formatted correctly for the LLM.
Diagnostic Checklist
- Are duplicate chunks removed before assembly?
- Does the assembled prompt fit within the model's token limit?
- Are chunks ordered by relevance score?
- Does the final prompt actually contain the context you expect, formatted as intended?
Debug Code
```python
import tiktoken

def debug_context_assembly(query: str, chunks: list, llm, max_tokens: int = 4000):
    """Debug context assembly step"""
    encoding = tiktoken.encoding_for_model(llm.model_name)

    # Check for duplicates
    content_hashes = [hash(c.content) for c in chunks]
    duplicates = len(content_hashes) - len(set(content_hashes))
    if duplicates > 0:
        print(f"⚠️ {duplicates} duplicate chunks detected")

    # Check token count
    context_text = "\n\n".join(
        f"[Source: {c.metadata['source']}]\n{c.content}" for c in chunks
    )
    prompt = f"Context:\n{context_text}\n\nQuestion: {query}"
    token_count = len(encoding.encode(prompt))
    print(f"Context tokens: {token_count}/{max_tokens} ({token_count/max_tokens:.1%})")
    if token_count > max_tokens:
        print("⚠️ Context exceeds limit! Need to truncate or re-rank.")

    # Check chunk ordering
    print("\nChunk order (by relevance score):")
    for i, c in enumerate(chunks[:5]):
        print(f"  {i+1}. Score={c.score:.3f} - {c.metadata['source']}")

    # Show actual prompt sent to LLM
    print(f"\n{'='*60}")
    print("PROMPT SENT TO LLM (first 500 chars):")
    print(f"{'='*60}")
    print(prompt[:500] + "..." if len(prompt) > 500 else prompt)
    return prompt
```
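When the context exceeds the budget, the usual remedy for the warning above is to truncate by relevance. A rough sketch (`fit_to_budget` and `Chunk` are hypothetical names, and the 4-characters-per-token estimate is a crude heuristic; swap in tiktoken for exact counts):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    content: str
    score: float

def fit_to_budget(chunks: list, max_tokens: int = 4000) -> list:
    """Greedily keep the highest-scoring chunks that fit the token budget.
    Uses a crude ~4-characters-per-token estimate for portability."""
    kept, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        est_tokens = len(c.content) // 4 + 1
        if used + est_tokens > max_tokens:
            continue  # skip chunks that would blow the budget
        kept.append(c)
        used += est_tokens
    return kept

chunks = [Chunk("a" * 400, 0.9), Chunk("b" * 400, 0.8), Chunk("c" * 400, 0.5)]
print(len(fit_to_budget(chunks, max_tokens=210)))  # → 2 (lowest-scoring chunk dropped)
```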
Stage 4: Context → LLM Response Debugging
Goal: Verify LLM uses context correctly.
Diagnostic Checklist
- Is every claim in the response supported by a retrieved chunk?
- Does the model admit uncertainty when the context lacks the answer?
- Does the response cite its sources?
Debug Code
```python
from sklearn.metrics.pairwise import cosine_similarity

def debug_llm_response(query: str, context: str, response: str, chunks: list):
    """Debug LLM response for hallucination and grounding.

    Assumes embed() is your embedding function, e.g. a wrapper around
    model.encode from Stage 1.
    """
    print("LLM RESPONSE ANALYSIS")
    print("=" * 60)
    print(response)
    print("=" * 60)

    # Extract claims (simplified - use sentence segmentation/NER in production)
    sentences = response.split('.')
    print(f"\nChecking {len(sentences)} claims against context...\n")
    for sent in sentences:
        if len(sent.strip()) < 10:
            continue
        # Check if claim is supported by any chunk
        max_sim = 0
        for chunk in chunks:
            sim = cosine_similarity(
                [embed(sent)],
                [embed(chunk.content)]
            )[0][0]
            max_sim = max(max_sim, sim)
        status = "✓" if max_sim > 0.7 else "⚠️ POTENTIAL HALLUCINATION"
        print(f"{status} {sent.strip()[:80]}...")
        print(f"   Max similarity to context: {max_sim:.3f}\n")

    # Check for "I don't know" when appropriate
    if ("not found in the context" in response.lower()
            or "don't have enough information" in response.lower()):
        print("✓ Model correctly indicates uncertainty")

    # Check for citations
    if "[" not in response or "]" not in response:
        print("⚠️ No citations found - answer may not be grounded in sources")
```
Production Debugging Patterns
Pattern 1: Shadow Mode Debugging
```python
import time
from datetime import datetime

class DebuggableRAG:
    def __init__(self, retriever, llm, debug_mode: bool = False):
        self.retriever = retriever
        self.llm = llm
        self.debug_mode = debug_mode
        self.debug_log = []

    def query(self, question: str) -> dict:
        start_time = time.time()
        # Stage 1: Retrieve
        chunks = self.retriever.search(question, k=10)
        # Stage 2: Assemble context
        context = self._assemble_context(chunks)
        # Stage 3: Generate
        response = self.llm.invoke(f"Context: {context}\n\nQuestion: {question}")
        latency_ms = (time.time() - start_time) * 1000

        # Debug logging
        if self.debug_mode:
            self.debug_log.append({
                "query": question,
                "retrieved_chunks": len(chunks),
                "context_chars": len(context),  # character count; tokenize for exact budgets
                "response": response,
                "latency_ms": latency_ms,
                "timestamp": datetime.now().isoformat()
            })

        return {
            "answer": response,
            "debug": {
                "chunks": [c.content for c in chunks],
                "chunk_ids": [c.id for c in chunks],
                "scores": [c.score for c in chunks],
                "context": context,
                "latency_ms": latency_ms
            } if self.debug_mode else None
        }
```
Pattern 2: Golden Set Evaluation
```python
import numpy as np
from dataclasses import dataclass
from sklearn.metrics.pairwise import cosine_similarity

@dataclass
class EvalQuery:
    query: str
    expected_answer: str
    expected_sources: list  # Document IDs that should be retrieved

def evaluate_rag(rag, eval_set: list[EvalQuery]) -> dict:
    """Evaluate RAG against a golden set.

    Expects rag to be a DebuggableRAG constructed with debug_mode=True, and
    embed() to be your embedding function (e.g. model.encode from Stage 1).
    """
    results = []
    for eq in eval_set:
        response = rag.query(eq.query)
        # Check retrieval recall using the chunk IDs from the debug payload
        retrieved_ids = response['debug']['chunk_ids']
        recall = len(set(retrieved_ids) & set(eq.expected_sources)) / len(eq.expected_sources)
        # Check answer similarity (embedding similarity as a proxy)
        answer_sim = cosine_similarity(
            [embed(response['answer'])],
            [embed(eq.expected_answer)]
        )[0][0]
        results.append({
            "query": eq.query,
            "recall": recall,
            "answer_similarity": answer_sim,
            "latency_ms": response['debug']['latency_ms']
        })

    # Aggregate metrics
    return {
        "avg_recall": np.mean([r['recall'] for r in results]),
        "avg_answer_similarity": np.mean([r['answer_similarity'] for r in results]),
        "avg_latency_ms": np.mean([r['latency_ms'] for r in results]),
        "p95_latency_ms": np.percentile([r['latency_ms'] for r in results], 95),
        "details": results
    }

# Usage
eval_results = evaluate_rag(rag, golden_set)
print(f"Recall@10: {eval_results['avg_recall']:.2%}")
print(f"Answer Similarity: {eval_results['avg_answer_similarity']:.3f}")
```
🚀 Debug Your RAG in Minutes
RAG Debugger automates this entire workflow:
- 4-stage diagnostic dashboard
- Automatic hallucination detection
- Side-by-side chunk comparison
- Export debug reports for team review
Try 10 free debug sessions → rag-debugger.pages.dev
Quick Reference: Common Issues & Fixes
| Issue | Stage | Quick Fix |
|---|---|---|
| Low similarity scores | Query→Embed | Expand query + better model |
| Wrong docs retrieved | Embed→Retrieve | Add hybrid search + reranking |
| Context too long | Retrieve→Context | Re-rank + compress to top 5 |
| Hallucinated answers | Context→Response | Ground prompt + citation requirement |
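The "hybrid search + reranking" fix can be approximated in a few lines with reciprocal rank fusion, which merges a keyword ranking and a vector ranking without needing their scores to be comparable (a sketch; the doc IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked lists of doc IDs (e.g. one from BM25 keyword
    search, one from vector search) into a single hybrid ranking.
    k=60 is the conventional damping constant from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_42", "doc_7", "doc_99"]
vector_hits  = ["doc_42", "doc_13", "doc_7"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# doc_42 ranks first: it scored highly in both lists
```

Because RRF only uses ranks, it sidesteps the usual problem of BM25 scores and cosine similarities living on incompatible scales.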
Conclusion
RAG debugging becomes manageable when you follow a systematic workflow:
- Query → Embedding: Verify semantic representation
- Embedding → Retrieval: Check recall and relevance
- Retrieval → Context: Validate formatting and token count
- Context → Response: Detect hallucination and grounding issues
For faster debugging, try RAG Debugger — a visual tool that automates this entire workflow. Start with 10 free sessions at rag-debugger.pages.dev.