Overview
RAG pipelines break at four distinct stages: ingestion, retrieval, reranking, and generation. This guide shows you how to isolate exactly where your pipeline is breaking, with production-tested debugging techniques for each stage.
Stage 1: Ingestion Debugging
When to suspect: New documents don't appear in search results, or search quality degrades after data updates.
Checkpoint 1.1: Document Parsing
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("policy.pdf")
docs = loader.load()
# Verify extracted text quality
print(f"Extracted {len(docs)} pages")
for i, doc in enumerate(docs[:3]):
print(f"Page {i}: {len(doc.page_content)} chars")
print(doc.page_content[:200]) # Check for garbled text, encoding issues
Common issues (the first two can be flagged automatically with the sketch below):
- OCR failures: scanned PDFs return empty strings
- Encoding errors: Unicode characters become �
- Table extraction: tables become unreadable text soup
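The first two issues are easy to detect programmatically. A minimal sketch, assuming the `docs` list from the loader above (the 20-character threshold is an arbitrary cutoff to tune for your corpus; mangled tables still need a manual spot check):
suspect_pages = []
for i, doc in enumerate(docs):
    text = doc.page_content
    if len(text.strip()) < 20:  # near-empty page usually means a scanned PDF / OCR failure
        suspect_pages.append((i, "empty or near-empty page"))
    if "\ufffd" in text:  # U+FFFD replacement character signals an encoding error
        suspect_pages.append((i, "unicode replacement characters"))
for page, issue in suspect_pages:
    print(f"⚠️ Page {page}: {issue}")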
Checkpoint 1.2: Chunking Validation
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# Validate chunks
for i, chunk in enumerate(chunks[:5]):
print(f"\n--- Chunk {i} ---")
print(f"Length: {len(chunk.page_content)} chars")
print(f"Starts with: {chunk.page_content[:50]}")
print(f"Ends with: {chunk.page_content[-50:]}")
# Red flag: chunk ends mid-word or mid-sentence
if not chunk.page_content[-1] in '.!?"\n':
print("⚠️ WARNING: Chunk ends without sentence boundary")
Checkpoint 1.3: Embedding Generation
from langchain.embeddings import OpenAIEmbeddings
import numpy as np
embeddings = OpenAIEmbeddings()
sample_texts = [
"Returns accepted within 30 days of purchase",
"Refunds processed in 5-7 business days"
]
vecs = embeddings.embed_documents(sample_texts)
print(f"Embedding dimension: {len(vecs[0])}")
print(f"Vector norms: {[np.linalg.norm(v) for v in vecs]}") # Should be ~1.0 for normalized
# Sanity check: similar texts should have high cosine similarity
similarity = np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"Similarity: {similarity:.3f}") # Expect > 0.7 for semantically similar texts
Checkpoint 1.4: Vector Store Sync
# Compare database vs vector store counts
db_count = session.query(Document).count()
vectorstore_count = vectorstore.index.ntotal
if db_count != vectorstore_count:
print(f"⚠️ Index out of sync: {vectorstore_count} vectors vs {db_count} documents")
# Find missing documents
db_ids = {doc.id for doc in session.query(Document.id).all()}
vector_ids = set(vectorstore.get_all_ids()) # Implementation-specific
missing = db_ids - vector_ids
print(f"Missing from vector store: {missing}")
Stage 2: Retrieval Debugging
When to suspect: LLM says "I don't have that information" despite documents existing.
Checkpoint 2.1: Query Embedding
query = "What is the refund policy?"
query_vec = embeddings.embed_query(query)
print(f"Query embedding dimension: {len(query_vec)}")
print(f"Query norm: {np.linalg.norm(query_vec):.3f}")
# Compare to ground truth document
ground_truth_doc = "Returns accepted within 30 days"
doc_vec = embeddings.embed_documents([ground_truth_doc])[0]
similarity = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(f"Query-doc similarity: {similarity:.3f}") # Should be > 0.5
Checkpoint 2.2: Top-K Results
results = vectorstore.similarity_search_with_score(query, k=10)
print(f"Retrieved {len(results)} results")
for i, (doc, score) in enumerate(results):
print(f"\nRank {i+1} | Score: {score:.3f}")
print(f"Metadata: {doc.metadata}")
print(f"Content preview: {doc.page_content[:100]}...")
# Red flag: relevant documents appear below position 5
# or similarity scores < 0.3
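You can turn these red flags into an automated check for a query you already know the answer to. A sketch, assuming `expected_snippet` is a ground-truth phrase you supply per test query (score conventions also vary by vector store, e.g. FAISS returns distances where lower is better, so adapt any threshold accordingly):
expected_snippet = "Returns accepted within 30 days"  # hypothetical ground truth for this query
hit_rank = next(
    (i + 1 for i, (doc, _) in enumerate(results)
     if expected_snippet.lower() in doc.page_content.lower()),
    None,
)
if hit_rank is None:
    print("⚠️ Ground-truth chunk not in top-10 - suspect ingestion or embeddings")
elif hit_rank > 5:
    print(f"⚠️ Ground-truth chunk only at rank {hit_rank} - reranking or chunking may need work")
else:
    print(f"Ground-truth chunk found at rank {hit_rank}")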
Checkpoint 2.3: Metadata Filtering
# Test filter isolation
results_unfiltered = vectorstore.similarity_search(query, k=10)
results_filtered = vectorstore.similarity_search(
    query,
    k=10,
    filter={"tenant_id": "acme-corp"}
)
print(f"Unfiltered: {len(results_unfiltered)} results")
print(f"Filtered: {len(results_filtered)} results")
if len(results_filtered) == 0:
    # Filter is too restrictive or metadata not indexed
    print("⚠️ Filter returning zero results - check metadata indexing")
Stage 3: Reranking Debugging
When to suspect: Retrieved documents look relevant but final answer is wrong or incomplete.
Checkpoint 3.1: Reranker Scores
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "What is the return policy?"
candidates = [doc.page_content for doc in results[:10]]
scores = reranker.predict([(query, cand) for cand in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for i, (text, score) in enumerate(ranked[:5]):
print(f"\nReranked #{i+1} | Score: {score:.3f}")
print(text[:100])
Checkpoint 3.2: Context Compression
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
context = "\n\n".join(text for text, score in ranked[:5])  # ranked holds (text, score) tuples
token_count = len(encoding.encode(context))
print(f"Context tokens: {token_count}")
print(f"Max tokens: 8192")
print(f"Utilization: {token_count / 8192 * 100:.1f}%")
if token_count > 5700: # 70% of 8K context window
print("⚠️ Context exceeds 70% of window - truncation likely")
Stage 4: Generation Debugging
When to suspect: Retrieved context looks correct but LLM output is wrong, generic, or ignores context.
Checkpoint 4.1: Prompt Inspection
# Reconstruct exact prompt sent to LLM
system_prompt = "Answer using ONLY the provided context. If unsure, say 'I don't know.'"
user_prompt = f"""Context:
{context}
Question: {query}
Answer:"""
print("=== FULL PROMPT ===")
print(system_prompt)
print(user_prompt)
print(f"\nTotal prompt tokens: {len(encoding.encode(system_prompt + user_prompt))}")
Checkpoint 4.2: Context Usage
from langchain.llms import OpenAI
llm = OpenAI(temperature=0)
answer = llm(user_prompt)
# Check if answer uses retrieved context
retrieved_phrases = set(context.split()[:20]) # First 20 words of context
answer_words = set(answer.split())
overlap = retrieved_phrases & answer_words
print(f"Context-answer overlap: {len(overlap)} / {len(retrieved_phrases)} words")
if len(overlap) < 3:
print("⚠️ LLM did not use retrieved context - check system prompt")
Checkpoint 4.3: Hallucination Detection
# Compare answer claims against source chunks
def check_hallucination(answer, source_chunks):
    """Check if the answer contains claims not found in the sources."""
    # Naive substring check; a production system would use an NLI model
    # or GPT-4 for a proper entailment check
    hallucinations = []
    for sentence in answer.split('.'):
        sentence = sentence.strip()
        if not sentence:
            continue
        found_in_source = any(
            sentence.lower() in chunk.lower()
            for chunk in source_chunks
        )
        if not found_in_source:
            hallucinations.append(sentence)
    return hallucinations
hallucinated = check_hallucination(answer, [doc.page_content for doc in results[:5]])
if hallucinated:
print(f"⚠️ Potential hallucinations: {hallucinated}")
Stop debugging RAG with print statements. RAG Debugger gives you:
- 🎯 Stage-by-stage waterfall — Pinpoint exactly where pipeline fails
- 📊 Live metrics dashboard — Track retrieval@k, reranker scores, token usage
- 🔬 A/B testing UI — Compare chunk sizes, embeddings, prompts side-by-side
- 🚀 Production monitoring — Alert on accuracy drops, latency spikes
FAQ
How do I know which stage is failing?
Work backwards: start at Stage 4 (generation). If prompt looks correct, check Stage 3 (reranking). If reranked docs are good, check Stage 2 (retrieval). If retrieval returns zero results, check Stage 1 (ingestion).
What's the fastest way to debug in production?
Add structured logging at each stage boundary. Log: query → retrieved doc IDs → reranked doc IDs → final answer. Store in ELK/Datadog for analysis.
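A minimal sketch of that boundary logging with the standard library (the field names and JSON-lines format are just one convention; swap in your ELK/Datadog handler, and substitute your production objects for the `query`, `results`, `ranked`, and `answer` variables from the checkpoints above):
import json, logging, time, uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")

def log_stage(trace_id, stage, **fields):
    """Emit one structured record per pipeline stage boundary."""
    logger.info(json.dumps({"trace_id": trace_id, "stage": stage, "ts": time.time(), **fields}))

trace_id = str(uuid.uuid4())
log_stage(trace_id, "retrieval", query=query, doc_ids=[doc.metadata.get("id") for doc, _ in results])
log_stage(trace_id, "rerank", top_chunks=[text[:40] for text, _ in ranked[:5]])
log_stage(trace_id, "generation", answer=answer[:200])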
Should I debug in notebooks or production?
Both. Use notebooks to reproduce failures with sample queries. Use production logs to identify patterns (e.g., "20% of queries fail on date range filters").
How do I measure RAG quality over time?
Build a golden dataset of query-answer pairs. Run nightly tests and track: retrieval precision@k, answer accuracy (BLEU/ROUGE), user satisfaction (thumbs up/down).
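A minimal sketch of the nightly retrieval check, measuring hit rate at k over the golden set (the `golden` list and the snippet-matching convention here are assumptions to adapt; answer accuracy and user satisfaction need their own tracking):
golden = [
    {"query": "What is the refund policy?", "expected_snippet": "Refunds processed in 5-7 business days"},
    # ... more query/snippet pairs from reviewed production traffic
]

def hit_at_k(vectorstore, item, k=5):
    docs = vectorstore.similarity_search(item["query"], k=k)
    return any(item["expected_snippet"].lower() in d.page_content.lower() for d in docs)

hits = sum(hit_at_k(vectorstore, item) for item in golden)
print(f"hit@5: {hits}/{len(golden)} = {hits / len(golden):.2f}")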