RAG Pipeline Troubleshooting: 4-Stage Debugging Workflow

Systematic approach to debugging RAG pipelines: isolate whether failure is in ingestion, retrieval, reranking, or generation.

Overview

RAG pipelines have four stages where failures occur: ingestion, retrieval, reranking, and generation. This guide shows you how to isolate exactly where your pipeline is breaking, with production-tested debugging techniques for each stage.

Stage 1: Ingestion Debugging

When to suspect: New documents don't appear in search results, or search quality degrades after data updates.

Checkpoint 1.1: Document Parsing

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("policy.pdf")
docs = loader.load()

# Verify extracted text quality
print(f"Extracted {len(docs)} pages")
for i, doc in enumerate(docs[:3]):
    print(f"Page {i}: {len(doc.page_content)} chars")
    print(doc.page_content[:200])  # Check for garbled text, encoding issues

Common issues:

  • OCR failures: scanned PDFs return empty strings
  • Encoding errors: Unicode characters become �
  • Table extraction: tables become unreadable text soup
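These failure modes can be caught with a quick automated pass over the loaded pages. A minimal sketch that takes raw page strings (i.e., each `doc.page_content`); the `min_chars` threshold is an assumption to tune for your corpus:

```python
def find_suspect_pages(pages, min_chars=50):
    """Flag pages that look like OCR or encoding failures."""
    suspects = []
    for i, text in enumerate(pages):
        if len(text.strip()) < min_chars:
            suspects.append((i, "near-empty page (possible OCR failure)"))
        elif "\ufffd" in text:  # U+FFFD is the � replacement character
            suspects.append((i, "encoding errors (replacement characters)"))
    return suspects

pages = ["Returns accepted within 30 days.", "", "Refunds in 5\ufffd7 days"]
print(find_suspect_pages(pages, min_chars=10))
# [(1, 'near-empty page (possible OCR failure)'), (2, 'encoding errors (replacement characters)')]
```

Run this after every ingestion batch; a sudden spike in flagged pages usually means a new document source with a different PDF producer.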

Checkpoint 1.2: Chunking Validation

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# Validate chunks
for i, chunk in enumerate(chunks[:5]):
    print(f"\n--- Chunk {i} ---")
    print(f"Length: {len(chunk.page_content)} chars")
    print(f"Starts with: {chunk.page_content[:50]}")
    print(f"Ends with: {chunk.page_content[-50:]}")
    
    # Red flag: chunk ends mid-word or mid-sentence
    # Red flag: chunk ends mid-word or mid-sentence
    if chunk.page_content and chunk.page_content[-1] not in '.!?"\n':
        print("⚠️ WARNING: Chunk ends without sentence boundary")

Checkpoint 1.3: Embedding Generation

from langchain.embeddings import OpenAIEmbeddings
import numpy as np

embeddings = OpenAIEmbeddings()
sample_texts = [
    "Returns accepted within 30 days of purchase",
    "Refunds processed in 5-7 business days"
]

vecs = embeddings.embed_documents(sample_texts)
print(f"Embedding dimension: {len(vecs[0])}")
print(f"Vector norms: {[np.linalg.norm(v) for v in vecs]}")  # Should be ~1.0 for normalized

# Sanity check: similar texts should have high cosine similarity
similarity = np.dot(vecs[0], vecs[1]) / (np.linalg.norm(vecs[0]) * np.linalg.norm(vecs[1]))
print(f"Similarity: {similarity:.3f}")  # Expect > 0.7 for semantically similar texts

Checkpoint 1.4: Vector Store Sync

# Compare database vs vector store counts
db_count = session.query(Document).count()
vectorstore_count = vectorstore.index.ntotal

if db_count != vectorstore_count:
    print(f"⚠️ Index out of sync: {vectorstore_count} vectors vs {db_count} documents")
    
    # Find missing documents
    db_ids = {doc.id for doc in session.query(Document.id).all()}
    vector_ids = set(vectorstore.get_all_ids())  # Implementation-specific
    missing = db_ids - vector_ids
    print(f"Missing from vector store: {missing}")

Stage 2: Retrieval Debugging

When to suspect: LLM says "I don't have that information" despite documents existing.

Checkpoint 2.1: Query Embedding

query = "What is the refund policy?"
query_vec = embeddings.embed_query(query)

print(f"Query embedding dimension: {len(query_vec)}")
print(f"Query norm: {np.linalg.norm(query_vec):.3f}")

# Compare to ground truth document
ground_truth_doc = "Returns accepted within 30 days"
doc_vec = embeddings.embed_documents([ground_truth_doc])[0]
similarity = np.dot(query_vec, doc_vec) / (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec))
print(f"Query-doc similarity: {similarity:.3f}")  # Should be > 0.5

Checkpoint 2.2: Top-K Results

results = vectorstore.similarity_search_with_score(query, k=10)

print(f"Retrieved {len(results)} results")
for i, (doc, score) in enumerate(results):
    print(f"\nRank {i+1} | Score: {score:.3f}")
    print(f"Metadata: {doc.metadata}")
    print(f"Content preview: {doc.page_content[:100]}...")
    
# Red flag: relevant documents appear below position 5
# or similarity scores < 0.3
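To turn that red flag into a repeatable check, compare retrieved IDs against known-relevant IDs for a handful of test queries. A minimal sketch; the doc IDs below are illustrative:

```python
def hit_rate_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of known-relevant docs that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)

# Illustrative: doc-42 is the known-relevant chunk for this query
retrieved = ["doc-7", "doc-42", "doc-13", "doc-9", "doc-2"]
print(hit_rate_at_k(retrieved, {"doc-42"}, k=5))  # 1.0
print(hit_rate_at_k(retrieved, {"doc-42"}, k=1))  # 0.0
```

If hit rate at k=10 is high but k=3 is low, retrieval works and reranking (Stage 3) is your problem.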

Checkpoint 2.3: Metadata Filtering

# Test filter isolation
results_unfiltered = vectorstore.similarity_search(query, k=10)
results_filtered = vectorstore.similarity_search(
    query, 
    k=10,
    filter={"tenant_id": "acme-corp"}
)

print(f"Unfiltered: {len(results_unfiltered)} results")
print(f"Filtered: {len(results_filtered)} results")

if len(results_filtered) == 0:
    # Filter is too restrictive or metadata not indexed
    print("⚠️ Filter returning zero results - check metadata indexing")

Stage 3: Reranking Debugging

When to suspect: Retrieved documents look relevant but final answer is wrong or incomplete.

Checkpoint 3.1: Reranker Scores

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
query = "What is the return policy?"
candidates = [doc.page_content for doc in results[:10]]

scores = reranker.predict([(query, cand) for cand in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

for i, (text, score) in enumerate(ranked[:5]):
    print(f"\nReranked #{i+1} | Score: {score:.3f}")
    print(text[:100])

Checkpoint 3.2: Context Compression

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
context = "\n\n".join(text for text, score in ranked[:5])  # ranked holds (text, score) pairs
token_count = len(encoding.encode(context))

print(f"Context tokens: {token_count}")
print(f"Max tokens: 8192")
print(f"Utilization: {token_count / 8192 * 100:.1f}%")

if token_count > int(8192 * 0.7):  # 70% of the 8K context window
    print("⚠️ Context exceeds 70% of window - truncation likely")
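When the context overflows, drop the lowest-ranked chunks until the budget fits rather than letting the model truncate silently. A minimal sketch; it takes a `count_tokens` callable so you can plug in `len(encoding.encode(text))` from above (the whitespace split here is just a stand-in):

```python
def fit_to_budget(chunks, count_tokens, budget):
    """Keep top-ranked chunks, in order, until the token budget is hit."""
    kept, used = [], 0
    for chunk in chunks:  # chunks assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept

count = lambda text: len(text.split())  # stand-in; use len(encoding.encode(text)) in practice
chunks = ["a b c d", "e f g", "h i j k l"]
print(fit_to_budget(chunks, count, budget=8))  # keeps the first two chunks
```

Because the chunks are already reranked best-first, dropping from the tail costs the least relevance.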

Stage 4: Generation Debugging

When to suspect: Retrieved context looks correct but LLM output is wrong, generic, or ignores context.

Checkpoint 4.1: Prompt Inspection

# Reconstruct exact prompt sent to LLM
system_prompt = "Answer using ONLY the provided context. If unsure, say 'I don't know.'"
user_prompt = f"""Context:
{context}

Question: {query}

Answer:"""

print("=== FULL PROMPT ===")
print(system_prompt)
print(user_prompt)
print(f"\nTotal prompt tokens: {len(encoding.encode(system_prompt + user_prompt))}")

Checkpoint 4.2: Context Usage

from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
answer = llm(system_prompt + "\n\n" + user_prompt)  # completion API takes one string, so prepend the system instruction

# Crude check: does the answer reuse words from the retrieved context?
context_words = set(context.split()[:20])  # First 20 words of context
answer_words = set(answer.split())

overlap = context_words & answer_words
print(f"Context-answer overlap: {len(overlap)} / {len(context_words)} words")

if len(overlap) < 3:
    print("⚠️ LLM did not use retrieved context - check system prompt")

Checkpoint 4.3: Hallucination Detection

# Compare answer claims against source chunks
def check_hallucination(answer, source_chunks):
    """Flag answer sentences that don't appear verbatim in any source chunk.

    Crude substring check - swap in an NLI model or a GPT-4
    entailment call for production use.
    """
    hallucinations = []
    
    for sentence in answer.split('.'):
        sentence = sentence.strip()
        if not sentence:
            continue
            
        found_in_source = any(
            sentence.lower() in chunk.lower() 
            for chunk in source_chunks
        )
        
        if not found_in_source:
            hallucinations.append(sentence)
    
    return hallucinations

hallucinated = check_hallucination(answer, [doc.page_content for doc in results[:5]])
if hallucinated:
    print(f"⚠️ Potential hallucinations: {hallucinated}")
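Exact substring matching over-flags because it misses paraphrases. A softer heuristic scores each answer sentence by token overlap with its best-matching source chunk; this is still only a lexical proxy (an NLI model is the more reliable option), but it survives reworded answers:

```python
def grounding_score(sentence, source_chunks):
    """Max fraction of the sentence's tokens found in any single chunk."""
    tokens = set(sentence.lower().split())
    if not tokens:
        return 0.0
    return max(
        len(tokens & set(chunk.lower().split())) / len(tokens)
        for chunk in source_chunks
    )

chunks = ["Returns accepted within 30 days of purchase"]
print(round(grounding_score("Returns are accepted within 30 days", chunks), 2))  # 0.83
print(round(grounding_score("Shipping is free worldwide", chunks), 2))           # 0.0
```

Sentences scoring below roughly 0.4 are worth routing to a heavier entailment check; the threshold is an assumption to calibrate on your data.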

FAQ

How do I know which stage is failing?

Work backwards: start at Stage 4 (generation). If prompt looks correct, check Stage 3 (reranking). If reranked docs are good, check Stage 2 (retrieval). If retrieval returns zero results, check Stage 1 (ingestion).

What's the fastest way to debug in production?

Add structured logging at each stage boundary. Log: query → retrieved doc IDs → reranked doc IDs → final answer. Store in ELK/Datadog for analysis.
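A minimal sketch of that stage-boundary logging, assuming a shared request_id ties the stages together (field names here are illustrative, not a fixed schema):

```python
import json
import logging
import uuid

logger = logging.getLogger("rag")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_stage(request_id, stage, **fields):
    """Emit one JSON line per stage so ELK/Datadog can join on request_id."""
    record = {"request_id": request_id, "stage": stage, **fields}
    logger.info(json.dumps(record))
    return record

rid = str(uuid.uuid4())
log_stage(rid, "retrieval", query="refund policy", doc_ids=["doc-42", "doc-7"])
log_stage(rid, "rerank", doc_ids=["doc-7", "doc-42"])
log_stage(rid, "generation", answer_chars=212)
```

Grepping one request_id then replays the whole pipeline: which docs came back, how the reranker reordered them, and what the LLM finally said.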

Should I debug in notebooks or production?

Both. Use notebooks to reproduce failures with sample queries. Use production logs to identify patterns (e.g., "20% of queries fail on date range filters").

How do I measure RAG quality over time?

Build a golden dataset of query-answer pairs. Run nightly tests and track: retrieval precision@k, answer accuracy (BLEU/ROUGE), user satisfaction (thumbs up/down).
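The retrieval half of that nightly run can be as simple as precision@k over the golden set. A minimal sketch; the queries and doc IDs are illustrative:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved docs that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

golden = [  # per query: what was retrieved vs. what a correct run must surface
    {"retrieved": ["doc-42", "doc-7", "doc-13"], "relevant": {"doc-42", "doc-13"}},
    {"retrieved": ["doc-9", "doc-2", "doc-5"], "relevant": {"doc-2"}},
]
scores = [precision_at_k(g["retrieved"], g["relevant"], k=3) for g in golden]
print(f"Mean precision@3: {sum(scores) / len(scores):.2f}")  # Mean precision@3: 0.50
```

Track this number per nightly run; a drop after a deploy localizes the regression to Stage 1 or 2 before any user notices.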