RAG Pipeline Troubleshooting: A Step-by-Step Debugging Guide

A systematic 4-stage diagnostic workflow for production RAG systems

Published: March 13, 2026 · 10 min read

Your RAG pipeline is returning wrong answers. But where exactly is it breaking?

Is the query embedding wrong? Did retrieval miss relevant docs? Did the LLM ignore the context? Without a systematic debugging approach, you're guessing in the dark.

This guide provides a 4-stage diagnostic workflow you can apply to any RAG pipeline — LangChain, LlamaIndex, or custom implementations.

🔧 Quick Start: Visual Debugging

Use rag-debugger.pages.dev to visualize your entire RAG pipeline. Paste your query, retrieved chunks, and LLM response to see exactly where failures occur. Free: 10 debug sessions/month.

The 4-Stage RAG Debugging Workflow

Every RAG pipeline has 4 stages. Debug them in order:

  1. Query → Embedding (Is the query represented correctly?)
  2. Embedding → Retrieval (Are relevant docs found?)
  3. Retrieval → Context Assembly (Is context formatted correctly?)
  4. Context → LLM Response (Does the model use context properly?)

Let's walk through each stage with concrete diagnostics.

Stage 1: Query → Embedding Debugging

Goal: Verify the query embedding captures semantic meaning.

Debug Code

from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('BAAI/bge-m3')

# Test query
query = "How do I authenticate users?"
query_embedding = model.encode([query])[0]

# Check against sample docs
docs = [
    "User authentication with OAuth 2.0 and JWT tokens",
    "Database connection pooling configuration",
    "Setting up login forms with React"
]
doc_embeddings = model.encode(docs)

# Calculate similarities
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

for doc, sim in zip(docs, similarities):
    print(f"{sim:.3f} - {doc}")

# Expected: High sim (>0.7) for auth-related docs
# Red flag: All sims < 0.5 → query embedding issue

Common Issues & Fixes

Symptom                     Likely Cause                     Fix
All similarities < 0.5      Query too short or wrong model   Expand query + switch to domain model
Unrelated docs score high   Preprocessing mismatch           Align query/doc preprocessing
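The "expand query" fix can be as simple as appending domain synonyms before embedding, so a short query shares more vocabulary with the indexed documents. A hedged sketch (the synonym map is illustrative; in practice, generate expansions with an LLM or a curated glossary):

```python
# Naive query expansion: append known domain terms so short queries
# carry more of the vocabulary that appears in the indexed documents.
SYNONYMS = {
    "authenticate": ["login", "OAuth", "JWT", "sign-in"],
    "users": ["accounts", "identity"],
}

def expand_query(query: str) -> str:
    extra = []
    for word in query.lower().replace("?", "").split():
        extra.extend(SYNONYMS.get(word, []))
    return query if not extra else f"{query} ({', '.join(extra)})"

# Appends domain terms: "... (login, OAuth, JWT, sign-in, accounts, identity)"
print(expand_query("How do I authenticate users?"))
```

Embed the expanded string instead of the raw query and re-run the similarity check above; scores for in-domain documents should rise.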

Stage 2: Embedding → Retrieval Debugging

Goal: Verify retrieval returns relevant documents.

Debug Code

def debug_retrieval(query: str, retriever, expected_docs: list = None):
    """Debug retrieval step with optional ground truth"""

    # Get results
    results = retriever.search(query, k=10)

    print(f"Query: {query}\n")
    print(f"{'Score':<8} {'Document':<60}")
    print("-" * 70)

    for r in results:
        print(f"{r.score:.3f}    {r.content[:60]}...")

    # If we have ground truth, check recall
    if expected_docs:
        retrieved_ids = {r.id for r in results}
        expected_ids = set(expected_docs)
        recall = len(retrieved_ids & expected_ids) / len(expected_ids)
        print(f"\nRecall@{len(results)}: {recall:.2%}")

        if recall < 0.8:
            missing = expected_ids - retrieved_ids
            print(f"Missing docs: {missing}")

    # Check score distribution
    scores = [r.score for r in results]
    print(f"\nScore stats: min={min(scores):.3f}, max={max(scores):.3f}, avg={np.mean(scores):.3f}")

    return results

# Usage
debug_retrieval(query, vectorstore.as_retriever(), expected_docs=["doc_42", "doc_156"])

Stage 3: Retrieval → Context Assembly Debugging

Goal: Verify context is formatted correctly for the LLM.

Debug Code

import tiktoken

def debug_context_assembly(query: str, chunks: list, llm, max_tokens: int = 4000):
    """Debug context assembly step"""

    encoding = tiktoken.encoding_for_model(llm.model_name)

    # Check for duplicates
    content_hashes = [hash(c.content) for c in chunks]
    duplicates = len(content_hashes) - len(set(content_hashes))
    if duplicates > 0:
        print(f"⚠️  {duplicates} duplicate chunks detected")

    # Check token count
    context_text = "\n\n".join([f"[Source: {c.metadata['source']}]\n{c.content}" for c in chunks])
    prompt = f"Context:\n{context_text}\n\nQuestion: {query}"
    token_count = len(encoding.encode(prompt))

    print(f"Context tokens: {token_count}/{max_tokens} ({token_count/max_tokens:.1%})")

    if token_count > max_tokens:
        print(f"⚠️  Context exceeds limit! Need to truncate or re-rank.")

    # Check chunk ordering
    print(f"\nChunk order (by relevance score):")
    for i, c in enumerate(chunks[:5]):
        print(f"  {i+1}. Score={c.score:.3f} - {c.metadata['source']}")

    # Show actual prompt sent to LLM
    print(f"\n{'='*60}")
    print("PROMPT SENT TO LLM (first 500 chars):")
    print(f"{'='*60}")
    print(prompt[:500] + "..." if len(prompt) > 500 else prompt)

    return prompt
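When the context exceeds the budget, the usual remedy is greedy packing: keep the highest-scoring chunks that still fit. A minimal sketch using a rough 4-characters-per-token estimate (swap in tiktoken for exact counts; the chunk dicts are illustrative):

```python
# Greedy context packing: sort chunks by relevance score, keep those
# that fit under the token budget, drop the rest.
def pack_context(chunks: list[dict], max_tokens: int = 4000) -> list[dict]:
    est = lambda text: len(text) // 4  # rough estimate; use tiktoken in production
    kept, used = [], 0
    for c in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = est(c["content"])
        if used + cost <= max_tokens:
            kept.append(c)
            used += cost
    return kept

chunks = [
    {"content": "A" * 8000, "score": 0.9},
    {"content": "B" * 4000, "score": 0.8},
    {"content": "C" * 4000, "score": 0.7},
]
kept = pack_context(chunks, max_tokens=2500)
print([c["score"] for c in kept])  # → [0.9]
```

Greedy packing preserves the best chunks at the cost of possibly wasting budget tail space; a reranker before packing usually improves which chunks count as "best".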

Stage 4: Context → LLM Response Debugging

Goal: Verify LLM uses context correctly.

Debug Code

def debug_llm_response(query: str, context: str, response: str, chunks: list):
    """Debug LLM response for hallucination and grounding"""
    # embed() below stands in for your embedding function,
    # e.g. embed = lambda t: model.encode([t])[0]

    print("LLM RESPONSE ANALYSIS")
    print("=" * 60)
    print(response)
    print("=" * 60)

    # Extract claims (simplified - use NER in production)
    sentences = response.split('.')
    print(f"\nChecking {len(sentences)} claims against context...\n")

    for sent in sentences:
        if len(sent.strip()) < 10:
            continue

        # Check if claim is supported by any chunk
        max_sim = 0
        for chunk in chunks:
            sim = cosine_similarity(
                [embed(sent)],
                [embed(chunk.content)]
            )[0][0]
            max_sim = max(max_sim, sim)

        status = "✓" if max_sim > 0.7 else "⚠️  POTENTIAL HALLUCINATION"
        print(f"{status} {sent.strip()[:80]}...")
        print(f"    Max similarity to context: {max_sim:.3f}\n")

    # Check for "I don't know" when appropriate
    if "not found in the context" in response.lower() or "don't have enough information" in response.lower():
        print("✓ Model correctly indicates uncertainty")

    # Check for citations
    if "[" not in response or "]" not in response:
        print("⚠️  No citations found - model may be hallucinating")

Production Debugging Patterns

Pattern 1: Shadow Mode Debugging

import time
from datetime import datetime

class DebuggableRAG:
    def __init__(self, retriever, llm, debug_mode: bool = False):
        self.retriever = retriever
        self.llm = llm
        self.debug_mode = debug_mode
        self.debug_log = []

    def _assemble_context(self, chunks) -> str:
        # Minimal assembly; swap in your own formatting/source-labeling logic
        return "\n\n".join(c.content for c in chunks)

    def query(self, question: str) -> dict:
        start_time = time.time()

        # Stages 1-2: embed the query and retrieve (handled inside the retriever)
        chunks = self.retriever.search(question, k=10)

        # Stage 3: assemble context
        context = self._assemble_context(chunks)

        # Stage 4: generate
        response = self.llm.invoke(f"Context: {context}\n\nQuestion: {question}")

        # Debug logging
        if self.debug_mode:
            self.debug_log.append({
                "query": question,
                "retrieved_chunks": len(chunks),
                "context_chars": len(context),  # character count; tokenize for true token usage
                "response": response,
                "latency_ms": (time.time() - start_time) * 1000,
                "timestamp": datetime.now().isoformat()
            })

        return {
            "answer": response,
            "debug": {
                "chunks": chunks,  # full chunk objects (content, score, metadata)
                "scores": [c.score for c in chunks],
                "context": context,
                "latency_ms": (time.time() - start_time) * 1000
            } if self.debug_mode else None
        }

Pattern 2: Golden Set Evaluation

import numpy as np
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    expected_answer: str
    expected_sources: list  # Document IDs that should be retrieved

def evaluate_rag(rag, eval_set: list[EvalQuery]) -> dict:
    """Evaluate RAG against golden set"""

    results = []
    for eq in eval_set:
        response = rag.query(eq.query)  # assumes rag was built with debug_mode=True

        # Check retrieval recall
        retrieved_ids = [c.metadata['id'] for c in response['debug']['chunks']]
        recall = len(set(retrieved_ids) & set(eq.expected_sources)) / len(eq.expected_sources)

        # Check answer similarity (embedding similarity as proxy; embed() = your embedding fn)
        answer_sim = cosine_similarity(
            [embed(response['answer'])],
            [embed(eq.expected_answer)]
        )[0][0]

        results.append({
            "query": eq.query,
            "recall": recall,
            "answer_similarity": answer_sim,
            "latency_ms": response['debug']['latency_ms']
        })

    # Aggregate metrics
    return {
        "avg_recall": np.mean([r['recall'] for r in results]),
        "avg_answer_similarity": np.mean([r['answer_similarity'] for r in results]),
        "avg_latency_ms": np.mean([r['latency_ms'] for r in results]),
        "p95_latency_ms": np.percentile([r['latency_ms'] for r in results], 95),
        "details": results
    }

# Usage
eval_results = evaluate_rag(rag, golden_set)
print(f"Recall@10: {eval_results['avg_recall']:.2%}")
print(f"Answer Similarity: {eval_results['avg_answer_similarity']:.3f}")

🚀 Debug Your RAG in Minutes

RAG Debugger automates this entire workflow.

Try 10 free debug sessions → rag-debugger.pages.dev

Quick Reference: Common Issues & Fixes

Issue                  Stage              Quick Fix
Low similarity scores  Query→Embed        Expand query + better model
Wrong docs retrieved   Embed→Retrieve     Add hybrid search + reranking
Context too long       Retrieve→Context   Re-rank + compress to top 5
Hallucinated answers   Context→Response   Ground prompt + citation requirement
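The "hybrid search + reranking" fix typically means fusing a keyword ranking with a vector ranking. Reciprocal Rank Fusion (RRF) is the standard, parameter-light way to combine them; a sketch assuming you already have two ranked lists of document IDs:

```python
# Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank).
# Documents ranked high in either list surface near the top of the fusion.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_42", "doc_7", "doc_156"]
keyword_hits = ["doc_156", "doc_42", "doc_99"]
print(rrf([vector_hits, keyword_hits]))  # → ['doc_42', 'doc_156', 'doc_7', 'doc_99']
```

k=60 is the conventional damping constant; documents appearing in both rankings (doc_42, doc_156) beat documents appearing in only one.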

Conclusion

RAG debugging becomes manageable when you follow a systematic workflow:

  1. Query → Embedding: Verify semantic representation
  2. Embedding → Retrieval: Check recall and relevance
  3. Retrieval → Context: Validate formatting and token count
  4. Context → Response: Detect hallucination and grounding issues

For faster debugging, try RAG Debugger — a visual tool that automates this entire workflow. Start with 10 free sessions at rag-debugger.pages.dev.