Your RAG pipeline is returning wrong answers. But where exactly is it breaking?
Is the query embedding wrong? Did retrieval miss relevant docs? Did the LLM ignore the context? Without a systematic debugging approach, you're guessing in the dark.
This guide provides a 4-stage diagnostic workflow you can apply to any RAG pipeline — LangChain, LlamaIndex, or custom implementations.
🔧 Quick Start: Visual Debugging
Use rag-debugger.pages.dev to visualize your entire RAG pipeline. Paste your query, retrieved chunks, and LLM response to see exactly where failures occur. Free: 10 debug sessions/month.
The 4-Stage RAG Debugging Workflow
Every RAG pipeline has 4 stages. Debug them in order:
- Query → Embedding (Is the query represented correctly?)
- Embedding → Retrieval (Are relevant docs found?)
- Retrieval → Context Assembly (Is context formatted correctly?)
- Context → LLM Response (Does the model use context properly?)
Let's walk through each stage with concrete diagnostics.
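Before diving in, here is a minimal, dependency-free sketch of how the four stages fit together. The bag-of-words "embeddings" and the stubbed LLM call are placeholders so the skeleton runs anywhere; each function stands in for your real embedding model, vector store, and LLM.

```python
def embed(text: str) -> dict:
    # Stage 1: Query → Embedding (toy bag-of-words stand-in for a real model)
    words = text.lower().split()
    return {w: words.count(w) for w in set(words)}

def retrieve(query_vec: dict, docs: list[str], k: int = 2) -> list[str]:
    # Stage 2: Embedding → Retrieval (score docs by term overlap)
    def score(doc: str) -> int:
        doc_vec = embed(doc)
        return sum(query_vec.get(w, 0) * doc_vec.get(w, 0) for w in doc_vec)
    return sorted(docs, key=score, reverse=True)[:k]

def assemble_context(chunks: list[str]) -> str:
    # Stage 3: Retrieval → Context Assembly
    return "\n\n".join(chunks)

def answer(query: str, docs: list[str]) -> str:
    # Stage 4: Context → LLM Response (the LLM call is stubbed out;
    # this returns the prompt that would be sent)
    context = assemble_context(retrieve(embed(query), docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "User authentication with OAuth 2.0",
    "Database connection pooling",
]
print(answer("How do I set up user authentication", docs))
```

A failure at any stage corrupts everything downstream, which is why the sections below debug them in order.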
Stage 1: Query → Embedding Debugging
Goal: Verify the query embedding captures semantic meaning.
Diagnostic Checklist
- Does the query score high (>0.7) against documents you know are relevant?
- Do unrelated documents correctly score low?
- Are queries and documents preprocessed identically before embedding?
Debug Code
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('BAAI/bge-m3')

# Test query
query = "How do I authenticate users?"
query_embedding = model.encode([query])[0]

# Check against sample docs
docs = [
    "User authentication with OAuth 2.0 and JWT tokens",
    "Database connection pooling configuration",
    "Setting up login forms with React"
]
doc_embeddings = model.encode(docs)

# Calculate similarities
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]
for doc, sim in zip(docs, similarities):
    print(f"{sim:.3f} - {doc}")

# Expected: high similarity (>0.7) for auth-related docs
# Red flag: all similarities < 0.5 → query embedding issue
```
Common Issues & Fixes
| Symptom | Likely Cause | Fix |
|---|---|---|
| All similarities < 0.5 | Query too short or wrong model | Expand query + switch to domain model |
| Unrelated docs score high | Preprocessing mismatch | Align query/doc preprocessing |
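The preprocessing-mismatch fix in the table is worth making concrete. One pattern (a sketch; `normalize` is a hypothetical helper) is to route queries and documents through a single cleanup function at both index time and query time, so the two can never drift apart:

```python
import re

def normalize(text: str) -> str:
    """Apply the SAME cleanup to queries and documents before embedding.
    A mismatch here (e.g. docs lowercased at index time, queries left raw)
    silently degrades similarity scores."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)        # collapse whitespace
    text = re.sub(r"[^\w\s./-]", "", text)  # strip stray punctuation
    return text.strip()

# Index time and query time must go through the same function:
print(normalize("User Authentication  with OAuth 2.0!"))  # user authentication with oauth 2.0
print(normalize("How do I AUTHENTICATE users?"))          # how do i authenticate users
```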
Stage 2: Embedding → Retrieval Debugging
Goal: Verify retrieval returns relevant documents.
Diagnostic Checklist
- Do the documents you expect appear in the top-k results?
- Is recall against a ground-truth set at least 0.8?
- Does the score distribution show clear separation, rather than uniformly low scores?
Debug Code
```python
import numpy as np

def debug_retrieval(query: str, retriever, expected_docs: list = None):
    """Debug retrieval step with optional ground truth"""
    # Get results
    results = retriever.search(query, k=10)
    print(f"Query: {query}\n")
    print(f"{'Score':<8} {'Document':<60}")
    print("-" * 70)
    for r in results:
        print(f"{r.score:<8.3f} {r.content[:60]}...")

    # If we have ground truth, check recall
    if expected_docs:
        retrieved_ids = {r.id for r in results}
        expected_ids = set(expected_docs)
        recall = len(retrieved_ids & expected_ids) / len(expected_ids)
        print(f"\nRecall@{len(results)}: {recall:.2%}")
        if recall < 0.8:
            missing = expected_ids - retrieved_ids
            print(f"Missing docs: {missing}")

    # Check score distribution
    scores = [r.score for r in results]
    print(f"\nScore stats: min={min(scores):.3f}, max={max(scores):.3f}, avg={np.mean(scores):.3f}")
    return results

# Usage
debug_retrieval(query, vectorstore.as_retriever(), expected_docs=["doc_42", "doc_156"])
```
Stage 3: Retrieval → Context Assembly Debugging
Goal: Verify context is formatted correctly for the LLM.
Diagnostic Checklist
- Are duplicate chunks removed before assembly?
- Does the assembled prompt fit within the model's token limit?
- Are chunks ordered by relevance score?
- Does the final prompt actually contain the context you expect, formatted as intended?
Debug Code
```python
import tiktoken

def debug_context_assembly(query: str, chunks: list, llm, max_tokens: int = 4000):
    """Debug context assembly step"""
    encoding = tiktoken.encoding_for_model(llm.model_name)

    # Check for duplicates
    content_hashes = [hash(c.content) for c in chunks]
    duplicates = len(content_hashes) - len(set(content_hashes))
    if duplicates > 0:
        print(f"⚠️ {duplicates} duplicate chunks detected")

    # Check token count
    context_text = "\n\n".join(
        f"[Source: {c.metadata['source']}]\n{c.content}" for c in chunks
    )
    prompt = f"Context:\n{context_text}\n\nQuestion: {query}"
    token_count = len(encoding.encode(prompt))
    print(f"Context tokens: {token_count}/{max_tokens} ({token_count/max_tokens:.1%})")
    if token_count > max_tokens:
        print("⚠️ Context exceeds limit! Need to truncate or re-rank.")

    # Check chunk ordering
    print("\nChunk order (by relevance score):")
    for i, c in enumerate(chunks[:5]):
        print(f"  {i+1}. Score={c.score:.3f} - {c.metadata['source']}")

    # Show actual prompt sent to LLM
    print(f"\n{'='*60}")
    print("PROMPT SENT TO LLM (first 500 chars):")
    print(f"{'='*60}")
    print(prompt[:500] + "..." if len(prompt) > 500 else prompt)
    return prompt
```
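When the context exceeds the budget, the usual remedy for the warning above is to truncate by relevance. A rough sketch (`fit_to_budget` and `Chunk` are hypothetical names, and the 4-characters-per-token estimate is a crude heuristic; swap in tiktoken for exact counts):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    content: str
    score: float

def fit_to_budget(chunks: list, max_tokens: int = 4000) -> list:
    """Greedily keep the highest-scoring chunks that fit the token budget.
    Uses a crude ~4-characters-per-token estimate for portability."""
    kept, used = [], 0
    for c in sorted(chunks, key=lambda c: c.score, reverse=True):
        est_tokens = len(c.content) // 4 + 1
        if used + est_tokens > max_tokens:
            continue  # skip chunks that would blow the budget
        kept.append(c)
        used += est_tokens
    return kept

chunks = [Chunk("a" * 400, 0.9), Chunk("b" * 400, 0.8), Chunk("c" * 400, 0.5)]
print(len(fit_to_budget(chunks, max_tokens=210)))  # → 2 (lowest-scoring chunk dropped)
```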
Stage 4: Context → LLM Response Debugging
Goal: Verify LLM uses context correctly.
Diagnostic Checklist
- Is every claim in the response supported by a retrieved chunk?
- Does the model admit uncertainty when the context lacks the answer?
- Does the response cite its sources?
Debug Code
```python
from sklearn.metrics.pairwise import cosine_similarity

def debug_llm_response(query: str, context: str, response: str, chunks: list):
    """Debug LLM response for hallucination and grounding.

    Assumes embed() is your embedding function, e.g. a wrapper around
    model.encode from Stage 1.
    """
    print("LLM RESPONSE ANALYSIS")
    print("=" * 60)
    print(response)
    print("=" * 60)

    # Extract claims (simplified - use sentence segmentation/NER in production)
    sentences = response.split('.')
    print(f"\nChecking {len(sentences)} claims against context...\n")
    for sent in sentences:
        if len(sent.strip()) < 10:
            continue
        # Check if claim is supported by any chunk
        max_sim = 0
        for chunk in chunks:
            sim = cosine_similarity(
                [embed(sent)],
                [embed(chunk.content)]
            )[0][0]
            max_sim = max(max_sim, sim)
        status = "✓" if max_sim > 0.7 else "⚠️ POTENTIAL HALLUCINATION"
        print(f"{status} {sent.strip()[:80]}...")
        print(f"   Max similarity to context: {max_sim:.3f}\n")

    # Check for "I don't know" when appropriate
    if ("not found in the context" in response.lower()
            or "don't have enough information" in response.lower()):
        print("✓ Model correctly indicates uncertainty")

    # Check for citations
    if "[" not in response or "]" not in response:
        print("⚠️ No citations found - answer may not be grounded in sources")
```
Production Debugging Patterns
Pattern 1: Shadow Mode Debugging
```python
import time
from datetime import datetime

class DebuggableRAG:
    def __init__(self, retriever, llm, debug_mode: bool = False):
        self.retriever = retriever
        self.llm = llm
        self.debug_mode = debug_mode
        self.debug_log = []

    def query(self, question: str) -> dict:
        start_time = time.time()
        # Stage 1: Retrieve
        chunks = self.retriever.search(question, k=10)
        # Stage 2: Assemble context
        context = self._assemble_context(chunks)
        # Stage 3: Generate
        response = self.llm.invoke(f"Context: {context}\n\nQuestion: {question}")
        latency_ms = (time.time() - start_time) * 1000

        # Debug logging
        if self.debug_mode:
            self.debug_log.append({
                "query": question,
                "retrieved_chunks": len(chunks),
                "context_chars": len(context),  # character count; tokenize for exact budgets
                "response": response,
                "latency_ms": latency_ms,
                "timestamp": datetime.now().isoformat()
            })

        return {
            "answer": response,
            "debug": {
                "chunks": [c.content for c in chunks],
                "chunk_ids": [c.id for c in chunks],
                "scores": [c.score for c in chunks],
                "context": context,
                "latency_ms": latency_ms
            } if self.debug_mode else None
        }
```
Pattern 2: Golden Set Evaluation
```python
import numpy as np
from dataclasses import dataclass
from sklearn.metrics.pairwise import cosine_similarity

@dataclass
class EvalQuery:
    query: str
    expected_answer: str
    expected_sources: list  # Document IDs that should be retrieved

def evaluate_rag(rag, eval_set: list[EvalQuery]) -> dict:
    """Evaluate RAG against a golden set.

    Expects rag to be a DebuggableRAG constructed with debug_mode=True, and
    embed() to be your embedding function (e.g. model.encode from Stage 1).
    """
    results = []
    for eq in eval_set:
        response = rag.query(eq.query)
        # Check retrieval recall using the chunk IDs from the debug payload
        retrieved_ids = response['debug']['chunk_ids']
        recall = len(set(retrieved_ids) & set(eq.expected_sources)) / len(eq.expected_sources)
        # Check answer similarity (embedding similarity as a proxy)
        answer_sim = cosine_similarity(
            [embed(response['answer'])],
            [embed(eq.expected_answer)]
        )[0][0]
        results.append({
            "query": eq.query,
            "recall": recall,
            "answer_similarity": answer_sim,
            "latency_ms": response['debug']['latency_ms']
        })

    # Aggregate metrics
    return {
        "avg_recall": np.mean([r['recall'] for r in results]),
        "avg_answer_similarity": np.mean([r['answer_similarity'] for r in results]),
        "avg_latency_ms": np.mean([r['latency_ms'] for r in results]),
        "p95_latency_ms": np.percentile([r['latency_ms'] for r in results], 95),
        "details": results
    }

# Usage
eval_results = evaluate_rag(rag, golden_set)
print(f"Recall@10: {eval_results['avg_recall']:.2%}")
print(f"Answer Similarity: {eval_results['avg_answer_similarity']:.3f}")
```
🚀 Debug Your RAG in Minutes
RAG Debugger automates this entire workflow:
- 4-stage diagnostic dashboard
- Automatic hallucination detection
- Side-by-side chunk comparison
- Export debug reports for team review
Try 10 free debug sessions → rag-debugger.pages.dev
Quick Reference: Common Issues & Fixes
| Issue | Stage | Quick Fix |
|---|---|---|
| Low similarity scores | Query→Embed | Expand query + better model |
| Wrong docs retrieved | Embed→Retrieve | Add hybrid search + reranking |
| Context too long | Retrieve→Context | Re-rank + compress to top 5 |
| Hallucinated answers | Context→Response | Ground prompt + citation requirement |
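The "hybrid search + reranking" fix can be approximated in a few lines with reciprocal rank fusion, which merges a keyword ranking and a vector ranking without needing their scores to be comparable (a sketch; the doc IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge multiple ranked lists of doc IDs (e.g. one from BM25 keyword
    search, one from vector search) into a single hybrid ranking.
    k=60 is the conventional damping constant from the RRF literature."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_42", "doc_7", "doc_99"]
vector_hits  = ["doc_42", "doc_13", "doc_7"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# doc_42 ranks first: it scored highly in both lists
```

Because RRF only uses ranks, it sidesteps the usual problem of BM25 scores and cosine similarities living on incompatible scales.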
Conclusion
RAG debugging becomes manageable when you follow a systematic workflow:
- Query → Embedding: Verify semantic representation
- Embedding → Retrieval: Check recall and relevance
- Retrieval → Context: Validate formatting and token count
- Context → Response: Detect hallucination and grounding issues
For faster debugging, try RAG Debugger — a visual tool that automates this entire workflow. Start with 10 free sessions at rag-debugger.pages.dev.