Overview
RAG systems fail in predictable ways. This guide covers 8 failure modes you'll encounter in production, with diagnostic techniques and fixes that work at scale.
Failure Mode 1: Poor Retrieval Accuracy
Symptom: The LLM hallucinates or says "I don't have enough information" despite relevant documents existing in the knowledge base.
Diagnosis:
- Log the top-k retrieved chunks for failed queries
- Calculate semantic similarity scores between query and top results
- Check if relevant documents are ranked below position k
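The checks above can be wrapped in a small audit helper. This is an illustrative sketch, not a library API: the `audit_retrieval` name, the `(doc_id, text, score)` tuple shape, and the 0.5 score threshold are all assumptions you should adapt to your stack.

```python
def audit_retrieval(query, scored_chunks, expected_doc_id=None, min_score=0.5):
    """Log top-k chunks with similarity scores and flag likely failures.

    scored_chunks: list of (doc_id, text, score), highest score first.
    """
    report = {"query": query, "low_score": False, "expected_missing": False}
    for rank, (doc_id, text, score) in enumerate(scored_chunks, start=1):
        print(f"#{rank} {doc_id} score={score:.3f} :: {text[:60]}")
    if scored_chunks and scored_chunks[0][2] < min_score:
        report["low_score"] = True  # even the best hit is only weakly related
    if expected_doc_id is not None:
        ids = [doc_id for doc_id, _, _ in scored_chunks]
        report["expected_missing"] = expected_doc_id not in ids
    return report

report = audit_retrieval(
    "What is the refund policy?",
    [("KB-042", "Returns accepted within 30 days", 0.81),
     ("KB-007", "Shipping times vary by region", 0.34)],
    expected_doc_id="SLA-2024",
)
```

Run this for every failed query and aggregate the reports: a high rate of `expected_missing` points at indexing or embedding problems; a high rate of `low_score` points at query-document vocabulary mismatch.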
Fixes:
# Adjust retrieval parameters
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance
    search_kwargs={
        "k": 10,            # Increase from the default 4
        "fetch_k": 50,      # Fetch more candidates before MMR filtering
        "lambda_mult": 0.7  # Relevance vs. diversity trade-off (1.0 = pure relevance)
    }
)
Failure Mode 2: Context Window Overflow
Symptom: Truncation warnings, incomplete answers, or API errors about token limits.
Diagnosis:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
total_tokens = sum(len(encoding.encode(doc.page_content)) for doc in retrieved_docs)
print(f"Retrieved context: {total_tokens} tokens") # Should be < 70% of limit
Fixes:
- Use reranking to compress k=20 candidates down to the 5 highest-quality chunks
- Implement context compression with extractive summarization
- Switch to models with larger context windows (Claude 200K vs GPT-4 128K)
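The reranking fix can be sketched generically. In practice `score_fn` would be a cross-encoder (e.g. sentence-transformers' `CrossEncoder.predict`); the token-overlap scorer below is only a stand-in so the example is self-contained, and the `rerank` name is illustrative.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Rerank retrieved candidates and keep only the highest-quality ones.

    score_fn(query, text) -> float; higher means more relevant.
    """
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]

def overlap(query, text):
    """Toy relevance score: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

docs = [f"doc about topic {i}" for i in range(18)] + [
    "refund policy details", "refund policy for returns"]
top = rerank("what is the refund policy", docs, overlap, keep=5)
```

Reranking lets you cast a wide retrieval net (high recall) and then spend a more expensive model only on the shortlist, which keeps the final context small.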
Failure Mode 3: Suboptimal Chunking Strategy
Symptom: Answers cut off mid-sentence, or critical context is split across chunks.
Diagnosis:
Inspect retrieved chunks and check for:
- Chunks that end abruptly without sentence boundaries
- Related concepts split across multiple chunks
- Orphaned pronouns ("it", "this") without antecedents
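The inspection checklist above is easy to automate. A minimal sketch (the `chunk_health` name, the punctuation set, and the pronoun list are illustrative assumptions):

```python
import re

def chunk_health(chunks):
    """Flag chunks with the symptoms above: abrupt endings and orphaned pronouns."""
    issues = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if text and text[-1] not in ".!?\"'":
            issues.append((i, "no sentence boundary at end"))
        # A leading pronoun usually means the antecedent lives in the previous chunk.
        if re.match(r"^(it|this|they|these|those)\b", text, re.IGNORECASE):
            issues.append((i, "starts with orphaned pronoun"))
    return issues

issues = chunk_health([
    "The refund window is 30 days.",
    "This applies to all purchases made onl",  # cut mid-word, pronoun start
])
```

Running this over a sample of your corpus before indexing gives a quick health score for a candidate chunking configuration.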
Fixes:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # Critical: preserve context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # Try semantic boundaries first
    length_function=len
)
Failure Mode 4: Embedding Model Mismatch
Symptom: Query-document similarity scores are artificially low, or irrelevant results rank highly.
Diagnosis:
# Compare query and document representations
query_embedding = embed_model.embed_query("What is the refund policy?")
doc_embedding = embed_model.embed_documents(["Returns accepted within 30 days"])
from numpy import dot
from numpy.linalg import norm
similarity = dot(query_embedding, doc_embedding[0]) / (norm(query_embedding) * norm(doc_embedding[0]))
print(f"Cosine similarity: {similarity:.3f}") # Should be > 0.5 for relevant pairs
Fixes:
- Use domain-specific embeddings (e.g., msmarco-MiniLM for search, legal-bert for contracts)
- Fine-tune on your query-document pairs with hard negatives
- Implement hybrid search: combine embeddings with BM25 keyword search
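One common way to combine the two rankings in hybrid search is Reciprocal Rank Fusion (RRF). A self-contained sketch, with toy doc IDs in place of real retriever output:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-id lists (e.g. one from embeddings, one from BM25).

    Each doc scores sum(1 / (k + rank)); k=60 is the constant from the
    original RRF paper and dampens the impact of any single ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # embedding ranking
keyword  = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
fused = reciprocal_rank_fusion([semantic, keyword])
```

RRF needs no score normalization, which matters because cosine similarities and BM25 scores live on incompatible scales.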
Failure Mode 5: Metadata Filtering Errors
Symptom: Multi-tenant RAG returns data from wrong tenant, or time-filtered queries return outdated results.
Diagnosis:
# Verify metadata is propagated correctly
for doc in retrieved_docs:
    print(f"Tenant: {doc.metadata.get('tenant_id')}")
    print(f"Timestamp: {doc.metadata.get('created_at')}")
    assert doc.metadata.get('tenant_id') == current_user.tenant_id
Fixes:
# Apply metadata filters BEFORE vector search (not after)
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {
            "tenant_id": {"$eq": user.tenant_id},
            "created_at": {"$gte": "2026-01-01"}
        },
        "k": 10
    }
)
Failure Mode 6: System Prompt Drift
Symptom: LLM ignores retrieval context, generates generic responses, or violates output format constraints.
Diagnosis:
Check if retrieved context appears in LLM output:
retrieved_text = "\n".join(doc.page_content for doc in docs)
# Crude heuristic: if none of the first 10 words of the retrieved context
# appear in the output, the model likely ignored the context entirely
if not any(word in llm_output for word in retrieved_text.split()[:10]):
    print("WARNING: LLM did not use retrieved context")
Fixes:
- Use explicit instruction: "ONLY answer using the provided context"
- Implement SCAN pattern: re-inject system prompt every N turns
- Add citation requirement: "Include [doc_id] for each claim"
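A common way to apply the first and third fixes together is to rebuild the grounding prompt on every call rather than relying on a system prompt set once at the start of the conversation. A sketch under those assumptions; the `build_grounded_prompt` name and `(doc_id, text)` tuple shape are illustrative:

```python
def build_grounded_prompt(question, docs):
    """Assemble a prompt that restates the grounding rules on every call,
    so the instructions can't drift out of effect over long conversations."""
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return (
        "ONLY answer using the provided context. "
        "If the context does not contain the answer, say \"I don't know\". "
        "Include the [doc_id] of the source for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    [("KB-042", "Returns accepted within 30 days")],
)
```

Prefixing each chunk with its `[doc_id]` also makes the citation requirement checkable: you can regex the model's answer for IDs and reject responses that cite nothing.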
Failure Mode 7: Stale Vector Index
Symptom: Newly added documents don't appear in search results, or deleted documents still show up.
Diagnosis:
# Check index freshness (example: FAISS exposes index.ntotal; managed stores
# like Pinecone/Qdrant/Weaviate have their own count/stats APIs)
vectorstore_count = vectorstore.index.ntotal
database_count = db.query("SELECT COUNT(*) FROM documents").scalar()
if vectorstore_count != database_count:
    print(f"Index out of sync: {vectorstore_count} vs {database_count}")
Fixes:
- Implement incremental updates: index only new/modified documents
- Use CDC (Change Data Capture) to trigger re-indexing
- Add versioning: store an indexed_at timestamp in metadata
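Combining the incremental-update and versioning fixes: compare each document's modification time against its indexed_at timestamp to find what needs re-indexing. A sketch with plain dicts standing in for the source database and the vector store's metadata (the function name and dict shapes are assumptions):

```python
from datetime import datetime, timezone

def docs_needing_reindex(documents, index_metadata):
    """Return (stale, deleted): documents that are new or modified since they
    were last indexed, and doc_ids still in the index but gone from the source.

    documents: {doc_id: {"modified_at": datetime}} from the source database.
    index_metadata: {doc_id: {"indexed_at": datetime}} from the vector store.
    """
    stale, deleted = [], []
    for doc_id, doc in documents.items():
        meta = index_metadata.get(doc_id)
        if meta is None or doc["modified_at"] > meta["indexed_at"]:
            stale.append(doc_id)
    for doc_id in index_metadata:
        if doc_id not in documents:
            deleted.append(doc_id)
    return stale, deleted

t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale, deleted = docs_needing_reindex(
    {"a": {"modified_at": t2}, "b": {"modified_at": t1}, "c": {"modified_at": t1}},
    {"a": {"indexed_at": t1}, "b": {"indexed_at": t2}, "d": {"indexed_at": t1}},
)
```

Running this as a periodic reconciliation job catches drift even when your CDC pipeline drops an event.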
Failure Mode 8: Evaluation Blind Spots
Symptom: RAG appears to work in demos but fails in production with real user queries.
Diagnosis:
Build a test set from production failures:
# Collect ground truth pairs
test_cases = [
    {"query": "What's the SLA for P1 incidents?", "expected_answer": "4 hours", "expected_doc_id": "SLA-2024"},
    {"query": "How do I reset my password?", "expected_answer": "Click 'Forgot Password'", "expected_doc_id": "KB-001"}
]

# Measure retrieval + generation quality
for case in test_cases:
    docs = retriever.get_relevant_documents(case["query"])
    assert any(case["expected_doc_id"] in doc.metadata.get("id", "") for doc in docs), "Retrieval failed"
Fixes:
- Track retrieval accuracy: precision@k, recall@k, MRR
- Measure generation quality: BLEU, ROUGE, or LLM-as-judge
- Monitor in production: log every query where LLM says "I don't know"
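The retrieval metrics in the first fix are a few lines each. Standard definitions, shown with toy doc IDs:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved docs that are actually relevant."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def mrr(retrievals, relevant_per_query):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(retrievals, relevant_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrievals)

p = precision_at_k(["SLA-2024", "KB-001", "KB-099"], {"SLA-2024"}, k=3)
m = mrr([["KB-099", "SLA-2024"], ["KB-001"]], [{"SLA-2024"}, {"KB-001"}])
```

Track these per deploy: a drop in recall@k after re-indexing is usually the first visible sign of a chunking or embedding regression.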
Tired of debugging RAG failures with print statements? RAG Debugger gives you:
- 📊 Visual waterfall — See retrieval → rerank → generation in real-time
- 🔍 Similarity heatmaps — Spot embedding quality issues instantly
- 📈 Chunk provenance — Trace which chunks influenced each answer
- ⚡ One-click fixes — Adjust chunk size, k, embeddings without code changes
FAQ
What's the most common RAG failure mode?
Poor retrieval accuracy (Failure Mode 1) accounts for ~40% of production issues. Start by logging top-k results and checking if relevant documents are even being retrieved.
How do I know if my RAG is failing silently?
Implement observability: log every query, track when LLM says "I don't know", measure retrieval precision@k, and compare answers against ground truth test sets.
Should I use semantic search or keyword search?
Use both (hybrid search). Semantic search (embeddings) handles synonyms and paraphrasing, but keyword search (BM25) is better for exact matches like product codes or error messages.
How do I debug context window overflow?
Token-count retrieved context before sending to LLM. Target 60-70% of context window for retrieval, leaving room for system prompt and output. Use reranking to compress k=20 candidates down to k=5.