Overview
RAG systems fail in predictable ways. This guide covers 8 failure modes you'll encounter in production, with diagnostic techniques and fixes that work at scale.
Failure Mode 1: Poor Retrieval Accuracy
Symptom: The LLM hallucinates or says "I don't have enough information" despite relevant documents existing in the knowledge base.
Diagnosis:
- Log the top-k retrieved chunks for failed queries
- Calculate semantic similarity scores between query and top results
- Check if relevant documents are ranked below position k
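The checks above can be wrapped in a small audit helper. This is an illustrative sketch, not a library API: the `audit_retrieval` name, the `(doc_id, text, score)` tuple shape, and the 0.5 score threshold are all assumptions you should adapt to your stack.

```python
def audit_retrieval(query, scored_chunks, expected_doc_id=None, min_score=0.5):
    """Log top-k chunks with similarity scores and flag likely failures.

    scored_chunks: list of (doc_id, text, score), highest score first.
    """
    report = {"query": query, "low_score": False, "expected_missing": False}
    for rank, (doc_id, text, score) in enumerate(scored_chunks, start=1):
        print(f"#{rank} {doc_id} score={score:.3f} :: {text[:60]}")
    if scored_chunks and scored_chunks[0][2] < min_score:
        report["low_score"] = True  # even the best hit is only weakly related
    if expected_doc_id is not None:
        ids = [doc_id for doc_id, _, _ in scored_chunks]
        report["expected_missing"] = expected_doc_id not in ids
    return report

report = audit_retrieval(
    "What is the refund policy?",
    [("KB-042", "Returns accepted within 30 days", 0.81),
     ("KB-007", "Shipping times vary by region", 0.34)],
    expected_doc_id="SLA-2024",
)
```

Run this for every failed query and aggregate the reports: a high rate of `expected_missing` points at indexing or embedding problems; a high rate of `low_score` points at query-document vocabulary mismatch.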
Fixes:
# Adjust retrieval parameters
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance
    search_kwargs={
        "k": 10,            # Increase from the default 4
        "fetch_k": 50,      # Fetch more candidates before MMR filtering
        "lambda_mult": 0.7  # Relevance vs. diversity trade-off (1.0 = pure relevance)
    }
)
Failure Mode 2: Context Window Overflow
Symptom: Truncation warnings, incomplete answers, or API errors about token limits.
Diagnosis:
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")
total_tokens = sum(len(encoding.encode(doc.page_content)) for doc in retrieved_docs)
print(f"Retrieved context: {total_tokens} tokens") # Should be < 70% of limit
Fixes:
- Use reranking to compress k=20 candidates down to the 5 highest-quality chunks
- Implement context compression with extractive summarization
- Switch to models with larger context windows (Claude 200K vs GPT-4 128K)
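The reranking fix can be sketched generically. In practice `score_fn` would be a cross-encoder (e.g. sentence-transformers' `CrossEncoder.predict`); the token-overlap scorer below is only a stand-in so the example is self-contained, and the `rerank` name is illustrative.

```python
def rerank(query, candidates, score_fn, keep=5):
    """Rerank retrieved candidates and keep only the highest-quality ones.

    score_fn(query, text) -> float; higher means more relevant.
    """
    scored = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:keep]

def overlap(query, text):
    """Toy relevance score: fraction of query tokens present in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

docs = [f"doc about topic {i}" for i in range(18)] + [
    "refund policy details", "refund policy for returns"]
top = rerank("what is the refund policy", docs, overlap, keep=5)
```

Reranking lets you cast a wide retrieval net (high recall) and then spend a more expensive model only on the shortlist, which keeps the final context small.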
Failure Mode 3: Suboptimal Chunking Strategy
Symptom: Answers cut off mid-sentence, or critical context is split across chunks.
Diagnosis:
Inspect retrieved chunks and check for:
- Chunks that end abruptly without sentence boundaries
- Related concepts split across multiple chunks
- Orphaned pronouns ("it", "this") without antecedents
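The inspection checklist above is easy to automate. A minimal sketch (the `chunk_health` name, the punctuation set, and the pronoun list are illustrative assumptions):

```python
import re

def chunk_health(chunks):
    """Flag chunks with the symptoms above: abrupt endings and orphaned pronouns."""
    issues = []
    for i, chunk in enumerate(chunks):
        text = chunk.strip()
        if text and text[-1] not in ".!?\"'":
            issues.append((i, "no sentence boundary at end"))
        # A leading pronoun usually means the antecedent lives in the previous chunk.
        if re.match(r"^(it|this|they|these|those)\b", text, re.IGNORECASE):
            issues.append((i, "starts with orphaned pronoun"))
    return issues

issues = chunk_health([
    "The refund window is 30 days.",
    "This applies to all purchases made onl",  # cut mid-word, pronoun start
])
```

Running this over a sample of your corpus before indexing gives a quick health score for a candidate chunking configuration.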
Fixes:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # Critical: preserve context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # Try semantic boundaries first
    length_function=len
)
Failure Mode 4: Embedding Model Mismatch
Symptom: Query-document similarity scores are artificially low, or irrelevant results rank highly.
Diagnosis:
# Compare query and document representations
query_embedding = embed_model.embed_query("What is the refund policy?")
doc_embedding = embed_model.embed_documents(["Returns accepted within 30 days"])
from numpy import dot
from numpy.linalg import norm
similarity = dot(query_embedding, doc_embedding[0]) / (norm(query_embedding) * norm(doc_embedding[0]))
print(f"Cosine similarity: {similarity:.3f}") # Should be > 0.5 for relevant pairs
Fixes:
- Use domain-specific embeddings (e.g., msmarco-MiniLM for search, legal-bert for contracts)
- Fine-tune on your query-document pairs with hard negatives
- Implement hybrid search: combine embeddings with BM25 keyword search
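One common way to combine the two rankings in hybrid search is Reciprocal Rank Fusion (RRF). A self-contained sketch, with toy doc IDs in place of real retriever output:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked doc-id lists (e.g. one from embeddings, one from BM25).

    Each doc scores sum(1 / (k + rank)); k=60 is the constant from the
    original RRF paper and dampens the impact of any single ranker.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]   # embedding ranking
keyword  = ["doc_c", "doc_a", "doc_d"]   # BM25 ranking
fused = reciprocal_rank_fusion([semantic, keyword])
```

RRF needs no score normalization, which matters because cosine similarities and BM25 scores live on incompatible scales.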
Failure Mode 5: Metadata Filtering Errors
Symptom: Multi-tenant RAG returns data from wrong tenant, or time-filtered queries return outdated results.
Diagnosis:
# Verify metadata is propagated correctly
for doc in retrieved_docs:
    print(f"Tenant: {doc.metadata.get('tenant_id')}")
    print(f"Timestamp: {doc.metadata.get('created_at')}")
    assert doc.metadata.get('tenant_id') == current_user.tenant_id
Fixes:
# Apply metadata filters BEFORE vector search (not after)
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {
            "tenant_id": {"$eq": user.tenant_id},
            "created_at": {"$gte": "2026-01-01"}
        },
        "k": 10
    }
)
Failure Mode 6: System Prompt Drift
Symptom: LLM ignores retrieval context, generates generic responses, or violates output format constraints.
Diagnosis:
Check if retrieved context appears in LLM output:
retrieved_text = "\n".join(doc.page_content for doc in docs)
# Crude heuristic: if none of the first 10 words of the retrieved context
# appear in the output, the model likely ignored the context entirely
if not any(word in llm_output for word in retrieved_text.split()[:10]):
    print("WARNING: LLM did not use retrieved context")
Fixes:
- Use explicit instruction: "ONLY answer using the provided context"
- Implement SCAN pattern: re-inject system prompt every N turns
- Add citation requirement: "Include [doc_id] for each claim"
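A common way to apply the first and third fixes together is to rebuild the grounding prompt on every call rather than relying on a system prompt set once at the start of the conversation. A sketch under those assumptions; the `build_grounded_prompt` name and `(doc_id, text)` tuple shape are illustrative:

```python
def build_grounded_prompt(question, docs):
    """Assemble a prompt that restates the grounding rules on every call,
    so the instructions can't drift out of effect over long conversations."""
    context = "\n\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return (
        "ONLY answer using the provided context. "
        "If the context does not contain the answer, say \"I don't know\". "
        "Include the [doc_id] of the source for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_grounded_prompt(
    "What is the refund window?",
    [("KB-042", "Returns accepted within 30 days")],
)
```

Prefixing each chunk with its `[doc_id]` also makes the citation requirement checkable: you can regex the model's answer for IDs and reject responses that cite nothing.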
Failure Mode 7: Stale Vector Index
Symptom: Newly added documents don't appear in search results, or deleted documents still show up.
Diagnosis:
# Check index freshness (example: FAISS exposes index.ntotal; managed stores
# like Pinecone/Qdrant/Weaviate have their own count/stats APIs)
vectorstore_count = vectorstore.index.ntotal
database_count = db.query("SELECT COUNT(*) FROM documents").scalar()
if vectorstore_count != database_count:
    print(f"Index out of sync: {vectorstore_count} vs {database_count}")
Fixes:
- Implement incremental updates: index only new/modified documents
- Use CDC (Change Data Capture) to trigger re-indexing
- Add versioning: store an indexed_at timestamp in metadata
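Combining the incremental-update and versioning fixes: compare each document's modification time against its indexed_at timestamp to find what needs re-indexing. A sketch with plain dicts standing in for the source database and the vector store's metadata (the function name and dict shapes are assumptions):

```python
from datetime import datetime, timezone

def docs_needing_reindex(documents, index_metadata):
    """Return (stale, deleted): documents that are new or modified since they
    were last indexed, and doc_ids still in the index but gone from the source.

    documents: {doc_id: {"modified_at": datetime}} from the source database.
    index_metadata: {doc_id: {"indexed_at": datetime}} from the vector store.
    """
    stale, deleted = [], []
    for doc_id, doc in documents.items():
        meta = index_metadata.get(doc_id)
        if meta is None or doc["modified_at"] > meta["indexed_at"]:
            stale.append(doc_id)
    for doc_id in index_metadata:
        if doc_id not in documents:
            deleted.append(doc_id)
    return stale, deleted

t1 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t2 = datetime(2024, 6, 1, tzinfo=timezone.utc)
stale, deleted = docs_needing_reindex(
    {"a": {"modified_at": t2}, "b": {"modified_at": t1}, "c": {"modified_at": t1}},
    {"a": {"indexed_at": t1}, "b": {"indexed_at": t2}, "d": {"indexed_at": t1}},
)
```

Running this as a periodic reconciliation job catches drift even when your CDC pipeline drops an event.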
Failure Mode 8: Evaluation Blind Spots
Symptom: RAG appears to work in demos but fails in production with real user queries.
Diagnosis:
Build a test set from production failures:
# Collect ground truth pairs
test_cases = [
    {"query": "What's the SLA for P1 incidents?", "expected_answer": "4 hours", "expected_doc_id": "SLA-2024"},
    {"query": "How do I reset my password?", "expected_answer": "Click 'Forgot Password'", "expected_doc_id": "KB-001"}
]

# Measure retrieval + generation quality
for case in test_cases:
    docs = retriever.get_relevant_documents(case["query"])
    assert any(case["expected_doc_id"] in doc.metadata.get("id", "") for doc in docs), "Retrieval failed"
Fixes:
- Track retrieval accuracy: precision@k, recall@k, MRR
- Measure generation quality: BLEU, ROUGE, or LLM-as-judge
- Monitor in production: log every query where LLM says "I don't know"
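The retrieval metrics in the first fix are a few lines each. Standard definitions, shown with toy doc IDs:

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved docs that are actually relevant."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / k

def mrr(retrievals, relevant_per_query):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(retrievals, relevant_per_query):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(retrievals)

p = precision_at_k(["SLA-2024", "KB-001", "KB-099"], {"SLA-2024"}, k=3)
m = mrr([["KB-099", "SLA-2024"], ["KB-001"]], [{"SLA-2024"}, {"KB-001"}])
```

Track these per deploy: a drop in recall@k after re-indexing is usually the first visible sign of a chunking or embedding regression.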
Tired of debugging RAG failures with print statements? RAG Debugger gives you:
- 📊 Visual waterfall — See retrieval → rerank → generation in real-time
- 🔍 Similarity heatmaps — Spot embedding quality issues instantly
- 📈 Chunk provenance — Trace which chunks influenced each answer
- ⚡ One-click fixes — Adjust chunk size, k, embeddings without code changes
FAQ
What's the most common RAG failure mode?
Poor retrieval accuracy (Failure Mode 1) accounts for ~40% of production issues. Start by logging top-k results and checking if relevant documents are even being retrieved.
How do I know if my RAG is failing silently?
Implement observability: log every query, track when LLM says "I don't know", measure retrieval precision@k, and compare answers against ground truth test sets.
Should I use semantic search or keyword search?
Use both (hybrid search). Semantic search (embeddings) handles synonyms and paraphrasing, but keyword search (BM25) is better for exact matches like product codes or error messages.
How do I debug context window overflow?
Token-count retrieved context before sending to LLM. Target 60-70% of context window for retrieval, leaving room for system prompt and output. Use reranking to compress k=20 candidates down to k=5.