LlamaIndex provides powerful abstractions for RAG applications. But when your query engine returns poor results, debugging can be frustrating.
Is the issue with node parsing? Retriever configuration? Response synthesizer? Without understanding LlamaIndex's architecture, you're guessing.
This guide covers common LlamaIndex RAG issues with working fixes.
🔧 Visual RAG Debugging
Use rag-debugger.pages.dev to visualize LlamaIndex query outputs. Paste retrieved nodes and responses to identify failures. Free: 10 sessions/month.
LlamaIndex RAG Architecture
Understanding the data flow helps isolate issues:
Indexing:  Documents → NodeParser → Nodes → VectorStore
Querying:  Query → QueryEngine → Retriever (reads the VectorStore) → Nodes → ResponseSynthesizer → Response
Each stage can introduce problems. Let's debug them systematically.
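Before digging into individual issues, it helps to have a minimal baseline pipeline you can poke at. The sketch below assumes OpenAI credentials and a local data/ directory of files; both are illustrative placeholders.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding()
# Indexing: Documents → NodeParser (default SentenceSplitter) → Nodes → VectorStore
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
# Querying: Query → Retriever → Nodes → ResponseSynthesizer → Response
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("your test query")
print(response)
print(response.source_nodes[0].node.text[:200])  # inspect what the answer was grounded on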
Issue 1: Node Parsing Problems
Problem: Chunks are too small, too large, or split at wrong boundaries
Symptoms:
- Retrieved nodes lack context
- Code examples split from explanations
- Tables fragmented across multiple nodes
Debug:
from llama_index.core import Document
from llama_index.core.node_parser import SimpleNodeParser
# Create parser
parser = SimpleNodeParser.from_defaults(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n"
)
# Parse and inspect
nodes = parser.get_nodes_from_documents(documents)
print(f"Total nodes: {len(nodes)}")
print(f"Avg chunk size: {sum(len(n.text) for n in nodes) / len(nodes):.0f} chars")
# Inspect first few nodes
for i, node in enumerate(nodes[:5]):
    print(f"\n--- Node {i+1} ---")
    print(f"Size: {len(node.text)} chars")
    print(f"Metadata: {node.metadata}")
    print(f"Content preview: {node.text[:200]}...")
Fixes:
# Fix 1: Use semantic chunking for better boundaries
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding
semantic_parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)
nodes = semantic_parser.get_nodes_from_documents(documents)
# Fix 2: Use code-aware parser for technical docs
from llama_index.core.node_parser import CodeSplitter
code_parser = CodeSplitter(
    language="python",  # or "javascript", "java", etc.
    chunk_lines=50,
    chunk_lines_overlap=10
)
# Fix 3: Use hierarchical parsing for long documents
from llama_index.core.node_parser import HierarchicalNodeParser
hierarchical_parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128]  # Parent → child sizes
)
nodes = hierarchical_parser.get_nodes_from_documents(documents)
# Returns nodes at multiple levels for parent-child retrieval
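Those multi-level nodes only pay off if retrieval can merge leaf hits back into their parents. Here is a minimal parent-child retrieval sketch using AutoMergingRetriever; it assumes the hierarchical nodes produced above and is one common wiring, not the only one.

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.storage.docstore import SimpleDocumentStore
# Keep every level in the docstore, but only embed and index the leaf nodes
docstore = SimpleDocumentStore()
docstore.add_documents(nodes)
storage_context = StorageContext.from_defaults(docstore=docstore)
leaf_index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)
# When enough sibling leaves match a query, they are merged into their parent node
base_retriever = leaf_index.as_retriever(similarity_top_k=6)
retriever = AutoMergingRetriever(base_retriever, storage_context, verbose=True)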
Issue 2: Retriever Returns Poor Results
Problem: Top-K nodes are irrelevant or low quality
Debug:
from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor
# Create index and retriever
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=10)
# Test retrieval
query = "your test query"
results = retriever.retrieve(query)
# Optionally drop low-scoring nodes (similarity_cutoff belongs to the postprocessor, not the retriever)
results = SimilarityPostprocessor(similarity_cutoff=0.5).postprocess_nodes(results)
print(f"Retrieved {len(results)} nodes\n")
for i, node in enumerate(results):
    print(f"{i+1}. Score: {node.score:.3f}")
    print(f"   Text: {node.node.text[:150]}...")
    print(f"   Metadata: {node.node.metadata}\n")
# Check score distribution
scores = [n.score for n in results]
print(f"Score stats: min={min(scores):.3f}, max={max(scores):.3f}, avg={sum(scores)/len(scores):.3f}")
Fixes:
# Fix 1: Adjust the similarity cutoff (applied as a postprocessor)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.3)  # Lower for recall, higher for precision
    ]
)
# Fix 2: Use MMR (maximal marginal relevance) for diversity
retriever = index.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=10,
    vector_store_kwargs={"mmr_threshold": 0.7}  # trade-off between relevance and diversity
)
# Fix 3: Add reranking
from llama_index.core.postprocessor import SentenceTransformerRerank
reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=5
)
query_engine = index.as_query_engine(
    node_postprocessors=[reranker],
    similarity_top_k=20  # Retrieve more, rerank to top 5
)
# Fix 4: Combine BM25 and dense retrieval with query fusion
from llama_index.retrievers.bm25 import BM25Retriever  # requires llama-index-retrievers-bm25
from llama_index.core.retrievers import QueryFusionRetriever
bm25 = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
vector = index.as_retriever(similarity_top_k=5)
fusion_retriever = QueryFusionRetriever(
    [bm25, vector],
    similarity_top_k=10,
    num_queries=1,  # skip LLM query generation, just fuse the two result sets
    mode="relative_score",
    retriever_weights=[0.3, 0.7]  # BM25 vs. dense
)
Issue 3: Response Synthesis Problems
Problem: Answer ignores retrieved context or hallucinates
Debug:
from llama_index.core import get_response_synthesizer
# Check current synthesizer
synthesizer = get_response_synthesizer(llm=llm)
response = synthesizer.synthesize(
    query="your query",
    nodes=retrieved_nodes
)
print("Response:")
print(response)
print("\nSource nodes:")
for node in response.source_nodes:
    print(f"- Score: {node.score:.3f}: {node.node.text[:100]}...")
Fixes:
# Fix 1: Use refine mode for better grounding
from llama_index.core.response_synthesizers import ResponseMode
query_engine = index.as_query_engine(
    response_mode="refine",  # Iteratively refine the answer with each node
    similarity_top_k=5
)
# Fix 2: Custom system prompt for grounding
from llama_index.core import get_response_synthesizer
from llama_index.core.prompts import PromptTemplate
grounded_template = PromptTemplate("""\
You are an assistant that answers questions based ONLY on the provided context.
- If the answer is not in the context, say "I don't have enough information."
- Cite sources using [Source X] notation.
- Quote exact passages when making factual claims.
Context:
{context_str}
Question: {query_str}
Answer: """)
synthesizer = get_response_synthesizer(
    llm=llm,
    text_qa_template=grounded_template,
    response_mode="compact"
)
# Use it via index.as_query_engine(response_synthesizer=synthesizer)
# Fix 3: Require citations with CitationQueryEngine
from llama_index.core.query_engine import CitationQueryEngine
query_engine = CitationQueryEngine.from_defaults(
    index,
    similarity_top_k=5,
    citation_chunk_size=512  # granularity of the numbered source chunks it cites
)
# Fix 4: Lower temperature for factual queries
from llama_index.llms.openai import OpenAI
llm = OpenAI(
    model="gpt-3.5-turbo",
    temperature=0,  # Deterministic
    max_tokens=1000
)
Issue 4: Query Engine Configuration
Problem: Wrong query engine type for use case
Query Engine Comparison:
| Type | Best For | Config |
|---|---|---|
| VectorQueryEngine | Standard RAG | index.as_query_engine() |
| RetrieverQueryEngine | Custom retrievers | RetrieverQueryEngine(retriever, synthesizer) |
| SubQuestionQueryEngine | Multi-hop reasoning | SubQuestionQueryEngine.from_defaults() |
| CitationQueryEngine | Verified answers with citations | CitationQueryEngine.from_defaults() |
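For the RetrieverQueryEngine row, manual assembly looks roughly like this. It's a sketch that reuses the reranker from Issue 2 and assumes an existing index; the synthesizer settings are placeholders.

from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
# Wire a custom retriever, synthesizer, and postprocessors together explicitly
retriever = index.as_retriever(similarity_top_k=10)
synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[reranker]  # e.g. the SentenceTransformerRerank from Issue 2
)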
Example: Multi-Hop with Sub-Questions
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool
# Create query engine tools for each data source
product_engine = product_index.as_query_engine()
pricing_engine = pricing_index.as_query_engine()
tools = [
    QueryEngineTool.from_defaults(
        query_engine=product_engine,
        name="product_info",
        description="Product specifications and features"
    ),
    QueryEngineTool.from_defaults(
        query_engine=pricing_engine,
        name="pricing_info",
        description="Pricing tiers and costs"
    )
]
# Sub-question engine will decompose complex queries
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    llm=llm,
    verbose=True
)
response = query_engine.query("What's the price of the Pro plan with advanced features?")
# Automatically asks: "What is the Pro plan?" → "What are advanced features?" → "What's the price?"
Issue 5: Metadata Filtering Not Working
Problem: Filters don't narrow results as expected
Debug:
from llama_index.core import VectorStoreIndex
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo
# Describe the metadata schema (used by VectorIndexAutoRetriever, and a handy
# reference for which keys your filters can target)
vector_store_info = VectorStoreInfo(
    content_info="Technical documentation",
    metadata_info=[
        MetadataInfo(name="doc_type", type="str", description="Type of document"),
        MetadataInfo(name="version", type="str", description="Document version"),
        MetadataInfo(name="created_at", type="int", description="Unix timestamp"),
    ]
)
# Inspect node metadata
for node in nodes[:3]:
    print(f"Node metadata: {node.metadata}")
# Test a filter
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterCondition, FilterOperator
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="doc_type", value="api_docs", operator=FilterOperator.EQ),
        MetadataFilter(key="version", value="2.0", operator=FilterOperator.GTE),
    ],
    condition=FilterCondition.AND
)
retriever = index.as_retriever(filters=filters)
results = retriever.retrieve("your query")
print(f"Filtered results: {len(results)}")
Fixes:
# Fix 1: Ensure metadata is indexed
from llama_index.core import StorageContext, VectorStoreIndex
# Re-index with metadata
for node in nodes:
    # Ensure metadata is serializable
    node.metadata = {
        k: str(v) if not isinstance(v, (str, int, float, bool)) else v
        for k, v in node.metadata.items()
    }
# Rebuild index
index = VectorStoreIndex(nodes)
# Fix 2: Combine a metadata filter with a score threshold
# (a similarity cutoff is not a metadata filter; apply it as a postprocessor)
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
from llama_index.core.postprocessor import SimilarityPostprocessor
filters = MetadataFilters(filters=[ExactMatchFilter(key="doc_type", value="api_docs")])
retriever = index.as_retriever(filters=filters, similarity_top_k=10)
results = retriever.retrieve("your query")
results = SimilarityPostprocessor(similarity_cutoff=0.5).postprocess_nodes(results)
# Fix 3: Post-filter in Python if the vector store doesn't support pre-filtering
retriever = index.as_retriever(similarity_top_k=20)  # over-retrieve, then filter
results = retriever.retrieve("your query")
filtered = [
    n for n in results
    if n.node.metadata.get("doc_type") == "api_docs"
]
Debugging Utilities
Utility 1: Query Trace Logger
import json
from datetime import datetime
from llama_index.core.callbacks import CallbackManager, CBEventType, EventPayload
from llama_index.core.callbacks.base_handler import BaseCallbackHandler
class QueryTraceHandler(BaseCallbackHandler):
    """Logs each query, its retrieved nodes, and the final response as JSONL."""
    def __init__(self, log_file: str = "query_traces.jsonl"):
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self.log_file = log_file
        self.current_query = None
        self.retrieved_nodes = []
    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs):
        if event_type == CBEventType.QUERY and payload:
            self.current_query = payload.get(EventPayload.QUERY_STR)
            self.retrieved_nodes = []
        return event_id
    def on_event_end(self, event_type, payload=None, event_id="", **kwargs):
        if event_type == CBEventType.RETRIEVE and payload:
            self.retrieved_nodes = [
                {"text": n.node.text[:200], "score": n.score, "metadata": n.node.metadata}
                for n in payload.get(EventPayload.NODES, [])
            ]
        elif event_type == CBEventType.QUERY and payload:
            response = payload.get(EventPayload.RESPONSE)
            trace = {
                "timestamp": datetime.now().isoformat(),
                "query": self.current_query,
                "retrieved_nodes": self.retrieved_nodes,
                "response": str(response),
                "source_count": len(response.source_nodes) if hasattr(response, "source_nodes") else 0
            }
            with open(self.log_file, "a") as f:
                f.write(json.dumps(trace) + "\n")
    # Required by the BaseCallbackHandler interface; no per-trace bookkeeping needed here
    def start_trace(self, trace_id=None):
        pass
    def end_trace(self, trace_id=None, trace_map=None):
        pass
# Usage: register globally (before building the index) so retrieval events are captured too
from llama_index.core import Settings
trace_handler = QueryTraceHandler()
Settings.callback_manager = CallbackManager([trace_handler])
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine()
response = query_engine.query("your query")
Utility 2: Response Quality Checker
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
class ResponseQualityChecker:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
    def check(self, query: str, response: str, source_nodes: list) -> dict:
        """Check response quality metrics"""
        # 1. Query-response relevance
        query_response_sim = float(cosine_similarity(
            self.model.encode([query]),
            self.model.encode([response])
        )[0][0])
        # 2. Source coverage (does the response actually use the sources?)
        response_sentences = [s.strip() for s in response.split('.') if len(s.strip()) > 10]
        source_texts = [n.node.text for n in source_nodes]
        covered = 0
        if response_sentences and source_texts:
            source_embeddings = self.model.encode(source_texts)  # encode once, reuse per sentence
            for sent in response_sentences:
                max_sim = max(
                    cosine_similarity(self.model.encode([sent]), source_embeddings)[0]
                )
                if max_sim > 0.5:
                    covered += 1
        coverage_rate = covered / len(response_sentences) if response_sentences else 0
        # 3. Confidence (based on retrieval scores; missing scores count as 0)
        source_scores = [n.score or 0.0 for n in source_nodes] if source_nodes else [0.0]
        avg_confidence = float(np.mean(source_scores))
        return {
            "query_response_similarity": query_response_sim,
            "source_coverage": coverage_rate,
            "confidence": avg_confidence,
            "quality_score": (query_response_sim + coverage_rate + avg_confidence) / 3
        }
# Usage
checker = ResponseQualityChecker()
quality = checker.check(query, str(response), response.source_nodes)
print(f"Quality Score: {quality['quality_score']:.2f}")
print(f" - Query-Response Sim: {quality['query_response_similarity']:.2f}")
print(f" - Source Coverage: {quality['source_coverage']:.2f}")
print(f" - Confidence: {quality['confidence']:.2f}")
if quality['quality_score'] < 0.5:
    print("⚠️ Low quality response - consider retrieving more/better sources")
🚀 Visual RAG Debugging for LlamaIndex
RAG Debugger works with LlamaIndex:
- Visualize retrieved nodes with scores
- Detect hallucination in responses
- Compare different query engine configs
- Export traces for team review
Try 10 free debug sessions → rag-debugger.pages.dev
Quick Reference: Configuration Guide
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# Index with reranking
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5)
    ],
    response_mode="refine"
)
Conclusion
Debugging LlamaIndex RAG requires understanding each component:
- Node parsing: Use semantic or hierarchical parsers for better chunks
- Retrieval: Tune similarity cutoff, use hybrid or MMR for diversity
- Response synthesis: Use refine mode with grounded prompts
- Query engine selection: Match engine type to use case
- Metadata filtering: Ensure proper indexing and filter syntax
For faster debugging, try RAG Debugger — a visual tool that analyzes LlamaIndex traces and detects failures automatically. Start with 10 free sessions at rag-debugger.pages.dev.