Overview
LlamaIndex has unique abstractions (nodes, query engines, response synthesizers) that introduce framework-specific bugs. This guide covers debugging techniques for LlamaIndex RAG pipelines.
Issue 1: Query Engine Mode Selection
Problem: LlamaIndex offers many response modes (compact, refine, tree_summarize, accumulate, and more). The wrong mode causes slow responses or incorrect answers.
Diagnosis: Compare Modes Side-by-Side
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Test different response modes with the same query
query = "What is the company's refund policy?"
modes = ["compact", "refine", "tree_summarize", "simple_summarize"]  # "compact" is the library default

for mode in modes:
    engine = index.as_query_engine(response_mode=mode)
    response = engine.query(query)
    print(f"\n=== {mode.upper()} ===")
    print(f"Answer: {response.response}")
    print(f"Source nodes: {len(response.source_nodes)}")
```
Fix: Mode Selection Matrix
- compact (the default): packs retrieved chunks into as few LLM calls as the context window allows; good for direct answers (e.g., "What is X?")
- refine: iterates chunk by chunk, refining the answer at each step (slow but accurate)
- tree_summarize: hierarchical, bottom-up summarization; best for long docs
- simple_summarize: truncates all chunks into a single LLM call; fast but lossy
- accumulate: generates a separate answer per chunk, then concatenates them
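As a rule of thumb, that matrix can be encoded in a small heuristic. `choose_response_mode` below is a hypothetical helper, not a LlamaIndex API; the keyword lists are illustrative and should be tuned to your queries. (`"compact"` is LlamaIndex's actual default mode.)

```python
def choose_response_mode(query: str) -> str:
    """Pick a response_mode string from simple query heuristics (illustrative only)."""
    q = query.lower()
    # Summary-style questions benefit from hierarchical synthesis
    if any(kw in q for kw in ("summarize", "summary", "overview")):
        return "tree_summarize"
    # Accuracy-critical phrasing: refine iterates chunk by chunk
    if any(kw in q for kw in ("exact", "precisely", "step by step")):
        return "refine"
    # Otherwise stick with the library default: fast and context-aware
    return "compact"
```

You would then pass the result straight through, e.g. `index.as_query_engine(response_mode=choose_response_mode(query))`.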
Issue 2: Response Synthesizer Context Overflow
Problem: LlamaIndex retrieves 10 nodes × 512 tokens = 5120 tokens, but LLM context window is 4096. Response truncated.
Diagnosis: Token Counting
```python
from llama_index.core import ServiceContext  # deprecated in llama_index >= 0.10; newer code uses Settings
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")  # 4096-token context window
service_context = ServiceContext.from_defaults(llm=llm)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    service_context=service_context,
)
response = query_engine.query("What is the refund policy?")

# Check how much context was used (~1.3 tokens per word is a rough average)
total_words = sum(len(node.get_content().split()) for node in response.source_nodes)
print(f"Retrieved context: ~{total_words * 1.3:.0f} tokens")
print("LLM limit: 4096 tokens")
if total_words * 1.3 > 3000:  # leave room for the prompt template + output
    print("⚠️ Context overflow likely")
```
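The same arithmetic can be turned into a small budget calculator. A sketch, assuming ~512-token chunks and fixed reserves for the prompt template and completion; the overhead numbers are illustrative, not tiktoken-accurate.

```python
def max_safe_top_k(chunk_size_tokens: int, context_window: int,
                   prompt_overhead: int = 500, output_budget: int = 600) -> int:
    """Largest similarity_top_k whose retrieved context still fits the window.

    Reserves prompt_overhead tokens for the prompt template and output_budget
    tokens for the completion, then divides what is left by the chunk size.
    """
    available = context_window - prompt_overhead - output_budget
    return max(available // chunk_size_tokens, 1)
```

With 512-token chunks and a 4096-token window this yields 5, which is why dropping similarity_top_k from 10 to 5 resolves the overflow.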
Fix: Adjust similarity_top_k or Use Compact Mode
```python
query_engine = index.as_query_engine(
    similarity_top_k=5,       # reduce from 10
    response_mode="compact",  # auto-fit the context window
    service_context=service_context,
)
```
Issue 3: Node Parser Chunking Errors
Problem: Default SentenceSplitter breaks code blocks, tables, or lists across chunks.
Diagnosis: Inspect Nodes
```python
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

for i, node in enumerate(nodes[:5]):
    print(f"\n=== Node {i} ===")
    print(f"Text: {node.get_content()[:200]}...")
    print(f"Metadata: {node.metadata}")
    # An odd number of ``` fences means a code block was split across nodes
    if node.get_content().count("```") % 2 != 0:
        print("⚠️ WARNING: Code block split across nodes")
```
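The per-node check only flags chunks with an odd fence count; tracking fence state across the whole ordered sequence finds every chunk that *begins* inside a code block. A dependency-free sketch over plain strings:

```python
FENCE = "`" * 3  # triple-backtick code-fence marker

def chunks_inside_code_block(chunks: list[str]) -> list[int]:
    """Return indices of chunks that start inside an unclosed code fence."""
    flagged = []
    inside = False
    for i, chunk in enumerate(chunks):
        if inside:
            flagged.append(i)  # this chunk begins mid code block
        if chunk.count(FENCE) % 2 != 0:  # an odd fence count flips the state
            inside = not inside
    return flagged
```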
Fix: Use Semantic Splitter
```python
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Splits only where semantic similarity between adjacent sentences drops
parser = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(),       # required: used to embed sentence groups
    buffer_size=1,                       # sentences grouped together when comparing boundaries
    breakpoint_percentile_threshold=95,  # split where dissimilarity exceeds the 95th percentile
)
nodes = parser.get_nodes_from_documents(documents)
```
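To see what the semantic splitter is doing under the hood, here is a toy sketch of the core idea: score the similarity between adjacent sentences and break wherever it falls below a threshold. Word-overlap (Jaccard) stands in for embedding cosine similarity, and the fixed threshold stands in for the percentile cutoff; purely illustrative.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a crude stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def semantic_split(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive sentences; start a new chunk when similarity drops."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if jaccard(prev, cur) < threshold:
            chunks.append([cur])       # similarity dropped: chunk boundary
        else:
            chunks[-1].append(cur)     # same topic: extend current chunk
    return chunks
```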
Issue 4: Metadata Extraction Failures
Problem: LlamaIndex auto-extracts metadata (summaries, keywords, questions) but fails on unstructured text.
Diagnosis: Check Extracted Metadata
```python
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor,
)
from llama_index.core.ingestion import IngestionPipeline

extractors = [
    SummaryExtractor(summaries=["prev", "self"]),
    QuestionsAnsweredExtractor(questions=3),
    KeywordExtractor(keywords=10),
]

pipeline = IngestionPipeline(transformations=extractors)
nodes = pipeline.run(documents=documents)

for node in nodes[:3]:
    print("\n=== Node Metadata ===")
    print(f"Summary: {node.metadata.get('section_summary')}")
    print(f"Questions: {node.metadata.get('questions_this_excerpt_can_answer')}")
    print(f"Keywords: {node.metadata.get('excerpt_keywords')}")
```
Fix: Custom Metadata Extractors
```python
from llama_index.core.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes):
        # BaseExtractor expects one metadata dict per node; the pipeline
        # merges each dict into the matching node's metadata.
        return [
            {"custom_field": extract_custom_info(node.get_content())}  # your own logic
            for node in nodes
        ]

pipeline = IngestionPipeline(
    transformations=[CustomExtractor(), SummaryExtractor()]
)
```
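`extract_custom_info` above is a placeholder for your own logic. As one concrete (hypothetical) possibility, a dependency-free frequency-based keyword picker:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for"}

def extract_custom_info(text: str, top_n: int = 5) -> str:
    """Return the top_n most frequent non-stopword terms, comma-separated."""
    words = re.findall(r"[a-z]{3,}", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return ", ".join(word for word, _ in counts.most_common(top_n))
```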
Issue 5: Chat Engine Memory Management
Problem: CondenseQuestionChatEngine stores entire conversation history, causing context overflow after 20+ turns.
Diagnosis: Track Conversation Length
```python
from llama_index.core.chat_engine import CondenseQuestionChatEngine
from llama_index.core.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=2000)
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    memory=memory,
)

for i in range(50):
    response = chat_engine.chat(f"Question {i}")
    # Check memory usage (~1.3 tokens per word is a rough average)
    history = memory.get_all()
    total_tokens = sum(len(msg.content.split()) * 1.3 for msg in history)
    print(f"Turn {i}: {len(history)} messages, ~{total_tokens:.0f} tokens")
```
Fix: Use Token-Limited Memory
```python
from llama_index.core.memory import ChatMemoryBuffer

# Automatically evicts the oldest messages once the limit is exceeded.
# Note: tokenizer_fn must return a token *list* (the buffer calls len() on it);
# omit it to use the default tiktoken-based tokenizer.
memory = ChatMemoryBuffer.from_defaults(
    token_limit=2000,                        # hard limit
    tokenizer_fn=lambda text: text.split(),  # crude whitespace tokenizer
)
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    memory=memory,
)
```
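The eviction behavior itself is easy to sketch: append new messages, then drop from the oldest end while the running token estimate exceeds the budget. A simplified stand-in for illustration, not the library's implementation:

```python
from collections import deque

def rough_tokens(text: str) -> int:
    """Crude token estimate: ~1.3 tokens per whitespace word."""
    return int(len(text.split()) * 1.3)

class TokenLimitedMemory:
    """Keeps only the most recent messages within a token budget (sketch)."""

    def __init__(self, token_limit: int = 2000):
        self.token_limit = token_limit
        self.messages: deque[str] = deque()

    def put(self, message: str) -> None:
        self.messages.append(message)
        # Evict from the oldest end until we are back under budget
        while (sum(map(rough_tokens, self.messages)) > self.token_limit
               and len(self.messages) > 1):
            self.messages.popleft()
```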
Issue 6: Retriever vs Query Engine Confusion
Problem: LlamaIndex has both retrievers (return nodes) and query engines (return a synthesized response). Using the wrong one causes type errors.
Diagnosis: Understand Abstractions
```python
# Retriever: returns nodes (scored document chunks)
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("What is the refund policy?")
print(type(nodes))  # list of NodeWithScore

# Query engine: returns a synthesized response
query_engine = index.as_query_engine()
response = query_engine.query("What is the refund policy?")
print(type(response))  # Response object with .response and .source_nodes
```
Fix: Use Correct Abstraction
```python
# For custom post-processing: use a retriever
retriever = index.as_retriever()
nodes = retriever.retrieve(query)
filtered_nodes = [n for n in nodes if n.score > 0.7]
# ...then manually synthesize a response from filtered_nodes

# For end-to-end RAG: use a query engine
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response.response)
```
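"Manually synthesize" in the retriever path just means assembling your own prompt from the filtered nodes and sending it to the LLM. A minimal sketch over plain (text, score) pairs; `build_context_prompt` is a hypothetical helper, not a LlamaIndex API:

```python
def build_context_prompt(chunks: list[tuple[str, float]], query: str,
                         min_score: float = 0.7) -> str:
    """Assemble a stuffing-style prompt from scored chunks above min_score."""
    kept = [text for text, score in chunks if score > min_score]
    context = "\n---\n".join(kept)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```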
Issue 7: Vector Store Index vs Document Summary Index
Problem: Choosing wrong index type causes slow queries or irrelevant results.
When to Use Each Index
- VectorStoreIndex: semantic (embedding) search; the default choice for most query types
- SummaryIndex: scans every node sequentially; good for "summarize all docs", slow for lookups
- TreeIndex: hierarchical summarization over large corpora
- KeywordTableIndex: exact keyword matching; fast lookups, no semantic matching
Fix: Hybrid Index
```python
from llama_index.core import VectorStoreIndex, KeywordTableIndex
from llama_index.core.retrievers import QueryFusionRetriever

# Combine semantic + keyword search
vector_retriever = VectorStoreIndex.from_documents(documents).as_retriever()
keyword_retriever = KeywordTableIndex.from_documents(documents).as_retriever()

retriever = QueryFusionRetriever(
    [vector_retriever, keyword_retriever],
    similarity_top_k=5,
    num_queries=1,  # don't generate extra query variations
)
```
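QueryFusionRetriever can merge the two ranked lists with reciprocal rank fusion (its "reciprocal_rerank" mode). The scoring rule is simple to sketch: each document earns 1/(k + rank) from every list it appears in, and the sums are sorted; k=60 is the conventional constant from the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both lists float to the top even if neither list put them first, which is why fusion beats either retriever alone on mixed keyword/semantic queries.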
Stop guessing which query engine mode to use. RAG Debugger provides:
- 🔬 Mode comparison UI — Test compact vs refine vs tree_summarize side-by-side
- 📊 Node inspector — See exactly what each chunk contains
- 🎯 Response synthesizer tracer — Debug tree_summarize recursion
- 💾 Memory profiler — Track ChatMemoryBuffer token usage over time
FAQ
Which query engine mode should I use?
Start with "compact" (the library default) for direct questions; it auto-packs the context window. Use "tree_summarize" for "summarize these 50 documents" queries. Use "refine" only when accuracy matters more than speed.
How do I debug response synthesizer failures?
Enable verbose logging with set_global_handler("simple"), imported from llama_index.core. This prints every LLM call, showing exactly what prompt was sent and what response came back.
Why are my chunks breaking in the middle of code blocks?
Default SentenceSplitter is sentence-aware but not structure-aware. Use SemanticSplitterNodeParser or MarkdownNodeParser (for markdown files) to respect document structure.
How do I prevent chat engine context overflow?
Use ChatMemoryBuffer with token_limit=2000; it auto-evicts the oldest messages when the limit is exceeded. For long conversations, switch to ChatSummaryMemoryBuffer, which compresses older history into a running summary.