LlamaIndex RAG Debugging Guide — Query Engines and Response Synthesis

LlamaIndex-specific RAG issues: query engine modes, response synthesizer failures, node parser bugs, and metadata extraction.

Overview

LlamaIndex has unique abstractions (nodes, query engines, response synthesizers) that introduce framework-specific bugs. This guide covers debugging techniques for LlamaIndex RAG pipelines.

Issue 1: Query Engine Mode Selection

Problem: LlamaIndex offers many response modes for its query engines (compact, refine, tree_summarize, accumulate, and more). The wrong mode causes slow responses or incorrect answers.

Diagnosis: Compare Modes Side-by-Side

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Test different modes with same query
query = "What is the company's refund policy?"

modes = ["compact", "refine", "tree_summarize", "accumulate"]
for mode in modes:
    engine = index.as_query_engine(response_mode=mode)
    response = engine.query(query)
    print(f"\n=== {mode.upper()} ===")
    print(f"Answer: {response.response}")
    print(f"Source nodes: {len(response.source_nodes)}")

Fix: Mode Selection Matrix

  • compact (the default): Packs as many retrieved chunks as fit into each LLM call; fast, good for direct answers (e.g., "What is X?")
  • refine: One LLM call per chunk, iteratively refining the answer (slow but accurate)
  • tree_summarize: Builds a bottom-up summary tree; best for summarizing long docs
  • accumulate: Separate answer per chunk, then combine
  • simple_summarize: Truncates all chunks to fit a single LLM call (fast but lossy)
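One pragmatic pattern is routing each query to a mode with simple heuristics before building the engine. A minimal sketch; the keyword rules are illustrative assumptions, not LlamaIndex API:

```python
def pick_response_mode(query: str) -> str:
    """Heuristic router: summarization-style queries get tree_summarize,
    detail-heavy queries get refine, everything else the compact default."""
    q = query.lower()
    if any(kw in q for kw in ("summarize", "summary", "overview")):
        return "tree_summarize"
    if any(kw in q for kw in ("compare", "step by step", "in detail")):
        return "refine"
    return "compact"
```

The result feeds straight into index.as_query_engine(response_mode=pick_response_mode(query)).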

Issue 2: Response Synthesizer Context Overflow

Problem: Retrieving 10 nodes × 512 tokens each yields 5120 tokens of context, but the LLM context window is 4096 tokens, so the request overflows and the response is truncated or errors out.

Diagnosis: Token Counting

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# ServiceContext is deprecated; configure the LLM globally via Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")  # 4096-token context window

query_engine = index.as_query_engine(similarity_top_k=10)

response = query_engine.query("What is the refund policy?")

# Estimate how much context was used (~1.3 tokens per word is a rough heuristic)
total_words = sum(len(node.get_content().split()) for node in response.source_nodes)
est_tokens = total_words * 1.3
print(f"Retrieved context: ~{est_tokens:.0f} tokens")
print(f"LLM limit: 4096 tokens")

if est_tokens > 3000:  # Leave room for prompt + output
    print("⚠️ Context overflow likely")
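The overflow check can be factored into a reusable helper. Pure Python, using the same ~1.3 tokens-per-word heuristic; the budget numbers are illustrative:

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from the whitespace word count."""
    return int(len(text.split()) * tokens_per_word)

def fits_context(chunks: list[str], context_window: int = 4096,
                 reserved: int = 1096) -> bool:
    """True if the chunks leave `reserved` tokens for the prompt and output."""
    used = sum(estimate_tokens(c) for c in chunks)
    return used <= context_window - reserved
```

Call it with the retrieved texts, e.g. fits_context([n.get_content() for n in response.source_nodes]).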

Fix: Adjust similarity_top_k or Use Compact Mode

query_engine = index.as_query_engine(
    similarity_top_k=5,  # Reduce from 10
    response_mode="compact"  # Packs as many nodes as fit into each LLM call
)

Issue 3: Node Parser Chunking Errors

Problem: Default SentenceSplitter breaks code blocks, tables, or lists across chunks.

Diagnosis: Inspect Nodes

from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

for i, node in enumerate(nodes[:5]):
    print(f"\n=== Node {i} ===")
    print(f"Text: {node.get_content()[:200]}...")
    print(f"Metadata: {node.metadata}")
    
    # Check for broken code blocks: an odd number of ``` fences means a split
    if node.get_content().count("```") % 2 != 0:
        print("⚠️ WARNING: Code block split across nodes")
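The per-node check generalizes to a one-pass audit over all chunk texts (pure Python; pass it e.g. [n.get_content() for n in nodes]):

```python
FENCE = "`" * 3  # a Markdown code fence

def find_broken_fences(chunks: list[str]) -> list[int]:
    """Return indices of chunks whose code fences don't pair up."""
    return [i for i, text in enumerate(chunks) if text.count(FENCE) % 2 != 0]
```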

Fix: Use Semantic Splitter

from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Splits where embedding distance between adjacent sentence groups spikes
parser = SemanticSplitterNodeParser(
    buffer_size=1,  # Number of sentences grouped when evaluating boundaries
    breakpoint_percentile_threshold=95,  # Split where distance exceeds the 95th percentile
    embed_model=OpenAIEmbedding()  # Required: embeds the sentence groups
)

nodes = parser.get_nodes_from_documents(documents)
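Conceptually the splitter embeds adjacent sentences, measures the distance between neighbors, and breaks where the distance lands in the top percentile. A toy sketch with hand-rolled vectors, not the LlamaIndex implementation:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def breakpoints(embeddings: list[list[float]], percentile: int = 95) -> list[int]:
    """Indices i where the distance between sentence i and i+1 is in the top percentile."""
    gaps = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    cutoff = sorted(gaps)[min(len(gaps) - 1, int(len(gaps) * percentile / 100))]
    return [i for i, g in enumerate(gaps) if g >= cutoff]
```

With four toy embeddings where the topic shifts between the second and third, the only breakpoint falls at that boundary.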

Issue 4: Metadata Extraction Failures

Problem: LlamaIndex auto-extracts metadata (summaries, keywords, questions) but fails on unstructured text.

Diagnosis: Check Extracted Metadata

from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    KeywordExtractor
)

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

extractors = [
    SummaryExtractor(summaries=["prev", "self"]),
    QuestionsAnsweredExtractor(questions=3),
    KeywordExtractor(keywords=10)
]

# Parse documents into nodes first, then run the extractors over them
pipeline = IngestionPipeline(transformations=[SentenceSplitter(), *extractors])
nodes = pipeline.run(documents=documents)

for node in nodes[:3]:
    print(f"\n=== Node Metadata ===")
    print(f"Summary: {node.metadata.get('section_summary')}")
    print(f"Questions: {node.metadata.get('questions_this_excerpt_can_answer')}")
    print(f"Keywords: {node.metadata.get('excerpt_keywords')}")
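Spot-checking three nodes by eye doesn't scale. The same keys can be audited across every node (pure Python over [n.metadata for n in nodes]; the key names match the extractors above):

```python
EXPECTED_KEYS = (
    "section_summary",
    "questions_this_excerpt_can_answer",
    "excerpt_keywords",
)

def missing_metadata(metadatas: list[dict], keys=EXPECTED_KEYS) -> list[int]:
    """Indices of nodes where any expected extractor output is missing or empty."""
    return [i for i, md in enumerate(metadatas) if any(not md.get(k) for k in keys)]
```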

Fix: Custom Metadata Extractors

from llama_index.core.extractors import BaseExtractor

class CustomExtractor(BaseExtractor):
    async def aextract(self, nodes):
        # BaseExtractor's abstract method is async `aextract`; it must return
        # one metadata dict per node, which the pipeline merges into node.metadata
        return [
            {"custom_field": extract_custom_info(node.get_content())}
            for node in nodes
        ]

pipeline = IngestionPipeline(
    transformations=[CustomExtractor(), SummaryExtractor()]
)

Issue 5: Chat Engine Memory Management

Problem: CondenseQuestionChatEngine stores entire conversation history, causing context overflow after 20+ turns.

Diagnosis: Track Conversation Length

from llama_index.core.chat_engine import CondenseQuestionChatEngine
from llama_index.core.memory import ChatMemoryBuffer

# Generous limit so growth is observable rather than masked by eviction
memory = ChatMemoryBuffer.from_defaults(token_limit=100_000)
chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    memory=memory
)

for i in range(50):
    response = chat_engine.chat(f"Question {i}")

    # Rough estimate: ~1.3 tokens per word
    history = memory.get_all()
    total_tokens = sum(len(msg.content.split()) * 1.3 for msg in history if msg.content)
    print(f"Turn {i}: {len(history)} messages, ~{total_tokens:.0f} tokens")
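Token-limited memory works by evicting the oldest messages; the policy is easy to simulate in plain Python (ChatMemoryBuffer itself counts real tokens with tiktoken, so this word-count version is only an approximation):

```python
def evict_to_limit(messages: list[str], token_limit: int) -> list[str]:
    """Drop the oldest messages until the rest fit the token budget
    (using the same ~1.3 tokens-per-word estimate as above)."""
    count = lambda m: int(len(m.split()) * 1.3)
    kept = list(messages)
    while kept and sum(count(m) for m in kept) > token_limit:
        kept.pop(0)
    return kept
```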

Fix: Use Token-Limited Memory

from llama_index.core.memory import ChatMemoryBuffer

# Evicts the oldest messages once the buffer exceeds token_limit;
# token counting uses the default tiktoken-based tokenizer
memory = ChatMemoryBuffer.from_defaults(token_limit=2000)

chat_engine = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    memory=memory
)

Issue 6: Retriever vs Query Engine Confusion

Problem: LlamaIndex has both retrievers (return nodes) and query engines (return synthesized response). Using wrong one causes type errors.

Diagnosis: Understand Abstractions

# Retriever: Returns nodes (documents)
retriever = index.as_retriever(similarity_top_k=5)
nodes = retriever.retrieve("What is the refund policy?")
print(type(nodes))  # list[NodeWithScore]

# Query Engine: Returns response (string + metadata)
query_engine = index.as_query_engine()
response = query_engine.query("What is the refund policy?")
print(type(response))  # Response object with .response, .source_nodes

Fix: Use Correct Abstraction

# For custom post-processing: use retriever
retriever = index.as_retriever()
nodes = retriever.retrieve(query)
filtered_nodes = [n for n in nodes if n.score > 0.7]
# Then manually synthesize response

# For end-to-end RAG: use query engine
query_engine = index.as_query_engine()
response = query_engine.query(query)
print(response.response)
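One pitfall with the n.score > 0.7 filter above: on a weak retrieval every node can fall below the threshold, leaving nothing to synthesize from. A guarded version, sketched over plain (text, score) pairs rather than NodeWithScore objects:

```python
def filter_by_score(scored_nodes: list[tuple[str, float]],
                    threshold: float = 0.7, min_keep: int = 1) -> list[tuple[str, float]]:
    """Keep nodes above the threshold, but never return fewer than min_keep
    (a hard threshold can otherwise silently drop every node)."""
    kept = [s for s in scored_nodes if s[1] > threshold]
    if len(kept) >= min_keep:
        return kept
    return sorted(scored_nodes, key=lambda s: s[1], reverse=True)[:min_keep]
```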

Issue 7: Vector Store Index vs Document Summary Index

Problem: Choosing wrong index type causes slow queries or irrelevant results.

When to Use Each Index

  • VectorStoreIndex: Embedding-based semantic search; the default choice for most QA
  • SummaryIndex: Iterates every node; good for "summarize all docs", expensive for targeted questions
  • TreeIndex: Hierarchical summaries; good for drill-down over long documents
  • KeywordTableIndex: Exact keyword matching; fast lookups, no semantic matching

Fix: Hybrid Index

from llama_index.core import VectorStoreIndex, KeywordTableIndex
from llama_index.core.retrievers import QueryFusionRetriever

# Combine semantic + keyword search
vector_retriever = VectorStoreIndex.from_documents(documents).as_retriever()
keyword_retriever = KeywordTableIndex.from_documents(documents).as_retriever()

retriever = QueryFusionRetriever(
    [vector_retriever, keyword_retriever],
    similarity_top_k=5,
    num_queries=1  # Don't generate extra query variations
)
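QueryFusionRetriever merges the two ranked lists; its default fusion strategy is reciprocal rank fusion, whose core is small enough to sketch in plain Python (k=60 is the conventional RRF constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; documents ranked highly in several lists score best."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```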

🛠️ RAG Debugger — LlamaIndex Native Integration

Stop guessing which query engine mode to use. RAG Debugger provides:

  • 🔬 Mode comparison UI — Test default vs compact vs refine side-by-side
  • 📊 Node inspector — See exactly what each chunk contains
  • 🎯 Response synthesizer tracer — Debug tree_summarize recursion
  • 💾 Memory profiler — Track ChatMemoryBuffer token usage over time

Try RAG Debugger Free →

FAQ

Which query engine mode should I use?

Start with the default "compact" mode for direct questions; it packs retrieved nodes into as few LLM calls as possible. Use "tree_summarize" for "summarize these 50 documents" queries. Use "refine" only when accuracy matters more than speed, since it makes one LLM call per chunk.

How do I debug response synthesizer failures?

Enable verbose logging with from llama_index.core import set_global_handler, then set_global_handler("simple"). This prints every LLM call, showing exactly what prompt was sent and what response came back.

Why are my chunks breaking in the middle of code blocks?

Default SentenceSplitter is sentence-aware but not structure-aware. Use SemanticSplitterNodeParser or MarkdownNodeParser (for markdown files) to respect document structure.

How do I prevent chat engine context overflow?

Use ChatMemoryBuffer with a token_limit (e.g., 2000); it auto-evicts the oldest messages when the limit is exceeded. For long conversations, switch to ChatSummaryMemoryBuffer, which compresses older history into a running summary.