LlamaIndex RAG Debugging Guide

Troubleshoot node parsing, query engines, retrieval, and response synthesis

Published: March 13, 2026 · 10 min read

LlamaIndex provides powerful abstractions for RAG applications. But when your query engine returns poor results, debugging can be frustrating.

Is the issue with node parsing? Retriever configuration? Response synthesizer? Without understanding LlamaIndex's architecture, you're guessing.

This guide covers common LlamaIndex RAG issues with working fixes.

🔧 Visual RAG Debugging

Use rag-debugger.pages.dev to visualize LlamaIndex query outputs. Paste retrieved nodes and responses to identify failures. Free: 10 sessions/month.

LlamaIndex RAG Architecture

Understanding the data flow helps isolate issues:

Documents → NodeParser → Nodes → VectorStore
                                    ↓
Query → QueryEngine → Retriever → Nodes → ResponseSynthesizer → Response

Each stage can introduce problems. Let's debug them systematically.
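Before diving into each stage, it helps to see the flow as plain code. The sketch below is framework-free (all names are illustrative stand-ins, not LlamaIndex APIs); each function is a point where you can print intermediate state when a query goes wrong:

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for each pipeline stage (not LlamaIndex APIs),
# showing where to inspect intermediate state when a query goes wrong.

@dataclass
class Node:
    text: str
    score: float = 0.0
    metadata: dict = field(default_factory=dict)

def parse(document: str, chunk_size: int = 40) -> list:
    """NodeParser stage: split a document into fixed-size chunks."""
    return [Node(document[i:i + chunk_size]) for i in range(0, len(document), chunk_size)]

def retrieve(query: str, nodes: list, top_k: int = 2) -> list:
    """Retriever stage: score nodes by naive term overlap, keep top-k."""
    terms = set(query.lower().replace(".", " ").split())
    for n in nodes:
        words = set(n.text.lower().replace(".", " ").split())
        n.score = len(terms & words) / max(len(terms), 1)
    return sorted(nodes, key=lambda n: n.score, reverse=True)[:top_k]

def synthesize(query: str, nodes: list) -> str:
    """ResponseSynthesizer stage: stitch retrieved context into an answer stub."""
    context = " / ".join(n.text for n in nodes)
    return f"Q: {query} | context used: {context}"

doc = ("LlamaIndex parses documents into nodes. "
       "Nodes are embedded and stored. Queries retrieve nodes.")
nodes = parse(doc)
hits = retrieve("how are nodes stored", nodes)
print(f"{len(nodes)} nodes parsed, top score {hits[0].score:.2f}")
print(synthesize("how are nodes stored", hits))
```

If the top node's text doesn't answer the query here, the problem is upstream of synthesis; the same logic applies to the real pipeline below.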

Issue 1: Node Parsing Problems

Problem: Chunks are too small, too large, or split at wrong boundaries

Symptoms:

- Answers omit details that are clearly present in the source documents
- Retrieved nodes start or end mid-sentence or mid-code-block
- Related content is split across nodes that never get retrieved together

Debug:

from llama_index.core import Document
from llama_index.core.node_parser import SimpleNodeParser

# Create parser
parser = SimpleNodeParser.from_defaults(
    chunk_size=512,
    chunk_overlap=50,
    separator="\n"
)

# Parse and inspect
nodes = parser.get_nodes_from_documents(documents)

print(f"Total nodes: {len(nodes)}")
print(f"Avg chunk size: {sum(len(n.text) for n in nodes) / len(nodes):.0f} chars")

# Inspect first few nodes
for i, node in enumerate(nodes[:5]):
    print(f"\n--- Node {i+1} ---")
    print(f"Size: {len(node.text)} chars")
    print(f"Metadata: {node.metadata}")
    print(f"Content preview: {node.text[:200]}...")

Fixes:

# Fix 1: Use semantic chunking for better boundaries
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

semantic_parser = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=OpenAIEmbedding()
)

nodes = semantic_parser.get_nodes_from_documents(documents)

# Fix 2: Use a code-aware parser for source code files
from llama_index.core.node_parser import CodeSplitter

code_parser = CodeSplitter(
    language="python",  # or "javascript", "java", etc.
    chunk_lines=50,
    chunk_lines_overlap=10
)

# Fix 3: Use hierarchical parsing for long documents
from llama_index.core.node_parser import HierarchicalNodeParser

hierarchical_parser = HierarchicalNodeParser.from_chunk_sizes(
    chunk_sizes=[2048, 512, 128]  # Parent → child sizes
)

nodes = hierarchical_parser.get_nodes_from_documents(documents)
# Returns nodes at multiple levels for parent-child retrieval
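The merging logic behind parent-child retrieval is easy to see without the framework: retrieve small chunks, and when enough of one parent's children match, swap them for the parent to get fuller context. A minimal sketch (helper names are made up for illustration; LlamaIndex's AutoMergingRetriever is the real implementation):

```python
# Illustrative parent-child merging: if enough of a parent's child chunks
# are retrieved, collapse them into the parent node for fuller context.
# Names are invented for this sketch, not LlamaIndex APIs.

def merge_to_parents(retrieved_ids, parent_of, children_of, ratio=0.5):
    """Return retrieved ids with qualifying child groups collapsed into parents."""
    merged, used = [], set()
    for cid in retrieved_ids:
        if cid in used:
            continue
        parent = parent_of.get(cid)
        if parent is not None:
            siblings = children_of[parent]
            hit = [c for c in siblings if c in retrieved_ids]
            # Merge when the retrieved fraction of siblings reaches the threshold
            if len(hit) / len(siblings) >= ratio:
                merged.append(parent)
                used.update(hit)
                continue
        merged.append(cid)
        used.add(cid)
    return merged

# Parent P1 has children c1..c3; parent P2 has children c4..c5
parent_of = {"c1": "P1", "c2": "P1", "c3": "P1", "c4": "P2", "c5": "P2"}
children_of = {"P1": ["c1", "c2", "c3"], "P2": ["c4", "c5"]}

print(merge_to_parents(["c1", "c2", "c4"], parent_of, children_of))
```

Raising `ratio` makes merging stricter, trading context size for precision.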

Issue 2: Retriever Returns Poor Results

Problem: Top-K nodes are irrelevant or low quality

Debug:

from llama_index.core import VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor

# Create index and retriever
index = VectorStoreIndex(nodes)
retriever = index.as_retriever(similarity_top_k=10)

# as_retriever has no similarity_cutoff argument; filter scores afterwards
cutoff = SimilarityPostprocessor(similarity_cutoff=0.5)

# Test retrieval
query = "your test query"
retrieved = cutoff.postprocess_nodes(retriever.retrieve(query))

print(f"Retrieved {len(retrieved)} nodes\n")
for i, node in enumerate(retrieved):
    print(f"{i+1}. Score: {node.score:.3f}")
    print(f"   Text: {node.node.text[:150]}...")
    print(f"   Metadata: {node.node.metadata}\n")

# Check score distribution
scores = [n.score for n in retrieved]
print(f"Score stats: min={min(scores):.3f}, max={max(scores):.3f}, avg={sum(scores)/len(scores):.3f}")

Fixes:

# Fix 1: Adjust the similarity cutoff via a postprocessor
from llama_index.core.postprocessor import SimilarityPostprocessor

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.3)  # Lower for recall, higher for precision
    ]
)

# Fix 2: Use MMR for diversity
retriever = index.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=10,
    vector_store_kwargs={"mmr_threshold": 0.7}  # Near 1 favors relevance, near 0 favors diversity
)

# Fix 3: Add reranking
from llama_index.core.postprocessor import SentenceTransformerRerank

reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=5
)

query_engine = index.as_query_engine(
    node_postprocessors=[reranker],
    similarity_top_k=20  # Retrieve more, rerank to top 5
)

# Fix 4: Fuse BM25 and dense retrieval (hybrid search)
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever

bm25 = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)
vector = index.as_retriever(similarity_top_k=5)

fusion_retriever = QueryFusionRetriever(
    [bm25, vector],
    similarity_top_k=10,
    num_queries=1,  # Set > 1 to also generate query variations
    mode="relative_score",
    retriever_weights=[0.3, 0.7]  # BM25 + dense
)

Issue 3: Response Synthesis Problems

Problem: Answer ignores retrieved context or hallucinates

Debug:

from llama_index.core import get_response_synthesizer

# Check current synthesizer
synthesizer = get_response_synthesizer(llm=llm)

response = synthesizer.synthesize(
    query="your query",
    nodes=retrieved_nodes
)

print("Response:")
print(response)
print("\nSource nodes:")
for node in response.source_nodes:
    print(f"- Score: {node.score:.3f}: {node.node.text[:100]}...")

Fixes:

# Fix 1: Use refine mode for better grounding
from llama_index.core.response_synthesizers import ResponseMode

query_engine = index.as_query_engine(
    response_mode="refine",  # Iteratively refine with each node
    similarity_top_k=5
)

# Fix 2: Custom QA prompt for grounding
from llama_index.core import get_response_synthesizer
from llama_index.core.prompts import PromptTemplate

# Keep the template model-agnostic: special chat tokens like <|system|>
# would be passed through literally and pollute the prompt
grounded_template = PromptTemplate("""\
You are an assistant that answers questions based ONLY on the provided context.
- If the answer is not in the context, say "I don't have enough information."
- Cite sources using [Source X] notation.
- Quote exact passages when making factual claims.

Context:
{context_str}

Question: {query_str}
Answer: """)

synthesizer = get_response_synthesizer(
    llm=llm,
    text_qa_template=grounded_template,
    response_mode="compact"
)

# Fix 3: Require citations with CitationQueryEngine
from llama_index.core.query_engine import CitationQueryEngine

query_engine = CitationQueryEngine.from_defaults(
    index,
    similarity_top_k=5,
    citation_chunk_size=512  # Sources are re-chunked and numbered for [n] citations
)

# Fix 4: Lower temperature for factual queries
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-3.5-turbo",
    temperature=0,  # Deterministic
    max_tokens=1000
)

Issue 4: Query Engine Configuration

Problem: Wrong query engine type for use case

Query Engine Comparison:

Type                    Best For                         Config
VectorQueryEngine       Standard RAG                     index.as_query_engine()
RetrieverQueryEngine    Custom retrievers                RetrieverQueryEngine(retriever, synthesizer)
SubQuestionQueryEngine  Multi-hop reasoning              SubQuestionQueryEngine.from_defaults()
CitationQueryEngine     Verified answers with citations  CitationQueryEngine.from_defaults()
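The citation engines in the table rely on one simple trick: number each source chunk in the prompt so the model can emit [n] markers that map back to nodes. A framework-free sketch (prompt wording and helper names are illustrative):

```python
def build_citation_prompt(query, sources):
    """Number source chunks so the LLM can cite them as [1], [2], ..."""
    numbered = "\n".join(f"Source {i}: {text}" for i, text in enumerate(sources, 1))
    return (
        "Answer using only the numbered sources below and cite them as [n].\n\n"
        f"{numbered}\n\nQuestion: {query}\nAnswer:"
    )

def cited_sources(answer, n_sources):
    """Map [n] markers in the answer back to source indices."""
    return [i for i in range(1, n_sources + 1) if f"[{i}]" in answer]

prompt = build_citation_prompt(
    "What is the Pro plan price?",
    ["Pro plan costs $49/mo.", "Free plan exists."],
)
print(cited_sources("The Pro plan is $49/mo [1].", 2))
```

The reverse mapping is what makes answers verifiable: any response sentence without a [n] marker is a candidate hallucination.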

Example: Multi-Hop with Sub-Questions

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool

# Create query engine tools for each data source
product_engine = product_index.as_query_engine()
pricing_engine = pricing_index.as_query_engine()

tools = [
    QueryEngineTool.from_defaults(
        query_engine=product_engine,
        name="product_info",
        description="Product specifications and features"
    ),
    QueryEngineTool.from_defaults(
        query_engine=pricing_engine,
        name="pricing_info",
        description="Pricing tiers and costs"
    )
]

# Sub-question engine will decompose complex queries
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=tools,
    llm=llm,
    verbose=True
)

response = query_engine.query("What's the price of the Pro plan with advanced features?")
# Automatically asks: "What is the Pro plan?" → "What are advanced features?" → "What's the price?"

Issue 5: Metadata Filtering Not Working

Problem: Filters don't narrow results as expected

Debug:

from llama_index.core import VectorStoreIndex
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# Declare the metadata schema (consumed by auto-retrieval)
vector_store_info = VectorStoreInfo(
    metadata_info=[
        MetadataInfo(name="doc_type", type="str", description="Type of document"),
        MetadataInfo(name="version", type="str", description="Document version"),
        MetadataInfo(name="created_at", type="int", description="Unix timestamp"),
    ]
)

# Inspect node metadata
for node in nodes[:3]:
    print(f"Node metadata: {node.metadata}")

# Test filter
from llama_index.core.vector_stores import (
    FilterCondition,
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="doc_type", value="api_docs", operator=FilterOperator.EQ),
        MetadataFilter(key="version", value="2.0", operator=FilterOperator.GTE),
    ],
    condition=FilterCondition.AND
)

retriever = index.as_retriever(filters=filters)
results = retriever.retrieve("your query")
print(f"Filtered results: {len(results)}")

Fixes:

# Fix 1: Ensure metadata is indexed
from llama_index.core import StorageContext, VectorStoreIndex

# Re-index with metadata
for node in nodes:
    # Ensure metadata is serializable
    node.metadata = {
        k: str(v) if not isinstance(v, (str, int, float, bool)) else v
        for k, v in node.metadata.items()
    }

# Rebuild index
index = VectorStoreIndex(nodes)

# Fix 2: Combine a metadata filter with a similarity cutoff
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="doc_type", value="api_docs")]
)

# Similarity is not metadata; enforce score thresholds with a postprocessor
query_engine = index.as_query_engine(
    filters=filters,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)]
)

# Fix 3: Post-filter manually if the vector store doesn't support pre-filtering
retrieved = index.as_retriever(similarity_top_k=20).retrieve("your query")
filtered = [
    n for n in retrieved
    if n.node.metadata.get("doc_type") == "api_docs"
]

Debugging Utilities

Utility 1: Query Trace Logger

import json
from datetime import datetime

from llama_index.core.callbacks import CallbackManager, CBEventType, EventPayload
from llama_index.core.callbacks.base_handler import BaseCallbackHandler

class QueryTraceHandler(BaseCallbackHandler):
    def __init__(self, log_file: str = "query_traces.jsonl"):
        super().__init__(event_starts_to_ignore=[], event_ends_to_ignore=[])
        self.log_file = log_file
        self.current_query = None
        self.retrieved_nodes = []

    def on_event_start(self, event_type, payload=None, event_id="", parent_id="", **kwargs):
        if event_type == CBEventType.QUERY and payload:
            self.current_query = payload.get(EventPayload.QUERY_STR)
            self.retrieved_nodes = []
        return event_id

    def on_event_end(self, event_type, payload=None, event_id="", **kwargs):
        if event_type == CBEventType.RETRIEVE and payload:
            self.retrieved_nodes = [
                {"text": n.node.text[:200], "score": n.score, "metadata": n.node.metadata}
                for n in payload.get(EventPayload.NODES, [])
            ]
        elif event_type == CBEventType.QUERY and payload:
            response = payload.get(EventPayload.RESPONSE)
            trace = {
                "timestamp": datetime.now().isoformat(),
                "query": self.current_query,
                "retrieved_nodes": self.retrieved_nodes,
                "response": str(response),
                "source_count": len(getattr(response, "source_nodes", []) or [])
            }
            with open(self.log_file, "a") as f:
                f.write(json.dumps(trace, default=str) + "\n")

    # Required abstract methods; no per-trace state is needed here
    def start_trace(self, trace_id=None):
        pass

    def end_trace(self, trace_id=None, trace_map=None):
        pass

# Usage
trace_handler = QueryTraceHandler()
callback_manager = CallbackManager([trace_handler])

query_engine = index.as_query_engine(callback_manager=callback_manager)
response = query_engine.query("your query")

Utility 2: Response Quality Checker

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class ResponseQualityChecker:
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def check(self, query: str, response: str, source_nodes: list) -> dict:
        """Check response quality metrics"""

        # 1. Query-Response relevance
        query_response_sim = cosine_similarity(
            self.model.encode([query]),
            self.model.encode([response])
        )[0][0]

        # 2. Source coverage (does response use sources?)
        response_sentences = [s.strip() for s in response.split('.') if len(s) > 10]
        source_texts = [n.node.text for n in source_nodes]

        covered = 0
        for sent in response_sentences:
            max_sim = max(
                cosine_similarity(
                    self.model.encode([sent]),
                    self.model.encode(source_texts)
                )[0]
            )
            if max_sim > 0.5:
                covered += 1

        coverage_rate = covered / len(response_sentences) if response_sentences else 0

        # 3. Confidence (based on source scores)
        source_scores = [n.score or 0.0 for n in source_nodes] if source_nodes else [0.0]
        avg_confidence = float(np.mean(source_scores))

        return {
            "query_response_similarity": query_response_sim,
            "source_coverage": coverage_rate,
            "confidence": avg_confidence,
            "quality_score": (query_response_sim + coverage_rate + avg_confidence) / 3
        }

# Usage
checker = ResponseQualityChecker()
quality = checker.check(query, str(response), response.source_nodes)

print(f"Quality Score: {quality['quality_score']:.2f}")
print(f"  - Query-Response Sim: {quality['query_response_similarity']:.2f}")
print(f"  - Source Coverage: {quality['source_coverage']:.2f}")
print(f"  - Confidence: {quality['confidence']:.2f}")

if quality['quality_score'] < 0.5:
    print("⚠️  Low quality response - consider retrieving more/better sources")

🚀 Visual RAG Debugging for LlamaIndex

RAG Debugger works with LlamaIndex: paste retrieved nodes and responses to pinpoint retrieval and synthesis failures visually.

Try 10 free debug sessions → rag-debugger.pages.dev

Quick Reference: Configuration Guide

Recommended Starting Config:
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.postprocessor import SentenceTransformerRerank
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding()
Settings.chunk_size = 512
Settings.chunk_overlap = 50

# Index with reranking
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        SentenceTransformerRerank(model="BAAI/bge-reranker-base", top_n=5)
    ],
    response_mode="refine"
)

Conclusion

Debugging LlamaIndex RAG requires understanding each component:

  1. Node parsing: Use semantic or hierarchical parsers for better chunks
  2. Retrieval: Tune similarity cutoff, use hybrid or MMR for diversity
  3. Response synthesis: Use refine mode with grounded prompts
  4. Query engine selection: Match engine type to use case
  5. Metadata filtering: Ensure proper indexing and filter syntax

For faster debugging, try RAG Debugger — a visual tool that analyzes LlamaIndex traces and detects failures automatically. Start with 10 free sessions at rag-debugger.pages.dev.