Overview
LangChain RAG has framework-specific failure modes. This guide covers debugging techniques for RetrievalQA, ConversationalRetrievalChain, LCEL pipelines, and async operations.
Issue 1: RetrievalQA Black Box Debugging
Problem: RetrievalQA is a convenience wrapper that hides what's happening inside. When it fails, you don't know if retrieval, prompt construction, or LLM generation is broken.
Diagnosis: Add Verbose Logging
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True,  # Print intermediate steps
    callbacks=[StdOutCallbackHandler()]
)
result = qa.invoke({"query": "What is the refund policy?"})
print(result)
Fix: Use LCEL for Transparency
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
# Explicit pipeline instead of the RetrievalQA black box
def format_docs(docs):
    # Join retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
# Now you can inspect each stage independently
docs = retriever.get_relevant_documents("What is the refund policy?")
print(f"Retrieved {len(docs)} docs")
context = "\n\n".join(doc.page_content for doc in docs)
formatted_prompt = prompt.format(context=context, question="What is the refund policy?")
print(f"Prompt:\n{formatted_prompt}")
answer = chain.invoke("What is the refund policy?")
print(f"Answer: {answer}")
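The same decompose-and-inspect idea works outside LCEL too. A framework-free sketch (fake_retriever and fake_llm are stand-ins for illustration, not LangChain APIs) that records every intermediate value in one trace dict:

```python
def fake_retriever(query):
    # Stand-in for retriever.get_relevant_documents
    return [f"policy doc mentioning '{query}'"]

def fake_llm(prompt_text):
    # Stand-in for the LLM call
    return f"stub answer ({len(prompt_text)} prompt chars)"

def run_inspected(query):
    # Run the pipeline while capturing every intermediate value,
    # so any stage can be printed or asserted on when debugging
    trace = {}
    trace["docs"] = fake_retriever(query)
    trace["prompt"] = f"Answer based on context:\n{trace['docs']}\n\nQuestion: {query}"
    trace["answer"] = fake_llm(trace["prompt"])
    return trace

trace = run_inspected("refund policy")
for stage, value in trace.items():
    print(f"{stage}: {value}")
```

When a stage misbehaves in production, the trace dict tells you whether retrieval, prompt construction, or generation is at fault without re-running the whole chain.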
Issue 2: ConversationalRetrievalChain Memory Leaks
Problem: ConversationalRetrievalChain stores the entire conversation history in memory. After 50+ turns, the context window overflows or RAM usage explodes.
Diagnosis: Monitor Memory Growth
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
for i in range(100):
    result = qa.invoke({"question": f"Question {i}"})
    # Check memory size after each turn
    history = memory.load_memory_variables({})
    print(f"Turn {i}: {len(history['chat_history'])} messages in memory")
Fix: Use ConversationSummaryBufferMemory
from langchain.memory import ConversationSummaryBufferMemory
# Keep recent messages verbatim, summarize older ones past the token cap
# (plain ConversationSummaryMemory has no max_token_limit; the buffer variant does)
memory = ConversationSummaryBufferMemory(
    llm=llm,
    memory_key="chat_history",
    return_messages=True,
    max_token_limit=1000  # Summarize once history exceeds 1K tokens
)
qa = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
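To see why the token cap matters, here is a framework-free sketch of the trimming logic a summary-buffer memory applies: messages past the cap leave the verbatim buffer and would be folded into an LLM-written summary. The summarization step is stubbed out and token counts are approximated by word count, both assumptions for illustration only.

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    # Pop the oldest messages until the verbatim buffer fits the cap.
    # Stand-in for a summary-buffer memory's pruning step; real token
    # counting and LLM summarization are stubbed out.
    kept = list(messages)
    overflow = []
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        overflow.append(kept.pop(0))  # Oldest message leaves the buffer
    summary = f"[summary of {len(overflow)} older messages]" if overflow else ""
    return summary, kept

history = [f"turn {i}: " + "word " * 20 for i in range(100)]
summary, recent = trim_history(history, max_tokens=200)
print(summary)                 # Stand-in for the LLM-generated summary
print(len(recent), "recent messages kept verbatim")
```

Each message above counts as 22 "tokens", so a 200-token cap keeps the 9 most recent messages verbatim and folds the other 91 into the summary placeholder: bounded memory regardless of conversation length.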
Issue 3: LCEL Streaming Failures
Problem: LCEL chains with .stream() stream only the LLM output; the retrieval step doesn't stream at all. Users stare at a blank screen for 5+ seconds while retrieval runs.
Diagnosis: Check What's Actually Streaming
from langchain.schema.runnable import RunnablePassthrough
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)

import time

start = time.time()
for chunk in chain.stream("What is the refund policy?"):
    print(f"[{time.time() - start:.2f}s] {chunk}", end="", flush=True)
# Output:
# [5.2s] Based <-- 5 second delay before first token (retrieval + prompt)
# [5.3s] on the
# [5.4s] provided...
Fix: Async Retrieval + Streaming
import asyncio
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
async def stream_with_retrieval(query):
    # Await retrieval first and surface progress before tokens arrive
    docs = await retriever.aget_relevant_documents(query)
    print(f"✅ Retrieved {len(docs)} docs")
    # Stream the LLM response token by token
    formatted_prompt = prompt.format(context=docs, question=query)
    async for chunk in llm.astream(
        formatted_prompt,
        config={"callbacks": [StreamingStdOutCallbackHandler()]}
    ):
        yield chunk

async def main():
    async for chunk in stream_with_retrieval("What is the refund policy?"):
        pass  # Chunks are already printed by the callback

asyncio.run(main())
Issue 4: Async Retriever Deadlocks
Problem: Mixing sync and async code causes RuntimeError: This event loop is already running or silent hangs.
Diagnosis: Check Call Stack
# ❌ BROKEN: calling an async method from sync code
def broken_rag(query):
    docs = retriever.aget_relevant_documents(query)  # Returns a coroutine, not documents!
    # Fails downstream: docs is a coroutine, not a list
    return llm(format_prompt(docs, query))

# ❌ BROKEN: starting a nested event loop inside a running one
async def also_broken(query):
    loop = asyncio.get_event_loop()
    docs = loop.run_until_complete(retriever.aget_relevant_documents(query))
    # RuntimeError: This event loop is already running
Fix: Stay Fully Async or Fully Sync
# ✅ CORRECT: fully async
async def async_rag(query):
    docs = await retriever.aget_relevant_documents(query)
    answer = await llm.apredict(format_prompt(docs, query))
    return answer

# ✅ CORRECT: fully sync
def sync_rag(query):
    docs = retriever.get_relevant_documents(query)  # Sync version
    answer = llm.predict(format_prompt(docs, query))
    return answer
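When a sync entry point (a CLI, a plain WSGI handler) has to call the async path, hand the entire coroutine to asyncio.run() instead of awaiting pieces from sync code. A self-contained sketch, with stubs standing in for the LangChain retriever and LLM:

```python
import asyncio

async def fake_aget_relevant_documents(query):
    # Stub for an async retriever call; simulates async I/O
    await asyncio.sleep(0)
    return [f"doc about {query}"]

async def async_rag(query):
    # Fully async path: every async call is awaited
    docs = await fake_aget_relevant_documents(query)
    return f"answer from {len(docs)} docs"

def sync_entry_point(query):
    # Safe ONLY when no event loop is running in this thread.
    # asyncio.run creates a fresh loop, runs the coroutine, closes the loop.
    return asyncio.run(async_rag(query))

print(sync_entry_point("refund policy"))  # answer from 1 docs
```

Inside an async framework (FastAPI, aiohttp) there is already a running loop, so calling asyncio.run() there reintroduces the nested-loop error; just await async_rag() directly instead.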
Issue 5: Callback Handler Conflicts
Problem: Multiple callback handlers (LangSmith tracing + custom logging + streaming) step on each other, causing duplicate logs or missing traces.
Diagnosis: Run With a Single Explicit Handler
from langchain.callbacks import StdOutCallbackHandler

# Invoke with exactly one explicit handler and watch the output
result = chain.invoke(
    "test query",
    config={"callbacks": [StdOutCallbackHandler()]}
)
# If each step logs twice, another handler is attached elsewhere
# (passed to the chain's constructor, or a tracer enabled via env vars)
Fix: Use Explicit Callback Groups
from langchain.callbacks import StdOutCallbackHandler
from langchain.callbacks.tracers import LangChainTracer

# Keep handler groups separate instead of attaching everything globally
tracing_callbacks = [LangChainTracer()]        # Sends traces to LangSmith
logging_callbacks = [StdOutCallbackHandler()]  # Local debug logging

# Pass callbacks per invocation instead of at construction time
result = chain.invoke(
    "test query",
    config={"callbacks": tracing_callbacks}  # Only trace this call
)
Issue 6: Retriever Filter Syntax Variations
Problem: Different vector stores use different metadata filter syntax. Code breaks when switching from Pinecone to Qdrant.
Diagnosis: Check Vector Store Docs
# Pinecone syntax
pinecone_filter = {"tenant_id": {"$eq": "acme"}}
# Qdrant syntax
qdrant_filter = {"must": [{"key": "tenant_id", "match": {"value": "acme"}}]}
# Weaviate syntax (GraphQL)
weaviate_filter = {"path": ["tenant_id"], "operator": "Equal", "valueString": "acme"}
Fix: Abstraction Layer
def build_filter(tenant_id, vector_store_type):
    """Build a metadata filter in the target store's native syntax."""
    filters = {
        "pinecone": {"tenant_id": {"$eq": tenant_id}},
        "qdrant": {"must": [{"key": "tenant_id", "match": {"value": tenant_id}}]},
        "weaviate": {"path": ["tenant_id"], "operator": "Equal", "valueString": tenant_id},
    }
    if vector_store_type not in filters:
        raise ValueError(f"Unsupported vector store: {vector_store_type}")
    return filters[vector_store_type]

retriever = vectorstore.as_retriever(
    search_kwargs={"filter": build_filter("acme", "pinecone")}
)
Issue 7: Document Loader Encoding Failures
Problem: PyPDFLoader returns garbled text for scanned PDFs, and TextLoader crashes on non-UTF-8 files.
Diagnosis: Inspect Loaded Documents
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("scanned.pdf")
docs = loader.load()
for doc in docs[:3]:
    print(f"Page content: {doc.page_content[:200]}")
# If you see ����� or empty strings, the PDF has no usable text layer
# (PyPDFLoader extracts existing text; it does not run OCR)
Fix: Use OCR-Aware Loaders
from langchain.document_loaders import UnstructuredPDFLoader
# Runs Tesseract OCR on scanned pages (requires the unstructured
# package and a local Tesseract install)
loader = UnstructuredPDFLoader(
    "scanned.pdf",
    mode="elements",      # Preserve structure (headings, tables)
    strategy="ocr_only"   # Force OCR even if a text layer exists
)
docs = loader.load()
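For the TextLoader side of the problem, recent LangChain versions accept TextLoader(path, autodetect_encoding=True), which is worth verifying against your installed version. If that flag is unavailable, a plain fallback loop over candidate encodings does the job. A self-contained sketch; the candidate list is an assumption to tune for your corpus:

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "utf-16", "cp1252", "latin-1")):
    # Try each candidate encoding until one decodes cleanly
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            continue
    raise ValueError(f"None of {encodings} decoded {path!r}")

# Demo: a file written in cp1252, as legacy Windows exports often are
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(path, "wb") as f:
    f.write("Résumé café naïve".encode("cp1252"))

text, used = read_with_fallback(path)
print(used, "->", text)  # cp1252 -> Résumé café naïve
```

Note the ordering: try strict encodings first, since latin-1 accepts any byte sequence and would silently "succeed" on files it mangles.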
Debugging LangChain RAG shouldn't require reading framework source code. RAG Debugger supports:
- 🔗 LangChain callback hook — Auto-capture all chain executions
- 🎯 LCEL visualization — See your chain DAG in real-time
- 🔍 Memory inspection — Track ConversationBufferMemory growth
- ⚡ Async tracer — Debug retriever deadlocks with call stack viz
FAQ
Should I use RetrievalQA or build my own LCEL chain?
Use LCEL for production. RetrievalQA is great for demos but too opaque for debugging. LCEL gives you full control over prompt construction and retrieval parameters.
How do I debug ConversationalRetrievalChain context overflow?
Switch from ConversationBufferMemory (stores everything) to ConversationSummaryBufferMemory (summarizes messages past a token cap) or ConversationTokenBufferMemory (hard token limit that drops the oldest messages).
Why does my LCEL chain freeze with async retrievers?
You're likely mixing sync and async code. Either go fully async (await retriever.aget_relevant_documents()) or fully sync (retriever.get_relevant_documents()). Never call async from sync.
How do I trace LangChain chains in production?
Use LangSmith for built-in tracing, or add custom callbacks that log to your observability stack (Datadog, Honeycomb). Avoid print statements in prod.