Overview
LangChain RAG has framework-specific failure modes. This guide covers debugging techniques for RetrievalQA, ConversationalRetrievalChain, LCEL pipelines, and async operations.
Issue 1: RetrievalQA Black Box Debugging
Problem: RetrievalQA is a convenience wrapper that hides what's happening inside. When it fails, you don't know if retrieval, prompt construction, or LLM generation is broken.
Diagnosis: Add Verbose Logging
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    verbose=True,  # Print intermediate steps
    callbacks=[StdOutCallbackHandler()]
)
result = qa.invoke({"query": "What is the refund policy?"})
print(result)
Fix: Use LCEL for Transparency
from langchain.schema.runnable import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate
# Explicit pipeline instead of the RetrievalQA black box
def format_docs(docs):
    # Join retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(
    "Answer based on context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
# Now you can inspect each stage independently
docs = retriever.get_relevant_documents("What is the refund policy?")
print(f"Retrieved {len(docs)} docs")
context = "\n\n".join(doc.page_content for doc in docs)
formatted_prompt = prompt.format(context=context, question="What is the refund policy?")
print(f"Prompt:\n{formatted_prompt}")
answer = chain.invoke("What is the refund policy?")
print(f"Answer: {answer}")
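The same decompose-and-inspect idea works outside LCEL too. A framework-free sketch (fake_retriever and fake_llm are stand-ins for illustration, not LangChain APIs) that records every intermediate value in one trace dict:

```python
def fake_retriever(query):
    # Stand-in for retriever.get_relevant_documents
    return [f"policy doc mentioning '{query}'"]

def fake_llm(prompt_text):
    # Stand-in for the LLM call
    return f"stub answer ({len(prompt_text)} prompt chars)"

def run_inspected(query):
    # Run the pipeline while capturing every intermediate value,
    # so any stage can be printed or asserted on when debugging
    trace = {}
    trace["docs"] = fake_retriever(query)
    trace["prompt"] = f"Answer based on context:\n{trace['docs']}\n\nQuestion: {query}"
    trace["answer"] = fake_llm(trace["prompt"])
    return trace

trace = run_inspected("refund policy")
for stage, value in trace.items():
    print(f"{stage}: {value}")
```

When a stage misbehaves in production, the trace dict tells you whether retrieval, prompt construction, or generation is at fault without re-running the whole chain.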
Issue 2: ConversationalRetrievalChain Memory Leaks
Problem: ConversationalRetrievalChain stores the entire conversation history in memory. After 50+ turns, the context window overflows or RAM usage explodes.
Diagnosis: Monitor Memory Growth
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
qa = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
for i in range(100):
    result = qa.invoke({"question": f"Question {i}"})
    # Check memory size after each turn
    history = memory.load_memory_variables({})
    print(f"Turn {i}: {len(history['chat_history'])} messages in memory")
Fix: Use ConversationSummaryBufferMemory
from langchain.memory import ConversationSummaryBufferMemory
# Keep recent messages verbatim, summarize older ones past the token cap
# (plain ConversationSummaryMemory has no max_token_limit; the buffer variant does)
memory = ConversationSummaryBufferMemory(
    llm=llm,
    memory_key="chat_history",
    return_messages=True,
    max_token_limit=1000  # Summarize once history exceeds 1K tokens
)
qa = ConversationalRetrievalChain.from_llm(llm, retriever, memory=memory)
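To see why the token cap matters, here is a framework-free sketch of the trimming logic a summary-buffer memory applies: messages past the cap leave the verbatim buffer and would be folded into an LLM-written summary. The summarization step is stubbed out and token counts are approximated by word count, both assumptions for illustration only.

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    # Pop the oldest messages until the verbatim buffer fits the cap.
    # Stand-in for a summary-buffer memory's pruning step; real token
    # counting and LLM summarization are stubbed out.
    kept = list(messages)
    overflow = []
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        overflow.append(kept.pop(0))  # Oldest message leaves the buffer
    summary = f"[summary of {len(overflow)} older messages]" if overflow else ""
    return summary, kept

history = [f"turn {i}: " + "word " * 20 for i in range(100)]
summary, recent = trim_history(history, max_tokens=200)
print(summary)                 # Stand-in for the LLM-generated summary
print(len(recent), "recent messages kept verbatim")
```

Each message above counts as 22 "tokens", so a 200-token cap keeps the 9 most recent messages verbatim and folds the other 91 into the summary placeholder: bounded memory regardless of conversation length.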
Issue 3: LCEL Streaming Failures
Problem: LCEL chains with .stream() stream only the LLM output; the retrieval step doesn't stream at all. Users stare at a blank screen for 5+ seconds while retrieval runs.
Diagnosis: Check What's Actually Streaming
from langchain.schema.runnable import RunnablePassthrough
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)

import time

start = time.time()
for chunk in chain.stream("What is the refund policy?"):
    print(f"[{time.time() - start:.2f}s] {chunk}", end="", flush=True)
# Output:
# [5.2s] Based <-- 5 second delay before first token (retrieval + prompt)
# [5.3s] on the
# [5.4s] provided...
Fix: Async Retrieval + Streaming
import asyncio
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
async def stream_with_retrieval(query):
    # Await retrieval first and surface progress before tokens arrive
    docs = await retriever.aget_relevant_documents(query)
    print(f"✅ Retrieved {len(docs)} docs")
    # Stream the LLM response token by token
    formatted_prompt = prompt.format(context=docs, question=query)
    async for chunk in llm.astream(
        formatted_prompt,
        config={"callbacks": [StreamingStdOutCallbackHandler()]}
    ):
        yield chunk

async def main():
    async for chunk in stream_with_retrieval("What is the refund policy?"):
        pass  # Chunks are already printed by the callback

asyncio.run(main())
Issue 4: Async Retriever Deadlocks
Problem: Mixing sync and async code causes RuntimeError: This event loop is already running or silent hangs.
Diagnosis: Check Call Stack
# ❌ BROKEN: calling an async method from sync code
def broken_rag(query):
    docs = retriever.aget_relevant_documents(query)  # Returns a coroutine, not documents!
    # Fails downstream: docs is a coroutine, not a list
    return llm(format_prompt(docs, query))

# ❌ BROKEN: starting a nested event loop inside a running one
async def also_broken(query):
    loop = asyncio.get_event_loop()
    docs = loop.run_until_complete(retriever.aget_relevant_documents(query))
    # RuntimeError: This event loop is already running
Fix: Stay Fully Async or Fully Sync
# ✅ CORRECT: fully async
async def async_rag(query):
    docs = await retriever.aget_relevant_documents(query)
    answer = await llm.apredict(format_prompt(docs, query))
    return answer

# ✅ CORRECT: fully sync
def sync_rag(query):
    docs = retriever.get_relevant_documents(query)  # Sync version
    answer = llm.predict(format_prompt(docs, query))
    return answer
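When a sync entry point (a CLI, a plain WSGI handler) has to call the async path, hand the entire coroutine to asyncio.run() instead of awaiting pieces from sync code. A self-contained sketch, with stubs standing in for the LangChain retriever and LLM:

```python
import asyncio

async def fake_aget_relevant_documents(query):
    # Stub for an async retriever call; simulates async I/O
    await asyncio.sleep(0)
    return [f"doc about {query}"]

async def async_rag(query):
    # Fully async path: every async call is awaited
    docs = await fake_aget_relevant_documents(query)
    return f"answer from {len(docs)} docs"

def sync_entry_point(query):
    # Safe ONLY when no event loop is running in this thread.
    # asyncio.run creates a fresh loop, runs the coroutine, closes the loop.
    return asyncio.run(async_rag(query))

print(sync_entry_point("refund policy"))  # answer from 1 docs
```

Inside an async framework (FastAPI, aiohttp) there is already a running loop, so calling asyncio.run() there reintroduces the nested-loop error; just await async_rag() directly instead.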
Issue 5: Callback Handler Conflicts
Problem: Multiple callback handlers (LangSmith tracing + custom logging + streaming) step on each other, causing duplicate logs or missing traces.
Diagnosis: Run With a Single Explicit Handler
from langchain.callbacks import StdOutCallbackHandler

# Invoke with exactly one explicit handler and watch the output
result = chain.invoke(
    "test query",
    config={"callbacks": [StdOutCallbackHandler()]}
)
# If each step logs twice, another handler is attached elsewhere
# (passed to the chain's constructor, or a tracer enabled via env vars)
Fix: Use Explicit Callback Groups
from langchain.callbacks import StdOutCallbackHandler
from langchain.callbacks.tracers import LangChainTracer

# Keep handler groups separate instead of attaching everything globally
tracing_callbacks = [LangChainTracer()]        # Sends traces to LangSmith
logging_callbacks = [StdOutCallbackHandler()]  # Local debug logging

# Pass callbacks per invocation instead of at construction time
result = chain.invoke(
    "test query",
    config={"callbacks": tracing_callbacks}  # Only trace this call
)
Issue 6: Retriever Filter Syntax Variations
Problem: Different vector stores use different metadata filter syntax. Code breaks when switching from Pinecone to Qdrant.
Diagnosis: Check Vector Store Docs
# Pinecone syntax
pinecone_filter = {"tenant_id": {"$eq": "acme"}}
# Qdrant syntax
qdrant_filter = {"must": [{"key": "tenant_id", "match": {"value": "acme"}}]}
# Weaviate syntax (GraphQL)
weaviate_filter = {"path": ["tenant_id"], "operator": "Equal", "valueString": "acme"}
Fix: Abstraction Layer
def build_filter(tenant_id, vector_store_type):
    """Build a metadata filter in the target store's native syntax."""
    filters = {
        "pinecone": {"tenant_id": {"$eq": tenant_id}},
        "qdrant": {"must": [{"key": "tenant_id", "match": {"value": tenant_id}}]},
        "weaviate": {"path": ["tenant_id"], "operator": "Equal", "valueString": tenant_id},
    }
    if vector_store_type not in filters:
        raise ValueError(f"Unsupported vector store: {vector_store_type}")
    return filters[vector_store_type]

retriever = vectorstore.as_retriever(
    search_kwargs={"filter": build_filter("acme", "pinecone")}
)
Issue 7: Document Loader Encoding Failures
Problem: PyPDFLoader returns garbled text for scanned PDFs, and TextLoader crashes on non-UTF-8 files.
Diagnosis: Inspect Loaded Documents
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("scanned.pdf")
docs = loader.load()
for doc in docs[:3]:
    print(f"Page content: {doc.page_content[:200]}")
# If you see ����� or empty strings, the PDF has no usable text layer
# (PyPDFLoader extracts existing text; it does not run OCR)
Fix: Use OCR-Aware Loaders
from langchain.document_loaders import UnstructuredPDFLoader
# Runs Tesseract OCR on scanned pages (requires the unstructured
# package and a local Tesseract install)
loader = UnstructuredPDFLoader(
    "scanned.pdf",
    mode="elements",      # Preserve structure (headings, tables)
    strategy="ocr_only"   # Force OCR even if a text layer exists
)
docs = loader.load()
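For the TextLoader side of the problem, recent LangChain versions accept TextLoader(path, autodetect_encoding=True), which is worth verifying against your installed version. If that flag is unavailable, a plain fallback loop over candidate encodings does the job. A self-contained sketch; the candidate list is an assumption to tune for your corpus:

```python
import os
import tempfile

def read_with_fallback(path, encodings=("utf-8", "utf-16", "cp1252", "latin-1")):
    # Try each candidate encoding until one decodes cleanly
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeError:
            continue
    raise ValueError(f"None of {encodings} decoded {path!r}")

# Demo: a file written in cp1252, as legacy Windows exports often are
path = os.path.join(tempfile.mkdtemp(), "legacy.txt")
with open(path, "wb") as f:
    f.write("Résumé café naïve".encode("cp1252"))

text, used = read_with_fallback(path)
print(used, "->", text)  # cp1252 -> Résumé café naïve
```

Note the ordering: try strict encodings first, since latin-1 accepts any byte sequence and would silently "succeed" on files it mangles.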
Debugging LangChain RAG shouldn't require reading framework source code. RAG Debugger supports:
- 🔗 LangChain callback hook — Auto-capture all chain executions
- 🎯 LCEL visualization — See your chain DAG in real-time
- 🔍 Memory inspection — Track ConversationBufferMemory growth
- ⚡ Async tracer — Debug retriever deadlocks with call stack viz
FAQ
Should I use RetrievalQA or build my own LCEL chain?
Use LCEL for production. RetrievalQA is great for demos but too opaque for debugging. LCEL gives you full control over prompt construction and retrieval parameters.
How do I debug ConversationalRetrievalChain context overflow?
Switch from ConversationBufferMemory (stores everything) to ConversationSummaryBufferMemory (summarizes messages past a token cap) or ConversationTokenBufferMemory (hard token limit that drops the oldest messages).
Why does my LCEL chain freeze with async retrievers?
You're likely mixing sync and async code. Either go fully async (await retriever.aget_relevant_documents()) or fully sync (retriever.get_relevant_documents()). Never call async from sync.
How do I trace LangChain chains in production?
Use LangSmith for built-in tracing, or add custom callbacks that log to your observability stack (Datadog, Honeycomb). Avoid print statements in prod.