Introduction
RAG (Retrieval-Augmented Generation) evaluation is harder than standard LLM evaluation because you're evaluating two coupled systems: the retriever and the generator. Your retriever might find perfect chunks that the LLM ignores. Or your LLM might synthesize beautiful answers from garbage context. Traditional metrics like BLEU or ROUGE miss this entirely. This guide covers the metrics and workflows that actually predict production RAG quality.
The Four Core Metrics
RAG quality decomposes into four complementary dimensions. You need all four: optimizing one at the expense of another leads to broken systems.
1. Context Precision
Definition: What fraction of retrieved chunks are relevant to the query?
Context Precision = (Relevant Retrieved Chunks) / (Total Retrieved Chunks)
Why it matters: Low precision means you're polluting the LLM's context window with noise, wasting tokens and degrading answer quality.
How to measure: For each query, manually label retrieved chunks as relevant/irrelevant. Precision is the ratio.
Target: >0.8 for production systems. Below 0.6 indicates retrieval is broken.
2. Context Recall
Definition: What fraction of ground-truth relevant chunks were retrieved?
Context Recall = (Relevant Retrieved Chunks) / (All Relevant Chunks in KB)
Why it matters: Low recall means you're missing critical information, leading to incomplete or wrong answers.
How to measure: Requires a labeled test set with known relevant chunks per query.
Target: >0.9 for high-stakes applications (medical, legal). >0.7 is acceptable for general use.
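Given labeled chunk IDs, both metrics reduce to set arithmetic. A minimal sketch, where `retrieved_ids` and `relevant_ids` are hypothetical lists of chunk identifiers:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

# 5 chunks retrieved; 3 relevant chunks exist in the KB, 2 of them were found
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = ["c2", "c5", "c9"]
print(context_precision(retrieved, relevant))  # 0.4
print(context_recall(retrieved, relevant))     # 0.666...
```

Edge cases matter: an empty retrieval has precision 0, and a query with no relevant chunks in the KB trivially has recall 1.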
3. Faithfulness (Groundedness)
Definition: Is the generated answer supported by the retrieved context?
Faithfulness = (Statements in Answer Supported by Context) / (Total Statements)
Why it matters: RAG's whole point is grounding answers in your knowledge base. If the LLM hallucinates despite having context, RAG failed.
How to measure: Use an LLM-as-judge to check if each sentence in the answer can be verified from retrieved chunks.
Target: >0.95 for production. Anything below 0.9 indicates the LLM isn't using context properly.
4. Answer Relevancy
Definition: Does the generated answer actually address the user's query?
Answer Relevancy = cosine_similarity(embed(query), embed(answer))
Why it matters: High faithfulness + low relevancy = the LLM is faithfully regurgitating irrelevant context.
How to measure: Embed the question and answer, compute cosine similarity. Or use LLM-as-judge with a binary "does this answer the question?" prompt.
Target: >0.8 for production. Below 0.6 means the answer is off-topic.
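The embedding-based variant is a plain cosine similarity. A stdlib-only sketch; the toy 3-d vectors stand in for real embedding-model outputs:

```python
import math

def answer_relevancy(query_vec, answer_vec):
    """Cosine similarity between two embedding vectors."""
    dot = sum(q * a for q, a in zip(query_vec, answer_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_a = math.sqrt(sum(a * a for a in answer_vec))
    return dot / (norm_q * norm_a)

print(answer_relevancy([1.0, 0.0, 1.0], [1.0, 0.0, 0.0]))  # ≈ 0.707
```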
The Evaluation Matrix
These four metrics create a diagnostic matrix. Here's what each failure mode looks like:
| Precision | Recall | Faithfulness | Relevancy | Diagnosis |
|---|---|---|---|---|
| Low | High | High | Low | Retriever returns too many chunks, noise drowns signal |
| High | Low | High | Low | Missing critical chunks, embeddings are off |
| High | High | Low | Low | LLM ignores context, tune system prompt |
| High | High | High | Low | Query-context mismatch, wrong intent detection |
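The matrix can be encoded as a small triage helper. A sketch, assuming a single 0.7 cutoff separates "high" from "low" (tune per system):

```python
def diagnose(precision, recall, faithfulness, relevancy, cutoff=0.7):
    """Map the four metrics onto the failure modes in the matrix above."""
    def high(x):
        return x >= cutoff
    if not high(relevancy):
        if not high(precision) and high(recall):
            return "Retriever returns too many chunks; noise drowns signal"
        if high(precision) and not high(recall):
            return "Missing critical chunks; embeddings are off"
        if high(precision) and high(recall) and not high(faithfulness):
            return "LLM ignores context; tune the system prompt"
        if high(precision) and high(recall) and high(faithfulness):
            return "Query-context mismatch; wrong intent detection"
    return "No single failure mode matched; inspect examples manually"

print(diagnose(0.9, 0.9, 0.4, 0.5))  # LLM ignores context; tune the system prompt
```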
LLM-as-Judge Implementation
Manual labeling doesn't scale. Use an LLM to evaluate RAG outputs. Here's a production-grade faithfulness judge:
```python
import json

FAITHFULNESS_PROMPT = """
You are evaluating whether an AI-generated answer is faithful to the provided context.

Context:
{context}

Generated Answer:
{answer}

Task: For each statement in the answer, determine if it is:
1. SUPPORTED: Directly stated or clearly implied by the context
2. UNSUPPORTED: Not mentioned in the context (hallucination)
3. CONTRADICTED: Conflicts with information in the context

Output JSON:
{{
  "statements": [
    {{"text": "...", "verdict": "SUPPORTED|UNSUPPORTED|CONTRADICTED", "evidence": "..."}}
  ],
  "faithfulness_score": 0.0-1.0
}}
"""

def evaluate_faithfulness(context: str, answer: str) -> dict:
    # `llm` is any chat client that accepts a JSON response format
    response = llm.invoke(
        FAITHFULNESS_PROMPT.format(context=context, answer=answer),
        response_format={"type": "json_object"},
    )
    return json.loads(response.content)
```
Run this on 100-500 query samples to get a statistically valid estimate of production faithfulness.
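To turn per-sample judge scores into a defensible estimate, report the mean with a bootstrap confidence interval. A stdlib-only sketch; `scores` is a hypothetical list of judged faithfulness scores:

```python
import random
import statistics

def mean_with_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Mean faithfulness plus a bootstrap (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    mean = statistics.fmean(scores)
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean, (lo, hi)

# Stand-in for ~100 judged samples
scores = [0.95, 1.0, 0.9, 0.85, 1.0, 0.92] * 20
mean, (lo, hi) = mean_with_ci(scores)
print(f"faithfulness = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A wide interval is a signal to judge more samples before trusting the number.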
Retrieval Quality Deep Dive
RAG failures usually trace to bad retrieval. Here's how to diagnose:
Embedding Inspection
Manually inspect retrieved chunks for 20-30 test queries. Look for:
- Semantic drift: Chunks that match keywords but wrong intent
- Recency bias: Old docs outrank new docs (embeddings don't capture time)
- Chunk boundary failures: Relevant info split across chunks, neither scores high enough
Similarity Score Distribution
Plot the distribution of top-K similarity scores:
```python
import matplotlib.pyplot as plt

scores = [result['score'] for result in retrieval_results]
plt.hist(scores, bins=50)
plt.xlabel('Similarity Score')
plt.ylabel('Frequency')
plt.title('Retrieved Chunk Similarity Distribution')
plt.show()
```
Healthy distributions are bimodal: a cluster of high-scoring relevant chunks (>0.8) and a tail of low-scoring noise (<0.5). A uniform distribution indicates embeddings aren't discriminating.
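A quick numeric companion to the plot: measure how much mass falls between the two clusters. A sketch using the 0.8/0.5 cutoffs above; a fat middle band suggests the embeddings aren't discriminating:

```python
def discrimination_check(scores, high=0.8, low=0.5):
    """Fraction of scores in the high cluster, low tail, and middle band."""
    n = len(scores)
    frac_high = sum(s > high for s in scores) / n
    frac_low = sum(s < low for s in scores) / n
    frac_mid = 1.0 - frac_high - frac_low
    # More than half the mass in the middle band: scores aren't separating
    # relevant chunks from noise
    return {"high": frac_high, "low": frac_low, "mid": frac_mid,
            "suspect": frac_mid > 0.5}

print(discrimination_check([0.91, 0.88, 0.85, 0.42, 0.35, 0.30]))
```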
Reranking Impact
Measure how much reranking improves retrieval:
```python
baseline_precision = evaluate(retriever, top_k=10)
# reranked_retriever: the same retriever with a reranking stage appended
reranked_precision = evaluate(reranked_retriever, top_k=10)
improvement = (reranked_precision - baseline_precision) / baseline_precision
```
If reranking improves precision by <10%, either your embeddings are already good or your reranker is weak.
End-to-End Testing
Build a regression test suite with known-good query-answer pairs:
```python
TEST_CASES = [
    {
        "query": "What are the side effects of ibuprofen?",
        "expected_chunks": ["doc123", "doc456"],
        "expected_answer_contains": ["stomach upset", "bleeding risk"],
        "faithfulness_threshold": 0.95,
    },
    # ... 50-100 test cases
]
```
```python
def run_rag_tests():
    results = []
    for test in TEST_CASES:
        retrieved = retriever.search(test['query'], top_k=5)
        answer = generator.generate(test['query'], retrieved)

        # Check retrieval: compare retrieved chunk IDs against expected IDs
        retrieved_ids = {chunk['id'] for chunk in retrieved}
        recall = len(retrieved_ids & set(test['expected_chunks'])) / len(test['expected_chunks'])

        # Check answer: judge faithfulness against the concatenated chunk text
        context = "\n\n".join(chunk['text'] for chunk in retrieved)
        faithfulness = evaluate_faithfulness(context, answer)['faithfulness_score']
        contains_expected = all(kw in answer.lower() for kw in test['expected_answer_contains'])

        results.append({
            'query': test['query'],
            'recall': recall,
            'faithfulness': faithfulness,
            'contains_expected': contains_expected,
            'passed': recall > 0.7 and faithfulness > test['faithfulness_threshold'] and contains_expected,
        })
    return results
```
Run this on every deploy. If the pass rate drops below 90%, roll back.
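The deploy gate itself is a one-liner over the results list the suite returns. A sketch assuming each result dict carries the `passed` flag:

```python
def deploy_gate(results, min_pass_rate=0.9):
    """True if the regression suite's pass rate clears the deploy threshold."""
    if not results:
        return False
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate >= min_pass_rate

results = [{"passed": True}] * 19 + [{"passed": False}]
print(deploy_gate(results))  # True (0.95 pass rate)
```

Note the empty-suite guard: a pipeline bug that produces zero results should block the deploy, not wave it through.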
Debugging Failing Retrievals
When a query returns bad results:
1. Check Query Embedding
Find nearest neighbors to the query embedding in your vector DB:
```python
query_embedding = embed(query)
neighbors = vector_db.search(query_embedding, k=20)
print([chunk['text'] for chunk in neighbors])
```
If top results are irrelevant, your embedding model doesn't understand the query domain.
2. Inspect Retrieved Chunks
Look at the exact text and metadata of retrieved chunks:
```python
for chunk in retrieved_chunks:
    print(f"Score: {chunk['score']:.3f}")
    print(f"Text: {chunk['text'][:200]}...")
    print(f"Metadata: {chunk['metadata']}")
    print("---")
```
Check for chunking artifacts (truncated sentences, missing context).
3. Test Hypothetical Document Embeddings (HyDE)
Generate a hypothetical answer and embed that instead of the query:
```python
hypothetical_answer = llm.invoke(f"Generate a detailed answer to: {query}")
hyde_embedding = embed(hypothetical_answer)
hyde_results = vector_db.search(hyde_embedding, k=10)
```
Compare HyDE results to direct query results. If HyDE is much better, your queries are too short or poorly phrased.
For interactive debugging of RAG retrieval issues, tools like RAG Debugger provide real-time visualization of chunk retrieval, similarity scores, and LLM response analysis.
Production Monitoring
Track these metrics in production:
Per-Query Metrics (Logged)
- Number of chunks retrieved
- Top similarity score
- Average similarity score
- Retrieval latency (p50, p95, p99)
- Generation latency
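These per-query fields fit naturally into one structured log line. A stdlib-only sketch; the `logger` callable and the shape of `results` (dicts with a `score` key) are assumptions:

```python
import json
import time

def log_query_metrics(query, results, retrieval_ms, generation_ms, logger=print):
    """Emit one structured log line per query with the fields listed above."""
    scores = [r["score"] for r in results]
    record = {
        "ts": time.time(),
        "query": query,
        "num_chunks": len(results),
        "top_score": max(scores) if scores else None,
        "avg_score": sum(scores) / len(scores) if scores else None,
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
    }
    logger(json.dumps(record))
    return record

rec = log_query_metrics("ibuprofen side effects",
                        [{"score": 0.91}, {"score": 0.72}],
                        retrieval_ms=42.0, generation_ms=810.0)
```

Logging `None` rather than 0 for empty retrievals keeps the zero-result case distinguishable downstream.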
Aggregate Metrics (Dashboarded)
- Daily average top-1 similarity score (should be stable)
- Fraction of queries with 0 results (should be <5%)
- Faithfulness score on sampled queries (weekly batch eval)
Alerts
- Alert if average top-1 score drops >10% (embedding drift or index corruption)
- Alert if 0-result rate spikes (index unavailable or query preprocessing broken)
- Alert if p95 retrieval latency >500ms (performance degradation)
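The three alert rules can run as a single daily check over dashboard snapshots. A sketch; the snapshot field names are hypothetical:

```python
def check_alerts(today, baseline):
    """Evaluate the three alert conditions against daily metric snapshots."""
    alerts = []
    drop = (baseline["avg_top1"] - today["avg_top1"]) / baseline["avg_top1"]
    if drop > 0.10:
        alerts.append("avg top-1 score dropped >10%: embedding drift or index corruption")
    if today["zero_result_rate"] > 0.05:
        alerts.append("0-result rate above 5%: index unavailable or preprocessing broken")
    if today["p95_retrieval_ms"] > 500:
        alerts.append("p95 retrieval latency >500ms: performance degradation")
    return alerts

print(check_alerts(
    {"avg_top1": 0.70, "zero_result_rate": 0.02, "p95_retrieval_ms": 620},
    {"avg_top1": 0.82},
))
```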
Conclusion
RAG evaluation requires measuring both retrieval (precision, recall) and generation (faithfulness, relevancy). Use LLM-as-judge for automated evaluation at scale. Build regression test suites with known-good examples. Monitor production metrics and alert on deviations. And when debugging, always start with inspecting the retrieved chunks—garbage in, garbage out.