Introduction
RAG (Retrieval-Augmented Generation) evaluation is harder than standard LLM evaluation because you're evaluating two coupled systems: the retriever and the generator. Your retriever might find perfect chunks that the LLM ignores. Or your LLM might synthesize beautiful answers from garbage context. Traditional metrics like BLEU or ROUGE miss this entirely. This guide covers the metrics and workflows that actually predict production RAG quality.
The Four Core Metrics
RAG quality decomposes into four complementary dimensions. You need all four: optimizing one at the expense of another leads to broken systems.
1. Context Precision
Definition: What fraction of retrieved chunks are relevant to the query?
Context Precision = (Relevant Retrieved Chunks) / (Total Retrieved Chunks)
Why it matters: Low precision means you're polluting the LLM's context window with noise, wasting tokens and degrading answer quality.
How to measure: For each query, manually label retrieved chunks as relevant/irrelevant. Precision is the ratio.
Target: >0.8 for production systems. Below 0.6 indicates retrieval is broken.
2. Context Recall
Definition: What fraction of ground-truth relevant chunks were retrieved?
Context Recall = (Relevant Retrieved Chunks) / (All Relevant Chunks in KB)
Why it matters: Low recall means you're missing critical information, leading to incomplete or wrong answers.
How to measure: Requires a labeled test set with known relevant chunks per query.
Target: >0.9 for high-stakes applications (medical, legal). >0.7 is acceptable for general use.
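Given labeled chunk IDs, both metrics reduce to set arithmetic. A minimal sketch, where `retrieved_ids` and `relevant_ids` are hypothetical lists of chunk identifiers:

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    """Fraction of ground-truth relevant chunks that were retrieved."""
    if not relevant_ids:
        return 1.0
    return len(set(retrieved_ids) & set(relevant_ids)) / len(relevant_ids)

# 5 chunks retrieved; 3 relevant chunks exist in the KB, 2 of them were found
retrieved = ["c1", "c2", "c3", "c4", "c5"]
relevant = ["c2", "c5", "c9"]
print(context_precision(retrieved, relevant))  # 0.4
print(context_recall(retrieved, relevant))     # 0.666...
```

Edge cases matter: an empty retrieval has precision 0, and a query with no relevant chunks in the KB trivially has recall 1.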
3. Faithfulness (Groundedness)
Definition: Is the generated answer supported by the retrieved context?
Faithfulness = (Statements in Answer Supported by Context) / (Total Statements)
Why it matters: RAG's whole point is grounding answers in your knowledge base. If the LLM hallucinates despite having context, RAG failed.
How to measure: Use an LLM-as-judge to check if each sentence in the answer can be verified from retrieved chunks.
Target: >0.95 for production. Anything below 0.9 indicates the LLM isn't using context properly.
4. Answer Relevancy
Definition: Does the generated answer actually address the user's query?
Answer Relevancy = cosine_similarity(embed(query), embed(answer))
Why it matters: High faithfulness + low relevancy = the LLM is faithfully regurgitating irrelevant context.
How to measure: Embed the question and answer, compute cosine similarity. Or use LLM-as-judge with a binary "does this answer the question?" prompt.
Target: >0.8 for production. Below 0.6 means the answer is off-topic.
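The embedding-based variant is a plain cosine similarity. A stdlib-only sketch; the toy 3-d vectors stand in for real embedding-model outputs:

```python
import math

def answer_relevancy(query_vec, answer_vec):
    """Cosine similarity between two embedding vectors."""
    dot = sum(q * a for q, a in zip(query_vec, answer_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_a = math.sqrt(sum(a * a for a in answer_vec))
    return dot / (norm_q * norm_a)

print(answer_relevancy([1.0, 0.0, 1.0], [1.0, 0.0, 0.0]))  # ≈ 0.707
```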
The Evaluation Matrix
These four metrics create a diagnostic matrix. Here's what each failure mode looks like:
| Precision | Recall | Faithfulness | Relevancy | Diagnosis |
|---|---|---|---|---|
| Low | High | High | Low | Retriever returns too many chunks, noise drowns signal |
| High | Low | High | Low | Missing critical chunks, embeddings are off |
| High | High | Low | Low | LLM ignores context, tune system prompt |
| High | High | High | Low | Query-context mismatch, wrong intent detection |
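The matrix can be encoded as a small triage helper. A sketch, assuming a single 0.7 cutoff separates "high" from "low" (tune per system):

```python
def diagnose(precision, recall, faithfulness, relevancy, cutoff=0.7):
    """Map the four metrics onto the failure modes in the matrix above."""
    def high(x):
        return x >= cutoff
    if not high(relevancy):
        if not high(precision) and high(recall):
            return "Retriever returns too many chunks; noise drowns signal"
        if high(precision) and not high(recall):
            return "Missing critical chunks; embeddings are off"
        if high(precision) and high(recall) and not high(faithfulness):
            return "LLM ignores context; tune the system prompt"
        if high(precision) and high(recall) and high(faithfulness):
            return "Query-context mismatch; wrong intent detection"
    return "No single failure mode matched; inspect examples manually"

print(diagnose(0.9, 0.9, 0.4, 0.5))  # LLM ignores context; tune the system prompt
```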
LLM-as-Judge Implementation
Manual labeling doesn't scale. Use an LLM to evaluate RAG outputs. Here's a production-grade faithfulness judge:
```python
import json

FAITHFULNESS_PROMPT = """
You are evaluating whether an AI-generated answer is faithful to the provided context.

Context:
{context}

Generated Answer:
{answer}

Task: For each statement in the answer, determine if it is:
1. SUPPORTED: Directly stated or clearly implied by the context
2. UNSUPPORTED: Not mentioned in the context (hallucination)
3. CONTRADICTED: Conflicts with information in the context

Output JSON:
{{
  "statements": [
    {{"text": "...", "verdict": "SUPPORTED|UNSUPPORTED|CONTRADICTED", "evidence": "..."}}
  ],
  "faithfulness_score": 0.0-1.0
}}
"""

def evaluate_faithfulness(context: str, answer: str) -> dict:
    # `llm` is any chat client that accepts a JSON response format
    response = llm.invoke(
        FAITHFULNESS_PROMPT.format(context=context, answer=answer),
        response_format={"type": "json_object"},
    )
    return json.loads(response.content)
```
Run this on 100-500 query samples to get a statistically valid estimate of production faithfulness.
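To turn per-sample judge scores into a defensible estimate, report the mean with a bootstrap confidence interval. A stdlib-only sketch; `scores` is a hypothetical list of judged faithfulness scores:

```python
import random
import statistics

def mean_with_ci(scores, n_boot=1000, alpha=0.05, seed=0):
    """Mean faithfulness plus a bootstrap (1 - alpha) confidence interval."""
    rng = random.Random(seed)
    mean = statistics.fmean(scores)
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean, (lo, hi)

# Stand-in for ~100 judged samples
scores = [0.95, 1.0, 0.9, 0.85, 1.0, 0.92] * 20
mean, (lo, hi) = mean_with_ci(scores)
print(f"faithfulness = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A wide interval is a signal to judge more samples before trusting the number.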
Retrieval Quality Deep Dive
RAG failures usually trace to bad retrieval. Here's how to diagnose:
Embedding Inspection
Manually inspect retrieved chunks for 20-30 test queries. Look for:
- Semantic drift: Chunks that match keywords but wrong intent
- Recency bias: Old docs outrank new docs (embeddings don't capture time)
- Chunk boundary failures: Relevant info split across chunks, neither scores high enough
Similarity Score Distribution
Plot the distribution of top-K similarity scores:
```python
import matplotlib.pyplot as plt

scores = [result['score'] for result in retrieval_results]
plt.hist(scores, bins=50)
plt.xlabel('Similarity Score')
plt.ylabel('Frequency')
plt.title('Retrieved Chunk Similarity Distribution')
plt.show()
```
Healthy distributions are bimodal: a cluster of high-scoring relevant chunks (>0.8) and a tail of low-scoring noise (<0.5). A uniform distribution indicates embeddings aren't discriminating.
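A quick numeric companion to the plot: measure how much mass falls between the two clusters. A sketch using the 0.8/0.5 cutoffs above; a fat middle band suggests the embeddings aren't discriminating:

```python
def discrimination_check(scores, high=0.8, low=0.5):
    """Fraction of scores in the high cluster, low tail, and middle band."""
    n = len(scores)
    frac_high = sum(s > high for s in scores) / n
    frac_low = sum(s < low for s in scores) / n
    frac_mid = 1.0 - frac_high - frac_low
    # More than half the mass in the middle band: scores aren't separating
    # relevant chunks from noise
    return {"high": frac_high, "low": frac_low, "mid": frac_mid,
            "suspect": frac_mid > 0.5}

print(discrimination_check([0.91, 0.88, 0.85, 0.42, 0.35, 0.30]))
```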
Reranking Impact
Measure how much reranking improves retrieval:
```python
baseline_precision = evaluate(retriever, top_k=10)
# reranked_retriever: the same retriever with a reranking stage appended
reranked_precision = evaluate(reranked_retriever, top_k=10)
improvement = (reranked_precision - baseline_precision) / baseline_precision
```
If reranking improves precision by <10%, either your embeddings are already good or your reranker is weak.
End-to-End Testing
Build a regression test suite with known-good query-answer pairs:
```python
TEST_CASES = [
    {
        "query": "What are the side effects of ibuprofen?",
        "expected_chunks": ["doc123", "doc456"],
        "expected_answer_contains": ["stomach upset", "bleeding risk"],
        "faithfulness_threshold": 0.95,
    },
    # ... 50-100 test cases
]
```
```python
def run_rag_tests():
    results = []
    for test in TEST_CASES:
        retrieved = retriever.search(test['query'], top_k=5)
        answer = generator.generate(test['query'], retrieved)

        # Check retrieval: compare retrieved chunk IDs against expected IDs
        retrieved_ids = {chunk['id'] for chunk in retrieved}
        recall = len(retrieved_ids & set(test['expected_chunks'])) / len(test['expected_chunks'])

        # Check answer: judge faithfulness against the concatenated chunk text
        context = "\n\n".join(chunk['text'] for chunk in retrieved)
        faithfulness = evaluate_faithfulness(context, answer)['faithfulness_score']
        contains_expected = all(kw in answer.lower() for kw in test['expected_answer_contains'])

        results.append({
            'query': test['query'],
            'recall': recall,
            'faithfulness': faithfulness,
            'contains_expected': contains_expected,
            'passed': recall > 0.7 and faithfulness > test['faithfulness_threshold'] and contains_expected,
        })
    return results
```
Run this on every deploy. If the pass rate drops below 90%, roll back.
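The deploy gate itself is a one-liner over the results list the suite returns. A sketch assuming each result dict carries the `passed` flag:

```python
def deploy_gate(results, min_pass_rate=0.9):
    """True if the regression suite's pass rate clears the deploy threshold."""
    if not results:
        return False
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate >= min_pass_rate

results = [{"passed": True}] * 19 + [{"passed": False}]
print(deploy_gate(results))  # True (0.95 pass rate)
```

Note the empty-suite guard: a pipeline bug that produces zero results should block the deploy, not wave it through.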
Debugging Failing Retrievals
When a query returns bad results:
1. Check Query Embedding
Find nearest neighbors to the query embedding in your vector DB:
```python
query_embedding = embed(query)
neighbors = vector_db.search(query_embedding, k=20)
print([chunk['text'] for chunk in neighbors])
```
If top results are irrelevant, your embedding model doesn't understand the query domain.
2. Inspect Retrieved Chunks
Look at the exact text and metadata of retrieved chunks:
```python
for chunk in retrieved_chunks:
    print(f"Score: {chunk['score']:.3f}")
    print(f"Text: {chunk['text'][:200]}...")
    print(f"Metadata: {chunk['metadata']}")
    print("---")
```
Check for chunking artifacts (truncated sentences, missing context).
3. Test Hypothetical Document Embeddings (HyDE)
Generate a hypothetical answer and embed that instead of the query:
```python
hypothetical_answer = llm.invoke(f"Generate a detailed answer to: {query}")
hyde_embedding = embed(hypothetical_answer)
hyde_results = vector_db.search(hyde_embedding, k=10)
```
Compare HyDE results to direct query results. If HyDE is much better, your queries are too short or poorly phrased.
For interactive debugging of RAG retrieval issues, tools like RAG Debugger provide real-time visualization of chunk retrieval, similarity scores, and LLM response analysis.
Production Monitoring
Track these metrics in production:
Per-Query Metrics (Logged)
- Number of chunks retrieved
- Top similarity score
- Average similarity score
- Retrieval latency (p50, p95, p99)
- Generation latency
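These per-query fields fit naturally into one structured log line. A stdlib-only sketch; the `logger` callable and the shape of `results` (dicts with a `score` key) are assumptions:

```python
import json
import time

def log_query_metrics(query, results, retrieval_ms, generation_ms, logger=print):
    """Emit one structured log line per query with the fields listed above."""
    scores = [r["score"] for r in results]
    record = {
        "ts": time.time(),
        "query": query,
        "num_chunks": len(results),
        "top_score": max(scores) if scores else None,
        "avg_score": sum(scores) / len(scores) if scores else None,
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
    }
    logger(json.dumps(record))
    return record

rec = log_query_metrics("ibuprofen side effects",
                        [{"score": 0.91}, {"score": 0.72}],
                        retrieval_ms=42.0, generation_ms=810.0)
```

Logging `None` rather than 0 for empty retrievals keeps the zero-result case distinguishable downstream.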
Aggregate Metrics (Dashboarded)
- Daily average top-1 similarity score (should be stable)
- Fraction of queries with 0 results (should be <5%)
- Faithfulness score on sampled queries (weekly batch eval)
Alerts
- Alert if average top-1 score drops >10% (embedding drift or index corruption)
- Alert if 0-result rate spikes (index unavailable or query preprocessing broken)
- Alert if p95 retrieval latency >500ms (performance degradation)
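The three alert rules can run as a single daily check over dashboard snapshots. A sketch; the snapshot field names are hypothetical:

```python
def check_alerts(today, baseline):
    """Evaluate the three alert conditions against daily metric snapshots."""
    alerts = []
    drop = (baseline["avg_top1"] - today["avg_top1"]) / baseline["avg_top1"]
    if drop > 0.10:
        alerts.append("avg top-1 score dropped >10%: embedding drift or index corruption")
    if today["zero_result_rate"] > 0.05:
        alerts.append("0-result rate above 5%: index unavailable or preprocessing broken")
    if today["p95_retrieval_ms"] > 500:
        alerts.append("p95 retrieval latency >500ms: performance degradation")
    return alerts

print(check_alerts(
    {"avg_top1": 0.70, "zero_result_rate": 0.02, "p95_retrieval_ms": 620},
    {"avg_top1": 0.82},
))
```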
Conclusion
RAG evaluation requires measuring both retrieval (precision, recall) and generation (faithfulness, relevancy). Use LLM-as-judge for automated evaluation at scale. Build regression test suites with known-good examples. Monitor production metrics and alert on deviations. And when debugging, always start with inspecting the retrieved chunks—garbage in, garbage out.