Why RAG Evaluation Is Harder Than It Looks
Most teams ship a RAG system, run a few manual queries, think it looks reasonable, and deploy. Then users report that the system confidently cites things that are not in any document, misses obvious answers that are clearly in the knowledge base, or retrieves vaguely related chunks that dilute the final answer.
The problem is that RAG quality has at least two independent failure axes: retrieval quality (did we retrieve the right chunks?) and generation quality (given retrieved chunks, did we generate a faithful, relevant answer?). A system can retrieve perfectly and still hallucinate. It can generate faithfully from bad context and still give wrong answers. You need metrics that distinguish these failure modes.
The Core Metric Taxonomy
Before looking at specific frameworks, understand what the metrics measure. There are four primary metrics that cover the retrieval-generation quality space:
Faithfulness
Does the generated answer contain only information that can be grounded in the retrieved context? A faithful answer makes no claims beyond what the retrieved documents support. This is the hallucination detection metric. A score of 1.0 means every statement in the answer is attributable to a retrieved chunk; 0.0 means the answer is entirely fabricated.
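Once a judge model has decomposed the answer into atomic claims and labeled each as supported or not, the score itself is a simple ratio. A minimal sketch (the claim labels here are hand-written; in practice they come from the LLM judge):

```python
def faithfulness_score(claims: list[tuple[str, bool]]) -> float:
    """claims: (claim_text, supported_by_context) pairs.
    Score = supported claims / total claims."""
    if not claims:
        return 0.0  # no verifiable claims: treat as unfaithful
    return sum(1 for _, supported in claims if supported) / len(claims)

claims = [
    ("Claude 3.5 Sonnet has a 200K context window.", True),
    ("It outputs up to 8192 tokens.", True),
    ("It was released in 2023.", False),  # not in the retrieved chunks
]
print(f"{faithfulness_score(claims):.2f}")  # 0.67
```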
Answer Relevancy
Is the generated answer actually responsive to the question asked? A system can retrieve perfect context and generate a faithful but irrelevant response (for example, answering a different question that happens to be answerable from the same documents). Answer relevancy penalizes off-topic responses even when they are factually grounded.
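One common way to operationalize this (the approach RAGAS takes) is to ask an LLM to generate questions from the answer alone, then measure how similar those are to the question actually asked; an off-topic answer yields dissimilar questions. A runnable sketch, with a bag-of-words cosine standing in for real embeddings:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def answer_relevancy(question: str, generated_questions: list[str]) -> float:
    """Mean similarity between the original question and questions an LLM
    regenerated from the answer. Bag-of-words cosine stands in for embeddings."""
    q_vec = Counter(question.lower().split())
    sims = [cosine(q_vec, Counter(g.lower().split())) for g in generated_questions]
    return sum(sims) / len(sims) if sims else 0.0

# Questions regenerated from an on-topic answer resemble the original question
print(answer_relevancy(
    "what is the context window of claude 3.5 sonnet",
    ["what is the context window of claude 3.5 sonnet",
     "how many tokens can claude 3.5 sonnet accept"],
))
```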
Context Precision
Of the chunks retrieved, what fraction were actually needed to answer the question? Low context precision means the retriever returned noisy chunks that the generator had to ignore or work around. High-precision retrieval fetches only the signal, no noise.
Context Recall
Given the correct answer, were all the necessary facts present in the retrieved context? Low recall means your retriever missed chunks that were required to answer correctly. This requires a ground-truth reference answer to compute.
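At a fixed retrieval budget, both context metrics come down to set arithmetic over chunk relevance. A simplified sketch (real implementations use an LLM judge to decide relevance, and RAGAS weights precision by rank rather than taking a flat ratio):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks actually needed for the answer."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

def context_recall(retrieved: list[str], required: set[str]) -> float:
    """Fraction of required facts covered by the retrieved chunks."""
    if not required:
        return 1.0  # nothing was needed, so nothing was missed
    return sum(1 for fact in required if fact in retrieved) / len(required)

retrieved = ["chunk_a", "chunk_b", "chunk_c"]
print(context_precision(retrieved, relevant={"chunk_a"}))          # two noisy chunks
print(context_recall(retrieved, required={"chunk_a", "chunk_d"}))  # chunk_d was missed
```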
RAGAS: Reference-Free RAG Evaluation
RAGAS (Retrieval-Augmented Generation Assessment) implements all four metrics using an LLM as judge internally, which allows reference-free evaluation of faithfulness and answer relevancy. You do not need ground-truth answers for every metric: faithfulness and answer relevancy are reference-free, while context precision and context recall compare against a ground-truth reference.
Basic RAGAS Setup
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
answer_correctness,
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from datasets import Dataset
# Configure evaluator LLM (can differ from your RAG pipeline LLM)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o", temperature=0))
evaluator_embeddings = LangchainEmbeddingsWrapper(
OpenAIEmbeddings(model="text-embedding-3-small")
)
# Prepare evaluation dataset
# Each row: question, answer (generated), contexts (retrieved), ground_truth (optional)
eval_data = {
"question": [
"What is the maximum token context window of Claude 3.5 Sonnet?",
"Which vector database has the lowest p99 latency under 1M vectors?",
],
"answer": [
"Claude 3.5 Sonnet supports a 200,000 token context window.",
"According to the benchmarks, Qdrant achieves the lowest p99 latency.",
],
"contexts": [
[
"Claude 3.5 Sonnet has a context window of 200K tokens and outputs up to 8192 tokens.",
"Anthropic's model family supports varying context lengths depending on the tier.",
],
[
"Benchmark results show Qdrant at 4.2ms p99 for 1M vectors with HNSW index.",
"Pinecone achieves 6.1ms p99 under similar conditions.",
],
],
"ground_truth": [
"200,000 tokens",
"Qdrant",
],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
result = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
llm=evaluator_llm,
embeddings=evaluator_embeddings,
)
print(result)
# Outputs a dict: {faithfulness: 0.92, answer_relevancy: 0.87, ...}
# Convert to DataFrame for analysis
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy", "context_precision", "context_recall"]])
RAGAS in CI/CD
The most valuable use of RAGAS is as a regression gate. When you change your chunking strategy, swap embedding models, or modify your prompt template, run RAGAS on a fixed evaluation set and fail the pipeline if any metric drops below threshold:
import json
import sys
from pathlib import Path
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
THRESHOLDS = {
"faithfulness": 0.85,
"answer_relevancy": 0.80,
"context_precision": 0.75,
"context_recall": 0.70,
}
def run_regression_eval(eval_dataset_path: str, output_path: str) -> bool:
data = json.loads(Path(eval_dataset_path).read_text())
dataset = Dataset.from_dict(data)
result = evaluate(
dataset=dataset,
metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
    # Average per-sample scores into one value per metric (robust across ragas versions)
    scores = {m: float(result.to_pandas()[m].mean()) for m in THRESHOLDS}
Path(output_path).write_text(json.dumps(scores, indent=2))
passed = True
for metric, threshold in THRESHOLDS.items():
score = scores.get(metric, 0.0)
status = "PASS" if score >= threshold else "FAIL"
print(f" {metric}: {score:.3f} (threshold: {threshold}) [{status}]")
if score < threshold:
passed = False
return passed
if __name__ == "__main__":
ok = run_regression_eval("eval_set.json", "eval_results.json")
sys.exit(0 if ok else 1)
TruLens: Continuous Monitoring and Feedback Loops
TruLens (formerly TruLens-Eval), originally built by TruEra, focuses on production monitoring rather than offline evaluation. It wraps your RAG pipeline in a "recorder" that captures every input, output, and retrieved context. Evaluations run asynchronously on the captured data, so you can score production traffic without adding latency to user requests.
Instrumenting a Pipeline with TruLens
import os
from trulens.core import TruSession, Feedback
from trulens.apps.langchain import TruChain
from trulens.providers.openai import OpenAI as TruOpenAI
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Initialize TruLens session (SQLite for dev, Postgres for prod)
session = TruSession(database_url="sqlite:///rag_evals.db")
session.reset_database()  # dev convenience: wipes all previously recorded runs
# Build your RAG chain (simplified)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
"Answer the question based only on the following context:\n\n{context}\n\nQuestion: {question}"
)
def retrieve(question: str) -> str:
# Your actual retriever here
return "Retrieved context would appear here."
rag_chain = (
{"context": retrieve, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Define feedback functions
provider = TruOpenAI(model_engine="gpt-4o")
# Groundedness = faithfulness: does the answer only contain claims from the context?
# select_context assumes TruLens can locate a retriever step in the chain;
# adjust the selector lens to match your own app structure.
context = TruChain.select_context(rag_chain)
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # all retrieved chunks, collected into one list
    .on_output()
)
# Answer relevance: is the answer relevant to the question?
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)
# Context relevance: how relevant is each retrieved chunk to the question?
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context)
    .aggregate(min)  # min over chunks is conservative
)
# Wrap your chain for recording
tru_rag = TruChain(
rag_chain,
app_name="ProductionRAG",
app_version="v1.2.0",
feedbacks=[f_groundedness, f_answer_relevance, f_context_relevance],
)
# Run queries — TruLens records everything
test_questions = [
"What are the rate limits for the OpenAI API?",
"How does HNSW indexing work?",
"What is the difference between BM25 and dense retrieval?",
]
with tru_rag as recording:
for q in test_questions:
response = rag_chain.invoke(q)
print(f"Q: {q}")
print(f"A: {response}\n")
# View results
leaderboard = session.get_leaderboard()
print(leaderboard)
# Launch dashboard (Streamlit UI)
# session.run_dashboard()
TruLens for A/B Testing RAG Configurations
TruLens's app versioning makes it straightforward to compare two configurations. Register both configurations, run the same question set through each, and the leaderboard shows side-by-side metric comparisons. This is invaluable when deciding whether a new chunking strategy or embedding model actually improves quality in practice.
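Concretely, the comparison amounts to diffing mean feedback scores between the two registered app versions. A small sketch of that decision logic, using illustrative scores in place of a real session.get_leaderboard() result; the metric names and the min_delta noise floor are assumptions:

```python
def compare_versions(
    scores_a: dict[str, float],
    scores_b: dict[str, float],
    min_delta: float = 0.02,
) -> dict[str, str]:
    """Label each shared metric as improved / regressed / unchanged,
    ignoring deltas smaller than min_delta (LLM-judge noise)."""
    verdicts = {}
    for metric in scores_a.keys() & scores_b.keys():
        delta = scores_b[metric] - scores_a[metric]
        if delta >= min_delta:
            verdicts[metric] = f"improved (+{delta:.3f})"
        elif delta <= -min_delta:
            verdicts[metric] = f"regressed ({delta:.3f})"
        else:
            verdicts[metric] = "unchanged"
    return verdicts

# Mean feedback scores, as pulled from the leaderboard for each app_version
v1 = {"Groundedness": 0.88, "Answer Relevance": 0.91, "Context Relevance": 0.74}
v2 = {"Groundedness": 0.93, "Answer Relevance": 0.90, "Context Relevance": 0.81}
print(compare_versions(v1, v2))
```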
DeepEval: Unit Testing for LLM Applications
DeepEval brings a pytest-native testing philosophy to LLM evaluation. You write test cases that assert metric thresholds, and failing evaluations fail your test suite. This is the most CI/CD-friendly of the three frameworks because it integrates with existing Python testing infrastructure.
DeepEval Test Cases
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
FaithfulnessMetric,
AnswerRelevancyMetric,
ContextualPrecisionMetric,
ContextualRecallMetric,
HallucinationMetric,
BiasMetric,
ToxicityMetric,
)
# Define your RAG pipeline (replace with your actual implementation)
def run_rag(question: str) -> tuple[str, list[str]]:
"""Returns (answer, retrieved_contexts)"""
# Your pipeline here
answer = "Placeholder answer"
contexts = ["Placeholder context chunk 1", "Placeholder context chunk 2"]
return answer, contexts
# Faithfulness metric — threshold 0 to 1, higher is better
faithfulness_metric = FaithfulnessMetric(
threshold=0.8,
model="gpt-4o",
include_reason=True,
)
# Answer relevancy
answer_relevancy_metric = AnswerRelevancyMetric(
threshold=0.75,
model="gpt-4o",
include_reason=True,
)
# Contextual precision — requires ground truth
contextual_precision_metric = ContextualPrecisionMetric(
threshold=0.7,
model="gpt-4o",
)
# Hallucination detection (alternative to faithfulness)
hallucination_metric = HallucinationMetric(
threshold=0.2, # Lower is better for this metric
model="gpt-4o",
)
@pytest.mark.parametrize("question,ground_truth", [
(
"What is the max context window of Claude 3.5 Sonnet?",
"200,000 tokens",
),
(
"Which Python version introduced structural pattern matching?",
"Python 3.10",
),
(
"What HTTP status code means too many requests?",
"429 Too Many Requests",
),
])
def test_rag_quality(question: str, ground_truth: str):
answer, contexts = run_rag(question)
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        expected_output=ground_truth,
        retrieval_context=contexts,
        context=contexts,  # HallucinationMetric evaluates against `context`, not `retrieval_context`
    )
assert_test(test_case, [
faithfulness_metric,
answer_relevancy_metric,
contextual_precision_metric,
hallucination_metric,
])
# Advanced: custom metric using G-Eval (LLM-as-judge with rubric)
from deepeval.metrics import GEval  # LLMTestCaseParams is already imported above
conciseness_metric = GEval(
name="Conciseness",
criteria="Does the answer avoid unnecessary verbosity and filler phrases?",
evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
threshold=0.7,
model="gpt-4o",
)
def test_answer_conciseness():
question = "What does the acronym API stand for?"
answer, contexts = run_rag(question)
test_case = LLMTestCase(
input=question,
actual_output=answer,
retrieval_context=contexts,
)
assert_test(test_case, [conciseness_metric])
Running DeepEval in GitHub Actions
# .github/workflows/rag-eval.yml
name: RAG Quality Gate
on:
pull_request:
paths:
- 'src/rag/**'
- 'prompts/**'
- 'tests/rag/**'
jobs:
evaluate:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.11'
- run: pip install deepeval pytest
- name: Run RAG quality tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          CONFIDENT_API_KEY: ${{ secrets.CONFIDENT_API_KEY }}  # optional: Confident AI dashboard
        run: deepeval test run tests/rag/test_quality.py
Implementing Faithfulness Without a Framework
Sometimes you want to understand what faithfulness measurement actually does under the hood, or you need to implement it in an environment where installing RAGAS is not feasible. Here is a minimal faithful-to-context checker:
from openai import OpenAI
from dataclasses import dataclass
from typing import Optional
import json
client = OpenAI()
@dataclass
class FaithfulnessResult:
score: float
supported_claims: list[str]
unsupported_claims: list[str]
reasoning: str
def measure_faithfulness(
answer: str,
contexts: list[str],
model: str = "gpt-4o",
) -> FaithfulnessResult:
"""
Decompose the answer into atomic claims, then verify each claim
against the provided context. Returns a faithfulness score in [0, 1].
"""
context_text = "\n\n---\n\n".join(
f"[Context {i+1}]: {ctx}" for i, ctx in enumerate(contexts)
)
system_prompt = """You are a faithfulness evaluation judge. Your task is to:
1. Decompose the ANSWER into atomic factual claims (one claim per bullet).
2. For each claim, determine if it can be directly supported by the CONTEXT.
3. Return a JSON object with this exact structure:
{
"claims": [
{"claim": "...", "supported": true/false, "evidence": "quote from context or null"}
],
"reasoning": "brief explanation of your evaluation"
}
Only return valid JSON, no other text."""
user_prompt = f"""CONTEXT:
{context_text}
ANSWER:
{answer}
Evaluate each factual claim in the answer against the context."""
response = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt},
],
temperature=0,
response_format={"type": "json_object"},
)
data = json.loads(response.choices[0].message.content)
claims = data["claims"]
supported = [c["claim"] for c in claims if c["supported"]]
unsupported = [c["claim"] for c in claims if not c["supported"]]
score = len(supported) / len(claims) if claims else 0.0
return FaithfulnessResult(
score=score,
supported_claims=supported,
unsupported_claims=unsupported,
reasoning=data.get("reasoning", ""),
)
# Usage
result = measure_faithfulness(
answer="The HNSW algorithm provides O(log n) query time and was developed at Yandex.",
contexts=[
"HNSW (Hierarchical Navigable Small Worlds) provides approximate nearest neighbor search "
"with O(log n) query complexity and O(n log n) construction time.",
"The algorithm was introduced by Yu. A. Malkov and D. A. Yashunin in their 2018 paper.",
],
)
print(f"Faithfulness: {result.score:.2f}")
print(f"Unsupported: {result.unsupported_claims}")
Choosing the Right Evaluation Framework
The three frameworks serve different stages of the development lifecycle:
- RAGAS is best for offline evaluation and benchmarking. Use it to measure quality before and after infrastructure changes. Its reference-free metrics make it practical even without a large labeled dataset.
- TruLens is best for production monitoring and A/B testing. Its async evaluation model means you can score production traffic without adding latency. The leaderboard UI makes it easy to compare configurations over time.
- DeepEval is best for CI/CD integration. Its pytest-native design lets you add RAG quality gates to existing test suites. Fail a pull request if faithfulness drops below 0.85.
In a mature RAG engineering workflow, all three appear in different places: DeepEval in the CI pipeline, RAGAS for quarterly benchmark reports, and TruLens for production dashboards. They are complementary, not competing.
Common Evaluation Pitfalls
Before closing, three pitfalls that consistently mislead teams:
- Evaluator model quality matters. Using gpt-3.5-turbo as your faithfulness judge produces unreliable scores. The evaluator model needs to be at least as capable as the model being evaluated; use GPT-4o or Claude 3.5 Sonnet as your judge.
- Evaluation set distribution shift. An evaluation set curated by engineers who know the system will be easier than real user queries. Sample real user queries (with PII removed) for your evaluation set.
- Context precision without a retrieval budget. Context precision is meaningless if you're allowed to retrieve 20 chunks. Define a fixed retrieval budget (k=5 is a reasonable default) before measuring precision.
Conclusion
RAG evaluation is not a one-time activity. It is an ongoing engineering discipline. The teams shipping the most reliable RAG systems instrument their pipelines with continuous evaluation, set metric thresholds in CI, and treat falling scores with the same urgency as rising error rates. The tooling — RAGAS, TruLens, DeepEval — exists. The discipline to use it consistently is what separates production-ready systems from perpetual prototypes.