A rigorous engineering guide to LLM evaluation — covering RAGAS, TruLens, DeepEval, the mathematics behind retrieval and generation metrics, and how to wire it all into CI/CD so quality regressions never ship.
"You can't improve what you can't measure. In LLM systems, this is a mathematical statement, not a platitude."
RAGAS · TruLens · DeepEval · LLM-as-Judge · RAG Eval · NIST AI RMF
01
Why Evaluation Matters
// THE PROBLEM WITH VIBES-BASED DEVELOPMENT
LLM systems fail silently. A prompt change that improves one query type degrades five others. A new retriever version gets better recall but retrieves noisier chunks. Without systematic evaluation, you are flying blind — "it seems better" is not a shipping criterion.
The evaluation gap is the defining engineering challenge of LLM development. Traditional software has deterministic outputs you can assert against. LLMs produce probabilistic, open-ended text — requiring a new discipline that combines information retrieval theory, NLP metrics, and statistical testing.
Offline Evaluation
Run against a held-out labeled dataset before deployment. Catches regressions before users see them. Gives you a reproducible benchmark. Fast feedback loop in CI.
Online Evaluation
Monitor real production traffic with automated quality signals. Detects distribution shift, emerging failure modes, and per-user degradation that offline sets can't anticipate.
Human Evaluation
The ground truth. Use for calibrating automated metrics, adjudicating edge cases, and building preference datasets. Expensive — use strategically, not as a primary loop.
The evaluation triad: Every LLM eval system must track three things simultaneously — quality (does it answer correctly?), safety (does it avoid harmful outputs?), and cost/latency (is it fast and cheap enough?). Optimizing one without tracking the others produces systems that degrade on the ignored dimensions.
02
Evaluation Taxonomy
// WHAT ARE YOU ACTUALLY MEASURING
Not all eval metrics are created equal. The choice of metric depends on what component you're evaluating, what failure mode you're guarding against, and whether you have ground-truth labels available.
| Metric Type | Requires Ground Truth? | What It Measures | Typical Tools |
| --- | --- | --- | --- |
| Reference-based | YES | Similarity of generated output to a gold-standard answer (BLEU, ROUGE, BERTScore, Exact Match) | NLTK, HuggingFace evaluate |
| Reference-free | NO | Internal quality signals (faithfulness, coherence, toxicity), evaluated without a gold answer | RAGAS, TruLens, DeepEval |
| LLM-as-Judge | OPTIONAL | A strong LLM (GPT-4, Claude) scores the response on defined rubrics. Flexible and scalable, but expensive | OpenAI Evals, LangSmith, RAGAS |
| Retrieval-specific | YES | Precision, Recall, NDCG, MRR: whether the right chunks were retrieved | BEIR, RAGAS, custom IR pipelines |
| Task-specific | YES | Custom metrics for your domain: F1 on entity extraction, accuracy on MCQ, slot-fill rate for agents | |
The BLEU trap: BLEU and ROUGE were designed for machine translation and summarization in the early 2000s. They measure n-gram overlap, which correlates poorly with quality for modern open-ended LLM outputs. A response that is semantically identical but uses different phrasing scores near zero. Do not use BLEU/ROUGE as primary metrics for RAG QA — use BERTScore, semantic similarity, or LLM-as-judge instead.
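To make the n-gram problem concrete, here is a minimal sketch of clipped bigram precision, the quantity at the core of BLEU (the brevity penalty and multi-order averaging are left out on purpose). The function name and the example sentences are illustrative assumptions, not part of any library.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int = 2) -> float:
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


reference = "returns are accepted within 30 days of purchase"
paraphrase = "you can send items back for a month after buying them"
verbatim = "returns are accepted within 30 days of purchase"

paraphrase_score = ngram_precision(paraphrase, reference)
verbatim_score = ngram_precision(verbatim, reference)
print(paraphrase_score, verbatim_score)
```

The paraphrase carries the same meaning as the reference but shares no bigrams with it, so its overlap score collapses while the verbatim copy scores perfectly.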
03
RAG Evaluation Primer
// THE FOUR FAILURE MODES OF A RAG PIPELINE
Retrieval-Augmented Generation has two independently-failing components — the retriever and the generator. A bug in either produces wrong answers but in very different ways. You need metrics that isolate each component, plus end-to-end metrics that evaluate the full pipeline.
User Query (input) → Retriever (chunk selection) → Retrieved Chunks (context) → Generator (LLM response) → Final Answer (output)
Retriever Failure Modes (Upstream)
Low precision: Retrieved chunks are irrelevant noise — LLM hallucinates to fill gaps
Low recall: Relevant chunks are missing — LLM can't answer even with perfect generation
Wrong ordering: The most relevant chunk is ranked low or buried mid-context, where LLMs attend to it least (lost-in-the-middle failure)
Chunk granularity: Chunks too large (dilution) or too small (missing context)
Generator Failure Modes (Downstream)
Hallucination: Claims facts not present in retrieved context
Context ignorance: Ignores retrieved context, answers from parametric memory
Faithfulness: Contradicts or misrepresents what the context says
Verbosity / conciseness: Answers are padded or missing key detail
The Core Evaluation Triangle
Faithfulness (F): % of claims in the answer that are grounded in retrieved context
Answer Relevance (AR): How completely and precisely the answer addresses the query
Context Precision (CP): What fraction of retrieved chunks are actually relevant
Context Recall (CR): What fraction of the needed info was present in retrieved context
04
RAGAS
// RETRIEVAL AUGMENTED GENERATION ASSESSMENT
RAGAS (Retrieval Augmented Generation Assessment) is the most widely-adopted framework for reference-free RAG evaluation. It uses an LLM evaluator to score your RAG pipeline across four core metrics without requiring manually-labeled gold answers — making it practical to run at scale.
Reference-free by design: RAGAS's key insight is that you can evaluate RAG quality using only three inputs — the question, the retrieved context, and the generated answer. No ground-truth required for Faithfulness and Answer Relevance. Context Recall does require a ground-truth answer as proxy for "all necessary information."
Metric 1: Faithfulness
F = |claims in answer supported by retrieved context| / |total claims in answer|
Decompose the answer into atomic claims. An LLM judge verifies each claim against the retrieved context. Score = fraction of claims that are verifiable, in the range [0, 1]. A score below 0.8 indicates significant hallucination.
Metric 2: Answer Relevance
AR = mean( cosine_similarity( embed(q), embed(q̂ᵢ) ) ) for i in 1..n
Generate n synthetic questions from the answer. Embed the original query and each synthetic question. Mean cosine similarity = how well the answer "covers" the query. Low score = off-topic or incomplete.
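The Answer Relevance formula can be sketched in a few lines. This is an illustration only: the embeddings below are hand-made 3-dimensional toy vectors standing in for real embedding-model output, and the function names are assumptions rather than RAGAS internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevance(query_embedding, synthetic_question_embeddings):
    """Mean cosine similarity between the query and each question regenerated from the answer."""
    sims = [cosine_similarity(query_embedding, e) for e in synthetic_question_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings (assumed values, not produced by a real model)
query = [1.0, 0.0, 0.0]
on_topic_questions = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
off_topic_questions = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

on_topic_score = answer_relevance(query, on_topic_questions)
off_topic_score = answer_relevance(query, off_topic_questions)
print(on_topic_score, off_topic_score)
```

An answer whose regenerated questions point in the same direction as the original query scores high; an answer about something else scores near zero.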
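Metric 3: Context Precision
CP@K = Σₖ ( Precision@k × vₖ ) / (number of relevant chunks in top K), where vₖ = 1 if the chunk at rank k is relevant, else 0
Weighted precision that rewards relevant chunks appearing at higher positions and penalizes retrievers that bury the relevant chunk at a low rank. Equivalent to Average Precision in IR literature.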
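As a worked illustration of rank-weighted context precision (Average Precision over the retrieved chunks), the sketch below takes a list of 0/1 relevance verdicts in rank order. In RAGAS those verdicts come from the LLM judge; here they are hard-coded assumptions.

```python
def context_precision(relevance_flags):
    """Average Precision over retrieved chunks; flags are 1 (relevant) or 0, in rank order."""
    relevant_total = sum(relevance_flags)
    if relevant_total == 0:
        return 0.0
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # Precision@rank, counted only at relevant ranks
    return weighted_sum / relevant_total


top_ranked = context_precision([1, 0, 0, 0, 0])  # relevant chunk first
buried = context_precision([0, 0, 0, 0, 1])      # same chunk at the last position
mixed = context_precision([1, 0, 1])
print(top_ranked, buried, mixed)
```

The same single relevant chunk yields a full score at rank 1 and one fifth of that at rank 5, which is exactly the ordering penalty the metric is meant to express.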
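Metric 4: Context Recall
CR = |ground-truth sentences attributable to retrieved context| / |total ground-truth sentences|
Decompose the ground-truth answer into sentences. An LLM judge checks whether each sentence can be attributed to at least one retrieved chunk. Requires a ground-truth answer.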
Installation & Quick Start
Python — RAGAS Quick Start
```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,  # requires ground truth
    answer_similarity,   # semantic similarity to ground truth
)

# Build your evaluation dataset.
# Each row: question, answer, contexts (list of chunks), ground_truth
eval_data = {
    "question": ["What is the return policy?"],  # add one entry per example
    "answer": ["Returns accepted within 30 days..."],
    "contexts": [["Our return policy states...", "Items must be..."]],
    "ground_truth": ["Returns are accepted within 30 days of purchase"],
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation with an explicitly chosen judge LLM and embedding model
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),  # or Claude, Gemini
    embeddings=OpenAIEmbeddings(),
    raise_exceptions=False,
    show_progress=True,
)

# Results as pandas DataFrame
df = result.to_pandas()
print(result)
# Example output:
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.82, 'context_recall': 0.79}
```
RAGAS Score Interpretation
| Metric | ≥ 0.9 (Excellent) | 0.7–0.9 (Acceptable) | < 0.7 (Needs Work) | Primary Root Cause |
| --- | --- | --- | --- | --- |
| Faithfulness | Minimal hallucination | Some ungrounded claims | High hallucination rate | Weak system prompt, too-short context, aggressive summarization |
| Answer Relevancy | On-topic, complete | Partially addressed | Off-topic / vague | Poor retrieval, bad prompt structure, over-verbose LLM |
| Context Precision | Retriever very precise | Some noise in context | Too much irrelevant noise | Low similarity threshold, bad chunking, wrong embedding model |
| Context Recall | All info retrieved | Most info present | Key info missing | top-k too small, chunks too large, suboptimal embedding |
05
TruLens
// FEEDBACK FUNCTIONS & THE RAG TRIAD
TruLens (by TruEra) takes a different architectural approach from RAGAS — it instruments your LLM application at the function level, logging every LLM call, and attaches feedback functions that score interactions. This makes it ideal for continuous monitoring of production systems.
TruLens RAG Triad (Core Model)
TruLens maps its feedback functions explicitly to three relationships in a RAG pipeline:
Answer Relevance: Answer ↔ Question — does the answer address the question?
Context Relevance: Context ↔ Question — does the retrieved context match what was asked?
Groundedness: Answer ↔ Context — is the answer grounded in (not hallucinated from) the context?
Continuous Recording (Production)
TruLens wraps your chain/app in a TruChain or TruLlama recorder. Every call is stored in a local SQLite (or cloud) database with full input/output traces, latency, token cost, and computed feedback scores. Inspect in the TruLens dashboard or query programmatically.
Python — TruLens RAG Triad Setup
```python
import numpy as np
from trulens.apps.langchain import TruChain
from trulens.core import Feedback, TruSession
from trulens.providers.openai import OpenAI as TruOpenAI

session = TruSession()  # stores results in local SQLite
provider = TruOpenAI(model_engine="gpt-4o")  # LLM-as-judge provider

# Selector for the chunks retrieved by your chain.
# your_langchain_rag_chain is a placeholder for your existing LangChain RAG chain.
context = TruChain.select_context(your_langchain_rag_chain)

# --- Define the three feedback functions ---
# 1. Context Relevance: each chunk vs the query
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()          # question
    .on(context)         # each retrieved chunk
    .aggregate(np.mean)  # mean over all chunks
)

# 2. Groundedness: answer claims vs context
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # all chunks as a list
    .on_output()            # final answer
)

# 3. Answer Relevance: answer vs question
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

# --- Wrap your LangChain RAG chain ---
tru_recorder = TruChain(
    your_langchain_rag_chain,
    app_name="product-qa-rag",
    app_version="v2.3.1",  # version tag for regression tracking
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

# --- Run evaluation ---
# eval_questions is a placeholder for your list of evaluation queries.
with tru_recorder as recording:
    for question in eval_questions:
        response = your_langchain_rag_chain.invoke({"query": question})

# View results
session.get_leaderboard()  # compare across app_version tags
session.run_dashboard()    # launch local Streamlit dashboard
```
TruLens Leaderboard — Version Comparison
The TruLens leaderboard pattern is the key mechanism for proving improvement. Tag each experiment with a version string and compare across runs — the leaderboard shows all three RAG triad scores plus latency and cost per version.
Python — Programmatic Regression Check
```python
# Compare v2.3.1 against the v2.3.0 baseline
leaderboard = session.get_leaderboard(app_ids=["product-qa-rag"])
print(leaderboard[["app_version", "Groundedness", "Answer Relevance", "Context Relevance", "latency", "total_cost"]])
# Example output:
#   app_version  Groundedness  Answer Relevance  Context Relevance  latency  total_cost
#   v2.3.0       0.74          0.82              0.71               1.23s    $0.0042
#   v2.3.1       0.91          0.88              0.85               1.31s    $0.0045   <- improved

# Fail CI if any metric drops by more than 0.05 (5 points) from baseline
baseline = leaderboard[leaderboard["app_version"] == "v2.3.0"].iloc[0]
current = leaderboard[leaderboard["app_version"] == "v2.3.1"].iloc[0]
threshold = 0.05
for metric in ["Groundedness", "Answer Relevance", "Context Relevance"]:
    delta = current[metric] - baseline[metric]
    if delta < -threshold:
        raise AssertionError(f"{metric} regressed by {delta:.2%}, blocking merge")
```
06
DeepEval
// PYTEST FOR LLM SYSTEMS
DeepEval takes a testing-first philosophy — it integrates directly with pytest and provides a rich library of LLM-specific metrics as assert-able test cases. Think of it as "unit tests for your prompts." It covers RAG metrics, conversational metrics, agentic task metrics, and safety checks in a single framework.
G-Eval (Flexible)
Define custom evaluation criteria in plain English. The LLM judge generates an evaluation chain-of-thought, then produces a score, which DeepEval reports on a 0–1 scale so it can be compared against a threshold such as 0.7. Best for domain-specific quality attributes without coding a custom metric.
DAG Metrics (Agentic)
Evaluate multi-step agent tasks using a Directed Acyclic Graph of expected tool calls and intermediate states. Measures task completion rate, trajectory correctness, and efficiency.
Conversational (Multi-turn)
Role-consistency, knowledge retention across turns, and conversation completeness for chatbot applications. Goes beyond single-turn RAG evaluation.
Python — DeepEval Pytest Integration
```python
# test_rag_quality.py
# Run with: deepeval test run test_rag_quality.py
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    HallucinationMetric,
    ToxicityMetric,
    GEval,
)

# load_test_cases and rag_pipeline are placeholders for your own
# dataset loader and RAG application; they are not part of DeepEval.


@pytest.mark.parametrize("test_case", load_test_cases("eval_dataset.json"))
def test_rag_pipeline(test_case):
    # Run your RAG pipeline on the test case
    result = rag_pipeline.query(test_case.input)
    llm_test_case = LLMTestCase(
        input=test_case.input,
        actual_output=result.answer,
        expected_output=test_case.expected,  # ground truth (optional)
        retrieval_context=result.retrieved_chunks,
        context=test_case.reference_documents,
    )
    assert_test(llm_test_case, [
        FaithfulnessMetric(threshold=0.85, model="gpt-4o"),
        AnswerRelevancyMetric(threshold=0.80),
        ContextualPrecisionMetric(threshold=0.75),
        ContextualRecallMetric(threshold=0.75),
        HallucinationMetric(threshold=0.15),  # hallucination RATE: lower is better
        ToxicityMetric(threshold=0.05),
    ])


# Custom G-Eval: domain-specific quality criterion
conciseness_metric = GEval(
    name="Conciseness",
    model="gpt-4o",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check that the answer does not contain unnecessary padding or repetition",
        "Verify the answer directly addresses the question without preamble",
        "Confirm key information is present without verbosity",
    ],
    threshold=0.7,
)


def test_conciseness():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output=rag_pipeline.query("What is the capital of France?").answer,
    )
    assert_test(test_case, [conciseness_metric])
```
DeepEval CLI: deepeval test run integrates with CI and outputs structured JSON results. Use deepeval login to sync results to the Confident AI cloud dashboard for trend tracking. Any failing assert_test makes the run exit with a non-zero status, which is what a PR gate needs; adding --exit-on-first-failure stops the run at the first failing test.
07
HELM, OpenAI Evals, LangSmith
// CHOOSING THE RIGHT TOOL FOR THE JOB
| Framework | Best For | Approach | Self-hosted? | RAG Native? |
| --- | --- | --- | --- | --- |
| RAGAS | RAG evaluation, fast offline benchmarking | Reference-free LLM-as-judge | YES | YES |
| TruLens | Production monitoring, experiment tracking | Feedback functions + recording | YES | YES |
| DeepEval | Test-driven development, CI integration | Pytest plugin, 30+ metrics | YES | YES |
| HELM | Multi-model benchmarking, academic comparison | Task-based standardized scenarios | YES | PARTIAL |
| OpenAI Evals | Custom task evals, fine-tune validation | YAML-based eval definitions, completions API | PARTIAL | PARTIAL |
| LangSmith | Full LLM ops: tracing + eval + monitoring | Cloud trace store + eval runners | NO | YES |
| Phoenix (Arize) | ML observability, embedding drift | OpenTelemetry traces + eval functions | YES | YES |
| PromptFoo | Prompt regression testing, red-teaming | Config-driven comparisons across providers | YES | PARTIAL |
Recommended stack: Use DeepEval for CI/CD test gates (fast, pytest-native), RAGAS for periodic benchmark runs against your labeled evaluation set, and Phoenix or LangSmith for production trace monitoring. You don't need to pick one — they complement each other.
08
Retrieval Metrics Deep-Dive
// THE MATHEMATICS OF FINDING THE RIGHT CHUNK
Before evaluating the generator, validate the retriever in isolation. These metrics come from classical Information Retrieval and have precise mathematical definitions. Use the BEIR benchmark for zero-shot evaluation of your embedding model.
Precision@k — What fraction of top-k retrieved docs are relevant?
Precision@k = |relevant_docs ∩ retrieved_top_k| / k
Use when the cost of irrelevant context is high (LLM gets confused by noise). A retriever with P@5=0.6 means 3 of 5 retrieved chunks are relevant.
Recall@k — What fraction of all relevant docs are in top-k?
Recall@k = |relevant_docs ∩ retrieved_top_k| / |relevant_docs|
Use when missing relevant context is catastrophic (the system can't answer without it). For QA, recall is typically more important than precision: retrieve more, and let the LLM filter.
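Precision@k and Recall@k can be checked by hand on a small example. The sketch below is a plain-Python illustration; the document IDs are made up, and the ranked list is chosen so that P@5 reproduces the 3-of-5 case mentioned above.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)


retrieved = ["d3", "d7", "d1", "d9", "d4"]  # ranked retriever output (hypothetical IDs)
relevant = {"d1", "d3", "d4", "d8"}         # labeled relevant docs for this query

p_at_5 = precision_at_k(retrieved, relevant, k=5)
r_at_5 = recall_at_k(retrieved, relevant, k=5)
print(p_at_5, r_at_5)
```

Three of the five retrieved chunks are relevant, and three of the four relevant documents were found; the missing document d8 lowers recall without affecting precision.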
Mean Reciprocal Rank — How high is the FIRST relevant doc?
MRR = (1/|Q|) × Σᵢ (1 / rankᵢ)
rankᵢ = position of first relevant document for query i. MRR penalizes systems that bury the answer at position 4 or 5. Critical for "lost-in-the-middle" detection.
NDCG@k — Normalized Discounted Cumulative Gain (graded relevance)
DCG@k = Σᵢ₌₁ᵏ relᵢ / log₂(i + 1), NDCG@k = DCG@k / IDCG@k
IDCG = ideal DCG (perfect ranking). NDCG supports graded relevance (0=irrelevant, 1=somewhat, 2=highly relevant). The log₂ denominator discounts lower positions, so getting relevant docs to rank 1 matters more than rank 5.
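A small sketch of MRR and NDCG@k follows. It assumes the linear-gain form of DCG (relᵢ / log₂(i + 1)), which is one common convention; some implementations use 2^rel − 1 as the gain instead. It also assumes the ideal ranking is built from the same graded labels that appear in the retrieved list, which only holds when every relevant document was retrieved.

```python
import math


def mean_reciprocal_rank(first_relevant_ranks):
    """Ranks are 1-based; use None when no relevant doc was retrieved for a query."""
    reciprocal = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)


def dcg_at_k(gains, k):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(gain / math.log2(i + 1) for i, gain in enumerate(gains[:k], start=1))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering of the same gains."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


mrr = mean_reciprocal_rank([1, 2, 4])  # first relevant doc at ranks 1, 2 and 4
perfect = ndcg_at_k([2, 1, 0], k=3)    # already in ideal order
swapped = ndcg_at_k([2, 0, 1], k=3)    # somewhat-relevant doc pushed to rank 3
print(mrr, perfect, swapped)
```

Moving the somewhat-relevant document from rank 2 to rank 3 lowers NDCG only slightly, because the log₂ discount between adjacent low ranks is small.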
Retriever Benchmarking with BEIR
Python — BEIR Zero-shot Eval
```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load a BEIR dataset (e.g., FIQA for financial QA, NFCorpus for medical)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Swap in your embedding model
model = DRES(models.SentenceBERT("BAAI/bge-large-en-v1.5"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim", k_values=[1, 3, 5, 10])
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

print("NDCG@10:", ndcg["NDCG@10"])     # e.g. 0.412 (BGE-large on FIQA)
print("Recall@5:", recall["Recall@5"])  # e.g. 0.589
```
09
Generation Metrics Deep-Dive
// BEYOND BLEU — SEMANTIC SIMILARITY AND NLI
Generation quality metrics fall into two families: lexical (token overlap — fast, cheap, poor correlation with quality) and semantic (embedding/model-based — expensive, strong correlation). Use semantic metrics for production evaluation.
| Metric | Method | Requires GT? | Correlation w/ Human | Cost |
| --- | --- | --- | --- | --- |
| BLEU | n-gram precision vs reference | YES | Low (QA) | Free |
| ROUGE-L | Longest common subsequence | YES | Medium (summ.) | Free |
| BERTScore | Contextual embedding similarity (F1) | YES | High | Low (local model) |
| Semantic Similarity | Cosine sim of sentence embeddings | YES | High | Low (local model) |
| NLI Entailment | NLI model: does context entail answer? | NO | High | Low (local model) |
| LLM-as-Judge | GPT-4/Claude rates on rubric | OPTIONAL | Very High | High ($) |
BERTScore — token-level precision, recall, F1 using BERT embeddings
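P_BERT = (1/|ŷ|) × Σ_{ŷⱼ ∈ ŷ} max_{yᵢ ∈ y} cos(yᵢ, ŷⱼ)
R_BERT = (1/|y|) × Σ_{yᵢ ∈ y} max_{ŷⱼ ∈ ŷ} cos(yᵢ, ŷⱼ)
F_BERT = 2 × P_BERT × R_BERT / (P_BERT + R_BERT)
ŷ = generated tokens, y = reference tokens. Each token is matched to its closest semantic counterpart in the other sequence. F_BERT ≥ 0.85 on DeBERTa typically indicates high quality.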
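The greedy token matching behind BERTScore can be illustrated without a model. The sketch below uses hand-made 2-dimensional vectors in place of contextual BERT embeddings (an assumption made purely for illustration) and omits the optional IDF weighting and baseline rescaling of the real library.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cos(a, b):
    return sum(x * y for x, y in zip(a, b))


def bertscore_from_embeddings(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall and F1 over token embeddings."""
    cand = [_normalize(v) for v in candidate_vecs]
    ref = [_normalize(v) for v in reference_vecs]
    # Precision: every candidate token is matched to its most similar reference token.
    precision = sum(max(_cos(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: every reference token is matched to its most similar candidate token.
    recall = sum(max(_cos(r, c) for c in cand) for r in ref) / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]              # toy "generated" token vectors
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy "reference" token vectors

p_bert, r_bert, f_bert = bertscore_from_embeddings(candidate_tokens, reference_tokens)
print(p_bert, r_bert, f_bert)
```

Every candidate token has an exact counterpart in the reference, so precision is perfect; the third reference token has only a partial match in the candidate, which pulls recall and F1 below 1.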
Python — Generation Metrics Suite
```python
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util as st_util
from transformers import pipeline

# --- BERTScore ---
predictions = ["The return window is 30 days from purchase."]
references = ["Items can be returned within 30 days of the purchase date."]
P, R, F1 = bertscore(predictions, references, model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean():.4f}")  # e.g. 0.9341

# --- Semantic Similarity ---
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
emb_pred = embedder.encode(predictions, convert_to_tensor=True)
emb_ref = embedder.encode(references, convert_to_tensor=True)
sim = st_util.cos_sim(emb_pred, emb_ref)
print(f"Semantic Similarity: {sim.item():.4f}")  # e.g. 0.9627

# --- NLI Entailment (reference-free faithfulness) ---
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-small")
context = "Our return policy allows returns within 30 days of purchase."
answer = "You have 30 days to return any item."
# Pass premise and hypothesis as a text pair so the tokenizer inserts the separator itself
result = nli([{"text": context, "text_pair": answer}])[0]
print(result)  # e.g. {'label': 'ENTAILMENT', 'score': 0.9812}; label casing depends on the model config
# ENTAILMENT    = answer is grounded in context
# CONTRADICTION = answer contradicts context (hallucination)
# NEUTRAL       = answer is neither supported nor contradicted
```
10
LLM-as-Judge
// USING AI TO EVALUATE AI — RESPONSIBLY
LLM-as-judge is currently the highest-quality scalable evaluation method. A strong judge model (GPT-4o, Claude 3.5 Sonnet) evaluates responses against a structured rubric. The quality of your evaluation is determined by the quality of your rubric — vague rubrics produce noisy, unreliable scores.
LLM Judge Failure Modes (Know These)
Verbosity bias: Longer answers score higher regardless of quality
Self-preference: GPT-4 prefers GPT-4 outputs; Claude prefers Claude outputs
Position bias: In pairwise evals, Option A scores higher if listed first
Sycophancy: Agrees with user opinion expressed in the prompt
Format sensitivity: Score changes based on bullet vs. prose presentation
Mitigations (Best Practices)
Chain-of-thought: Ask judge to reason step-by-step before scoring
Calibration examples: Include 2–3 scored examples (few-shot) in system prompt
Score swapping: Run pairwise eval twice, swap A/B positions, take mean
Use different judge than generator — avoid self-evaluation
Validate against human labels — measure judge-human correlation (target ≥ 0.75 Spearman)
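The score-swapping mitigation listed above can be sketched as follows. The judge here is a toy stand-in that always favors whichever answer is listed first; in practice it would be a call to your judge model. The function names and the convention that the judge returns a preference in [0, 1] for the first-listed answer are assumptions made for this illustration.

```python
from typing import Callable


def debiased_pairwise_score(
    judge: Callable[[str, str, str], float],
    question: str,
    answer_a: str,
    answer_b: str,
) -> float:
    """Preference for answer_a in [0, 1], averaged over both presentation orders."""
    a_listed_first = judge(question, answer_a, answer_b)  # preference for A when A is first
    b_listed_first = judge(question, answer_b, answer_a)  # preference for B when B is first
    return (a_listed_first + (1.0 - b_listed_first)) / 2.0


def position_biased_judge(question: str, first: str, second: str) -> float:
    """Toy judge: ignores content and always prefers the first-listed answer."""
    return 0.7


score = debiased_pairwise_score(
    position_biased_judge,
    "What is the return policy?",
    "Returns are accepted within 30 days.",
    "You have a month to send items back.",
)
print(score)
```

Because the toy judge's preference comes purely from position, the two runs cancel out and the debiased score lands on a tie, which is the behavior the swap is designed to produce.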
Python — Production-Grade LLM Judge
```python
import json

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# High-quality rubric with chain-of-thought scoring
JUDGE_SYSTEM_PROMPT = """You are an expert evaluator for a customer support RAG system.
Evaluate the assistant's response on FAITHFULNESS: the degree to which
every factual claim in the response is directly supported by the provided context.

Scoring rubric:
5 - Every claim is explicitly supported by the context. No extrapolation.
4 - Nearly all claims supported; one minor inference beyond the context.
3 - Most claims supported; one unsupported factual claim present.
2 - Several claims not grounded in context; noticeable hallucination.
1 - Response contains significant fabricated information.

Calibration examples:
Context: "Shipping takes 3-5 business days."
Response: "Your order will arrive in 3 to 5 business days."
Score: 5 - Direct restatement, fully grounded.

Context: "Shipping takes 3-5 business days."
Response: "Your order ships from our Chicago warehouse in 3-5 days."
Score: 2 - 'Chicago warehouse' is fabricated.

You MUST respond with valid JSON only: {"reasoning": "...", "score": N}"""


def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"""
Question: {question}
Context: {context}
Response to evaluate: {answer}
Evaluate faithfulness. Respond with JSON only."""},
        ],
        temperature=0.0,  # reduces run-to-run variance; important for reproducibility
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Returns e.g.: {"reasoning": "The answer states '30 days' which matches...", "score": 5}
```
11
Evaluation Dataset Construction
// GARBAGE IN, GARBAGE OUT — YOUR EVAL SET IS YOUR SPEC
Your evaluation dataset defines what "quality" means for your system. A weak eval set gives you false confidence. A strong eval set is the most valuable artifact in your LLM development process — treat it like production code: version it, review it, maintain it.
Synthetic Generation (RAGAS TestsetGenerator)
Bootstrap your eval set from your own document corpus. RAGAS generates question/answer pairs from your documents using an LLM, covering simple factual, multi-context reasoning, and conditional question types. Rapid start — validate samples before trusting scores.
Real User Queries (Production Mining)
Sample real queries from production logs. These represent actual user intent distribution — far more valuable than synthetic. Label a random sample weekly. 200 labeled examples is a practical minimum for statistically meaningful metrics.
Python — Synthetic Eval Set via RAGAS
```python
# Note: this uses the evolution-based TestsetGenerator API from ragas 0.1.x;
# later ragas releases changed the test-set generation interface.
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator = TestsetGenerator.with_openai(
    generator_llm="gpt-4o",
    critic_llm="gpt-4o",  # critic filters low-quality questions
    embeddings="text-embedding-3-small",
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=200,
    distributions={
        simple: 0.5,          # 50% direct factual questions
        reasoning: 0.25,      # 25% multi-hop reasoning
        multi_context: 0.25,  # 25% require multiple chunks
    },
    with_debugging_logs=True,
)

df = testset.to_pandas()
df.to_csv("eval_dataset_v1.csv", index=False)
# Schema: question | contexts | ground_truth | evolution_type | metadata
# IMPORTANT: Human-review at least 10% of samples before using them as ground truth
```
Eval set contamination: Never tune prompts or hyperparameters against your primary eval set — that's the test set, not the validation set. Use a three-way split: dev set for iteration, validation set for go/no-go decisions on each PR, test set for quarterly benchmarks and external reporting only. Contamination makes your metrics meaningless.
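One way to keep the three-way split stable as the dataset grows is to assign each example by hashing its ID, so an example never migrates between splits when new rows are added. The sketch below is an assumption about how this could be done; the 60/20/20 proportions and the ID format are illustrative, not prescribed by the text.

```python
import hashlib


def assign_split(example_id: str, dev: float = 0.6, validation: float = 0.2) -> str:
    """Deterministically map an example ID to 'dev', 'validation' or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < dev:
        return "dev"
    if bucket < dev + validation:
        return "validation"
    return "test"


example_ids = [f"query-{i}" for i in range(1000)]  # hypothetical IDs
splits = {example_id: assign_split(example_id) for example_id in example_ids}
counts = {name: list(splits.values()).count(name) for name in ("dev", "validation", "test")}
print(counts)
```

Because the assignment depends only on the ID, re-running the script or appending new examples leaves every existing example in the split it already had, which keeps the test set free of anything that was ever used for tuning.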
12
Regression Testing
// PROVING IMPROVEMENT MATHEMATICALLY
A higher score on a new version doesn't prove improvement unless it's statistically significant. Random variation in LLM outputs and in the LLM judge means small deltas are noise. Use statistical tests to distinguish signal from variance.
Paired t-test — Is the score difference significant?
t = (x̄_diff) / (s_diff / √n) where x̄_diff = mean(scoreᵢ_new − scoreᵢ_old)
Use paired t-test (same queries, two versions) for continuous metrics (faithfulness, relevance). p < 0.05 with n ≥ 50 is a reasonable threshold. Use Wilcoxon signed-rank test for non-normal distributions.
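The paired t statistic above can be computed with the standard library alone. The five score pairs below are made-up values for illustration and far fewer than the n ≥ 50 recommended above; for a p-value you would look up the t distribution with n − 1 degrees of freedom (for example with scipy.stats.ttest_rel, if SciPy is available).

```python
import math
from statistics import mean, stdev


def paired_t_statistic(old_scores, new_scores):
    """t = mean(diff) / (stdev(diff) / sqrt(n)) over per-query score differences."""
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    n = len(diffs)
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("All differences are identical; the t statistic is undefined")
    return mean(diffs) / (spread / math.sqrt(n))


# Faithfulness per query for the same five queries under two versions (hypothetical)
old_version = [0.70, 0.80, 0.75, 0.60, 0.90]
new_version = [0.80, 0.85, 0.80, 0.70, 0.92]

t_stat = paired_t_statistic(old_version, new_version)
print(f"t = {t_stat:.2f} with {len(old_version) - 1} degrees of freedom")
```

Pairing matters: the test looks at the per-query differences, so query-to-query difficulty cancels out and only the consistent shift between versions remains.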
McNemar's Test — Pass/fail metric comparisons (binary)
χ² = (b − c)² / (b + c) where b = v1 pass & v2 fail count, c = v1 fail & v2 pass count
Use for binary metrics (passed threshold / failed threshold). Compares two classifiers on the same test set. χ² > 3.84 → p < 0.05 for 1 degree of freedom.
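McNemar's statistic needs only the two discordant counts. The sketch below uses the formula exactly as stated above (no continuity correction) and converts χ² to a p-value using the identity for one degree of freedom, p = erfc(√(χ²/2)). The counts are invented for illustration.

```python
import math


def mcnemar(b: int, c: int):
    """Return (chi2, p_value) for discordant counts b and c, with 1 degree of freedom."""
    if b + c == 0:
        return 0.0, 1.0  # no disagreements between the two versions
    chi2 = (b - c) ** 2 / (b + c)
    # A chi-square variable with 1 df is a squared standard normal,
    # so its upper tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# b: queries v1 passed but v2 failed; c: queries v1 failed but v2 passed (hypothetical)
chi2, p_value = mcnemar(b=5, c=20)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Queries on which both versions pass or both fail carry no information about which version is better, which is why only b and c enter the statistic.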
13
CI/CD Integration
// EVAL GATES THAT BLOCK REGRESSIONS FROM SHIPPING
Eval as a CI gate means every pull request that touches prompts, retriever configuration, embedding models, or chunking logic must pass a defined eval threshold before it can merge. This transforms eval from an ad-hoc activity into a quality guarantee.
YAML — GitHub Actions Eval Pipeline
```yaml
name: LLM Eval Gate
on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/retriever/**'
      - 'config/rag_config.yaml'
      - 'requirements.txt'
jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Install dependencies
        run: pip install -r requirements-eval.txt
      # Fast smoke tests: run on every PR (< 5 min, 30 queries)
      - name: DeepEval smoke test
        run: deepeval test run tests/eval/smoke/ --exit-on-first-failure
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Full regression suite: runs only if smoke passes (< 20 min, 200 queries)
      - name: Full RAGAS regression evaluation
        run: |
          python scripts/run_ragas_eval.py \
            --dataset eval_data/validation_set_v3.json \
            --baseline-version ${{ github.base_ref }} \
            --output results/ragas_${{ github.sha }}.json
      # Exits non-zero if any metric regresses beyond its threshold, which blocks the merge
      - name: Check regression thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results/ragas_${{ github.sha }}.json \
            --thresholds config/eval_thresholds.yaml
      - name: Comment results on PR
        uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results/ragas_${{ github.sha }}.json'));
            const body = formatResultsTable(results); // helper function
            github.rest.issues.createComment({ issue_number: context.issue.number, body });
```
YAML — Eval Thresholds Config
```yaml
# config/eval_thresholds.yaml

# HARD BLOCK: metric must not drop below absolute floor
absolute_floors:
  faithfulness: 0.80        # never ship below 80% faithfulness
  answer_relevancy: 0.75
  context_precision: 0.70
  context_recall: 0.70
  toxicity_rate: 0.02       # max 2% toxic outputs

# REGRESSION BLOCK: metric must not drop more than N% vs baseline
regression_tolerances:
  faithfulness: 0.03        # 3% regression tolerance
  answer_relevancy: 0.04
  context_precision: 0.05
  context_recall: 0.05
  p50_latency_ms: 200       # 200ms latency regression tolerance
  cost_per_query_usd: 0.002 # $0.002 cost regression tolerance

# NOTIFICATION ONLY (warn but don't block)
soft_warnings:
  ndcg_at_5: 0.01           # 1% retrieval degradation triggers Slack alert
```
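The gate logic that a script such as scripts/check_thresholds.py would apply can be sketched as below. This is a hypothetical illustration, not the actual script: it takes plain dictionaries instead of parsing YAML, and it only handles higher-is-better quality metrics. Lower-is-better entries such as toxicity_rate, latency and cost would need the comparisons reversed.

```python
def check_thresholds(current, baseline, absolute_floors, regression_tolerances):
    """Return a list of human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, floor in absolute_floors.items():
        if metric in current and current[metric] < floor:
            failures.append(f"{metric}: {current[metric]:.2f} is below floor {floor:.2f}")
    for metric, tolerance in regression_tolerances.items():
        if metric in current and metric in baseline:
            drop = baseline[metric] - current[metric]
            if drop > tolerance:
                failures.append(f"{metric}: dropped {drop:.2f} vs baseline (tolerance {tolerance:.2f})")
    return failures


# Hypothetical scores for the PR branch and the baseline branch
current = {"faithfulness": 0.82, "context_recall": 0.72}
baseline = {"faithfulness": 0.88, "context_recall": 0.74}
floors = {"faithfulness": 0.80, "context_recall": 0.70}
tolerances = {"faithfulness": 0.03, "context_recall": 0.05}

failures = check_thresholds(current, baseline, floors, tolerances)
for failure in failures:
    print("BLOCK:", failure)
exit_code = 1 if failures else 0  # a CI script would pass this to sys.exit
```

Here faithfulness stays above its absolute floor but falls further from the baseline than its tolerance allows, so the regression check blocks the merge even though the floor check passes.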
14
Dashboards & Alerting
// PRODUCTION QUALITY MONITORING
Offline eval catches regressions before release. Online monitoring catches degradation after release — distribution shift, new query types, seasonal patterns, upstream model changes. Both are required for a production LLM system.
Key Production Metrics to Track
Rolling faithfulness score (1-hour window) — primary quality signal
Null-answer rate — "I don't know" responses = retrieval failure
User negative feedback rate — thumbs-down as proxy signal
P95 latency per component — embed, retrieve, generate separately
Token cost per query — prompt + completion token spend
Context utilization — % of retrieved context referenced in answer
Alerting Thresholds
P0 — Immediate page: Faithfulness < 0.65 over 30-min window
P1 — Slack alert: Faithfulness drops 15% vs 7-day rolling baseline
P2 — Daily digest: Cost per query increases >20% vs prior week
Weekly review: RAGAS full eval run against validation set, trend chart
Monthly: Human eval sample of 50 queries, calibrate automated metrics
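The P0 rule above (faithfulness below 0.65 over a 30-minute window) can be sketched as a small rolling-window monitor. The class name, the minimum-sample guard and the one-score-per-minute feed are assumptions for this illustration; a production system would typically compute this in its metrics backend rather than in application code.

```python
from collections import deque


class RollingFaithfulnessMonitor:
    """Keeps scores from the last window_seconds and flags a P0 breach."""

    def __init__(self, window_seconds=1800.0, p0_floor=0.65, min_samples=20):
        self.window_seconds = window_seconds
        self.p0_floor = p0_floor
        self.min_samples = min_samples  # avoid paging on a handful of samples
        self.samples = deque()          # (timestamp, score) pairs, oldest first

    def record(self, timestamp, score):
        """Add a score and evict everything older than the window."""
        self.samples.append((timestamp, score))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def rolling_mean(self):
        if not self.samples:
            return None
        return sum(score for _, score in self.samples) / len(self.samples)

    def should_page(self):
        if len(self.samples) < self.min_samples:
            return False
        return self.rolling_mean() < self.p0_floor


monitor = RollingFaithfulnessMonitor()
for minute in range(30):
    monitor.record(timestamp=60.0 * minute, score=0.60)  # one degraded score per minute
page_now = monitor.should_page()
print(page_now)
```

The minimum-sample guard is a design choice: without it, a single low score right after a quiet period would trigger a page.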
Python — Phoenix OpenTelemetry Instrumentation
```python
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start Phoenix server (or connect to hosted Arize Phoenix)
session = px.launch_app()  # starts at http://localhost:6006

# Register OpenTelemetry tracer: instruments all LLM calls automatically
tracer_provider = register(
    project_name="product-qa-rag-prod",
    endpoint="http://localhost:4317",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From now on, every LLM call is automatically traced.
# No code changes to your application required.
#
# Phoenix captures:
#   - Full prompt + completion text
#   - Retrieved chunks with scores
#   - Latency per span (embed / retrieve / generate)
#   - Token counts and estimated cost
#   - User feedback (if you add feedback spans)

# Run eval on traced data: evaluate a production sample
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)

eval_model = OpenAIModel(model="gpt-4o")  # judge model used by the evaluators

# The evaluators expect specific columns (such as input, output, reference);
# reshape the spans dataframe accordingly before scoring.
ds = px.Client().get_spans_dataframe(project_name="product-qa-rag-prod")
hallucination_eval, qa_eval, relevance_eval = run_evals(
    dataframe=ds,
    evaluators=[
        HallucinationEvaluator(eval_model),
        QAEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
    ],
    provide_explanation=True,
)
```
15
Eval Maturity Model
// WHERE ARE YOU, AND WHAT TO BUILD NEXT
Eval infrastructure matures progressively. Organizations that try to build everything at once ship nothing. Start with manual evaluation, add offline automation, then layer production monitoring. Each stage enables the next.
| Stage | Name | Capabilities | Primary Tooling | KPI |
| --- | --- | --- | --- | --- |
| 01 | Ad-hoc | Manual review of outputs. "Looks good to me." No reproducibility. No tracking. | Spreadsheet, notebooks | Vibes |
| 02 | Benchmarked | Labeled eval set exists. Can run RAGAS offline. Track scores per version manually. | RAGAS, CSV, notebooks | RAGAS score per version |
| 03 | Automated | Eval runs in CI on every PR. Merge blocked on regression. Scores tracked in DB. | DeepEval + GH Actions + TruLens | Zero regressions shipped |
| 04 | Monitored | Production trace monitoring. Online eval on sampled traffic. Automated alerts. | Phoenix, LangSmith, Arize | MTTD < 30 min for quality drops |
| 05 | Self-improving | Automated failure analysis. Eval data mined from production disagreements. Feedback loop to training data curation. | | |