Eval Lab LLM Evaluation & Testing
DOC-LLMEVAL-2025 v1.0
LLM Evaluation Field Handbook

Prove Your RAG
Is Getting Better

A rigorous engineering guide to LLM evaluation — covering RAGAS, TruLens, DeepEval, the mathematics behind retrieval and generation metrics, and how to wire it all into CI/CD so quality regressions never ship.

"You can't improve what you can't measure. In LLM systems, this is a mathematical statement, not a platitude."
RAGAS TruLens DeepEval LLM-as-Judge RAG Eval NIST AI RMF
01

Why Evaluation Matters

// THE PROBLEM WITH VIBES-BASED DEVELOPMENT

LLM systems fail silently. A prompt change that improves one query type degrades five others. A new retriever version gets better recall but retrieves noisier chunks. Without systematic evaluation, you are flying blind — "it seems better" is not a shipping criterion.

The evaluation gap is the defining engineering challenge of LLM development. Traditional software has deterministic outputs you can assert against. LLMs produce probabilistic, open-ended text — requiring a new discipline that combines information retrieval theory, NLP metrics, and statistical testing.

Offline Evaluation

Run against a held-out labeled dataset before deployment. Catches regressions before users see them. Gives you a reproducible benchmark. Fast feedback loop in CI.

Online Evaluation

Monitor real production traffic with automated quality signals. Detects distribution shift, emerging failure modes, and per-user degradation that offline sets can't anticipate.

Human Evaluation

The ground truth. Use for calibrating automated metrics, adjudicating edge cases, and building preference datasets. Expensive — use strategically, not as a primary loop.

📐
The evaluation triad: Every LLM eval system must track three things simultaneously — quality (does it answer correctly?), safety (does it avoid harmful outputs?), and cost/latency (is it fast and cheap enough?). Optimizing one without tracking the others produces systems that degrade on the ignored dimensions.
02

Evaluation Taxonomy

// WHAT ARE YOU ACTUALLY MEASURING

Not all eval metrics are created equal. The choice of metric depends on what component you're evaluating, what failure mode you're guarding against, and whether you have ground-truth labels available.

| Metric Type | Requires Ground Truth? | What It Measures | Typical Tools |
| --- | --- | --- | --- |
| Reference-based | YES | Similarity of generated output to a gold-standard answer (BLEU, ROUGE, BERTScore, Exact Match) | NLTK, HuggingFace evaluate |
| Reference-free | NO | Internal quality signals (faithfulness, coherence, toxicity), evaluated without a gold answer | RAGAS, TruLens, DeepEval |
| LLM-as-Judge | OPTIONAL | A strong LLM (GPT-4, Claude) scores the response on defined rubrics. Flexible, scalable, expensive | OpenAI Evals, LangSmith, RAGAS |
| Retrieval-specific | YES | Precision, Recall, NDCG, MRR: whether the right chunks were retrieved | BEIR, RAGAS, custom IR pipelines |
| Task-specific | YES | Custom metrics for your domain: F1 on entity extraction, accuracy on MCQ, slot-fill rate for agents | DeepEval, custom, HELM |
| Safety / Red-team | NO | Jailbreak resistance, hallucination rate on known-false claims, PII leakage, bias metrics | Garak, PromptBench, AI Fairness 360 |
The BLEU trap: BLEU and ROUGE were designed for machine translation and summarization in the early 2000s. They measure n-gram overlap, which correlates poorly with quality for modern open-ended LLM outputs. A response that is semantically identical but uses different phrasing scores near zero. Do not use BLEU/ROUGE as primary metrics for RAG QA — use BERTScore, semantic similarity, or LLM-as-judge instead.
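To make the n-gram overlap problem concrete, here is a minimal sketch of the clipped n-gram precision at the core of BLEU. The `ngram_precision` helper and the three example sentences are illustrative assumptions, not part of any library; a real BLEU implementation adds a brevity penalty and combines several n-gram orders.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of `candidate` against `reference`."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    return overlap / total


reference = "Returns are accepted within 30 days of purchase"
paraphrase = "You can send items back up to a month after buying them"  # correct, reworded
wrong_copy = "Returns are accepted within 90 days of purchase"          # wrong fact, same wording

for name, text in [("paraphrase", paraphrase), ("near-copy with wrong fact", wrong_copy)]:
    p1 = ngram_precision(text, reference, 1)
    p2 = ngram_precision(text, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
```

The correct paraphrase shares no n-grams with the reference and scores zero, while the near-copy with the wrong return window scores high. That is the opposite of the ranking a RAG QA metric should produce.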
03

RAG Evaluation Primer

// THE FOUR FAILURE MODES OF A RAG PIPELINE

Retrieval-Augmented Generation has two independently-failing components — the retriever and the generator. A bug in either produces wrong answers but in very different ways. You need metrics that isolate each component, plus end-to-end metrics that evaluate the full pipeline.

Pipeline: User Query (input) → Retriever (chunk selection) → Retrieved Chunks (context) → Generator (LLM response) → Final Answer (output)
Retriever Failure Modes Upstream
  • Low precision: Retrieved chunks are irrelevant noise — LLM hallucinates to fill gaps
  • Low recall: Relevant chunks are missing — LLM can't answer even with perfect generation
  • Wrong ordering: Most relevant chunk is buried mid-context (e.g., position 3 of 5), a lost-in-the-middle failure
  • Chunk granularity: Chunks too large (dilution) or too small (missing context)
Generator Failure Modes Downstream
  • Hallucination: Claims facts not present in retrieved context
  • Context ignorance: Ignores retrieved context, answers from parametric memory
  • Unfaithfulness: Contradicts or misrepresents what the context says
  • Verbosity / conciseness: Answers are padded or missing key detail

The Core Evaluation Triangle

Faithfulness (F): % of claims in answer that are grounded in retrieved context
Answer Relevance (AR): How completely and precisely the answer addresses the query
Context Precision (CP): What fraction of retrieved chunks are actually relevant
Context Recall (CR): What fraction of needed info was present in retrieved context
04

RAGAS

// RETRIEVAL AUGMENTED GENERATION ASSESSMENT

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for largely reference-free RAG evaluation. It uses an LLM evaluator to score your RAG pipeline across four core metrics, most of which need no manually labeled gold answers, which makes it practical to run at scale.

Reference-free by design: RAGAS's key insight is that you can evaluate RAG quality using only three inputs — the question, the retrieved context, and the generated answer. No ground-truth required for Faithfulness and Answer Relevance. Context Recall does require a ground-truth answer as proxy for "all necessary information."

The Four RAGAS Metrics

Metric 1 — Faithfulness
Faithfulness = |verified_claims| / |total_claims_in_answer|
Decompose the answer into atomic claims. An LLM judge verifies each claim against the retrieved context. Score = fraction verifiable. Ranges [0, 1]. As a rule of thumb, treat scores below 0.8 as a sign of ungrounded claims worth investigating.
Metric 2 — Answer Relevance
AR = mean( cosine_similarity( embed(q), embed(q̂ᵢ) ) ) for i in 1..n
Generate n synthetic questions from the answer. Embed the original query and each synthetic question. Mean cosine similarity = how well the answer "covers" the query. Low score = off-topic or incomplete.
Metric 3 — Context Precision
CP@k = Σᵢ₌₁ᵏ (Precisionᵢ × relevantᵢ) / |relevant_chunks_in_top_k|
Weighted precision that rewards relevant chunks appearing at higher positions. Penalizes retrievers that bury the relevant chunk at position k. Equivalent to Average Precision in IR literature.
Metric 4 — Context Recall
CR = |ground_truth_sentences_attributed_to_context| / |ground_truth_sentences|
Decompose the ground-truth answer into sentences. LLM judge checks if each sentence can be attributed to at least one retrieved chunk. Requires a ground-truth answer.
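The four formulas above reduce to simple arithmetic once the LLM judge has produced its per-claim, per-chunk, and per-sentence verdicts. The following is a minimal sketch of that arithmetic only, assuming the verdicts and embeddings are already available as plain Python lists; the function names are illustrative and this is not the RAGAS implementation.

```python
import math


def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims the judge marked as supported by the context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_precision(chunk_relevant: list[bool]) -> float:
    """Average-precision-style score over ranked chunks (index 0 = top rank)."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # Precision@rank, counted only at relevant positions
    return total / hits if hits else 0.0


def context_recall(sentence_attributed: list[bool]) -> float:
    """Fraction of ground-truth sentences attributable to the retrieved context."""
    return sum(sentence_attributed) / len(sentence_attributed) if sentence_attributed else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevance(query_vec: list[float], synthetic_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the query and questions regenerated from the answer."""
    if not synthetic_vecs:
        return 0.0
    return sum(cosine(query_vec, v) for v in synthetic_vecs) / len(synthetic_vecs)


# Same two relevant chunks, different ranks: ordering changes the score.
print(context_precision([True, True, False]))   # relevant chunks ranked first
print(context_precision([False, True, True]))   # relevant chunks pushed down
print(faithfulness([True, True, True, False]))  # 3 of 4 claims grounded
```

Separating the arithmetic from the judging step also makes it easy to unit-test the scoring logic without calling an LLM.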

Installation & Quick Start

Python — RAGAS Quick Start
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # judge LLM and embedding wrappers
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,  # requires ground truth
    answer_similarity,   # semantic similarity to ground truth
)

# Column names and keyword arguments follow the ragas 0.1-style API;
# newer releases rename some of them, so check your installed version.

# Build your evaluation dataset
# Each row: question, answer, contexts (list of chunks), ground_truth
eval_data = {
    "question": ["What is the return policy?"],              # ...more rows
    "answer": ["Returns accepted within 30 days..."],        # ...
    "contexts": [["Our return policy states...", "Items must be..."]],  # ...
    "ground_truth": ["Returns are accepted within 30 days of purchase"],  # ...
}
dataset = Dataset.from_dict(eval_data)

# Run evaluation with an explicitly chosen judge LLM and embedding model
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),  # or Claude, Gemini
    embeddings=OpenAIEmbeddings(),
    raise_exceptions=False,
    show_progress=True,
)

# Results as pandas DataFrame
df = result.to_pandas()
print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.82, 'context_recall': 0.79}

RAGAS Score Interpretation

| Metric | ≥ 0.9 (Excellent) | 0.7–0.9 (Acceptable) | < 0.7 (Needs Work) | Primary Root Cause |
| --- | --- | --- | --- | --- |
| Faithfulness | Minimal hallucination | Some ungrounded claims | High hallucination rate | Weak system prompt, too-short context, aggressive summarization |
| Answer Relevancy | On-topic, complete | Partially addressed | Off-topic / vague | Poor retrieval, bad prompt structure, over-verbose LLM |
| Context Precision | Retriever very precise | Some noise in context | Too much irrelevant noise | Low similarity threshold, bad chunking, wrong embedding model |
| Context Recall | All info retrieved | Most info present | Key info missing | top-k too small, chunks too large, suboptimal embedding |
05

TruLens

// FEEDBACK FUNCTIONS & THE RAG TRIAD

TruLens (by TruEra) takes a different architectural approach from RAGAS — it instruments your LLM application at the function level, logging every LLM call, and attaches feedback functions that score interactions. This makes it ideal for continuous monitoring of production systems.

TruLens RAG Triad Core Model

TruLens maps its feedback functions explicitly to three relationships in a RAG pipeline:

  • Answer Relevance: Answer ↔ Question — does the answer address the question?
  • Context Relevance: Context ↔ Question — does the retrieved context match what was asked?
  • Groundedness: Answer ↔ Context — is the answer grounded in (not hallucinated from) the context?
Continuous Recording Production

TruLens wraps your chain/app in a TruChain or TruLlama recorder. Every call is stored in a local SQLite (or cloud) database with full input/output traces, latency, token cost, and computed feedback scores. Inspect in the TruLens dashboard or query programmatically.

Python — TruLens RAG Triad Setup
import numpy as np
from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
from trulens.apps.langchain import TruChain

session = TruSession()                        # stores results in local SQLite
provider = TruOpenAI(model_engine="gpt-4o")   # LLM-as-judge provider

# Selector for the retrieved chunks; select_context needs the app it inspects
context = TruChain.select_context(your_langchain_rag_chain)

# -- Define the three feedback functions --
# 1. Context Relevance: each chunk vs the query
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()           # question
    .on(context)          # each retrieved chunk
    .aggregate(np.mean)   # mean over all chunks
)

# 2. Groundedness: answer claims vs context
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # all chunks as list
    .on_output()            # final answer
)

# 3. Answer Relevance: answer vs question
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

# -- Wrap your LangChain RAG chain --
tru_recorder = TruChain(
    your_langchain_rag_chain,   # your existing chain
    app_name="product-qa-rag",
    app_version="v2.3.1",       # version tag for regression tracking
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

# -- Run evaluation --
# eval_questions is your own list of evaluation questions, defined elsewhere
with tru_recorder as recording:
    for question in eval_questions:
        response = your_langchain_rag_chain.invoke({"query": question})

# View results
session.get_leaderboard()   # compare across app_version tags
session.run_dashboard()     # launch local Streamlit dashboard

TruLens Leaderboard — Version Comparison

The TruLens leaderboard pattern is the key mechanism for proving improvement. Tag each experiment with a version string and compare across runs — the leaderboard shows all three RAG triad scores plus latency and cost per version.

Python — Programmatic Regression Check
# Compare v2.3.1 against the v2.3.0 baseline
leaderboard = session.get_leaderboard(app_ids=["product-qa-rag"])
print(leaderboard[["app_version", "Groundedness", "Answer Relevance",
                   "Context Relevance", "latency", "total_cost"]])
# app_version  Groundedness  Answer Relevance  Context Relevance  latency  total_cost
# v2.3.0       0.74          0.82              0.71               1.23s    $0.0042
# v2.3.1       0.91          0.88              0.85               1.31s    $0.0045   <- improved

# Fail CI if any metric drops more than 0.05 (absolute score points) below baseline
baseline = leaderboard[leaderboard["app_version"] == "v2.3.0"].iloc[0]
current = leaderboard[leaderboard["app_version"] == "v2.3.1"].iloc[0]
threshold = 0.05

for metric in ["Groundedness", "Answer Relevance", "Context Relevance"]:
    delta = current[metric] - baseline[metric]
    if delta < -threshold:
        raise AssertionError(f"{metric} regressed by {delta:.2%}, blocking merge")
06

DeepEval

// PYTEST FOR LLM SYSTEMS

DeepEval takes a testing-first philosophy — it integrates directly with pytest and provides a rich library of LLM-specific metrics as assert-able test cases. Think of it as "unit tests for your prompts." It covers RAG metrics, conversational metrics, agentic task metrics, and safety checks in a single framework.

G-Eval Flexible

Define custom evaluation criteria in plain English. The LLM judge generates an evaluation chain-of-thought, then produces a score, which DeepEval reports on a 0–1 scale so it can be compared against a threshold. Best for domain-specific quality attributes without coding a custom metric.

DAG Metrics Agentic

Evaluate multi-step agent tasks using a Directed Acyclic Graph of expected tool calls and intermediate states. Measures task completion rate, trajectory correctness, and efficiency.

Conversational Multi-turn

Role-consistency, knowledge retention across turns, and conversation completeness for chatbot applications. Goes beyond single-turn RAG evaluation.

Python — DeepEval Pytest Integration
# test_rag_quality.py (run with: deepeval test run test_rag_quality.py)
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    HallucinationMetric,
    ToxicityMetric,
    GEval,
)

# load_test_cases and rag_pipeline are placeholders for your own loader and pipeline;
# import them from your project.


@pytest.mark.parametrize("test_case", load_test_cases("eval_dataset.json"))
def test_rag_pipeline(test_case):
    # Run your RAG pipeline on the test case
    result = rag_pipeline.query(test_case.input)

    llm_test_case = LLMTestCase(
        input=test_case.input,
        actual_output=result.answer,
        expected_output=test_case.expected,        # ground truth (optional)
        retrieval_context=result.retrieved_chunks,
        context=test_case.reference_documents,
    )

    assert_test(llm_test_case, [
        FaithfulnessMetric(threshold=0.85, model="gpt-4o"),
        AnswerRelevancyMetric(threshold=0.80),
        ContextualPrecisionMetric(threshold=0.75),
        ContextualRecallMetric(threshold=0.75),
        HallucinationMetric(threshold=0.15),  # hallucination RATE: lower is better
        ToxicityMetric(threshold=0.05),
    ])


# Custom G-Eval: domain-specific quality criterion
conciseness_metric = GEval(
    name="Conciseness",
    model="gpt-4o",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check that the answer does not contain unnecessary padding or repetition",
        "Verify the answer directly addresses the question without preamble",
        "Confirm key information is present without verbosity",
    ],
    threshold=0.7,
)


def test_conciseness():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output=rag_pipeline.query("What is the capital of France?").answer,
    )
    assert_test(test_case, [conciseness_metric])
DeepEval CLI: deepeval test run integrates with CI and outputs structured JSON results. Because it runs on pytest, a failing test case makes the command exit non-zero, which is what lets it act as a PR gate. Use deepeval login to sync results to the Confident AI cloud dashboard for trend tracking. The --exit-on-first-failure flag stops the run at the first failing test case, which keeps smoke tests fast.
07

HELM, OpenAI Evals, LangSmith

// CHOOSING THE RIGHT TOOL FOR THE JOB
| Framework | Best For | Approach | Self-hosted? | RAG Native? |
| --- | --- | --- | --- | --- |
| RAGAS | RAG evaluation, fast offline benchmarking | Reference-free LLM-as-judge | YES | YES |
| TruLens | Production monitoring, experiment tracking | Feedback functions + recording | YES | YES |
| DeepEval | Test-driven development, CI integration | Pytest plugin, 30+ metrics | YES | YES |
| HELM | Multi-model benchmarking, academic comparison | Task-based standardized scenarios | YES | PARTIAL |
| OpenAI Evals | Custom task evals, fine-tune validation | YAML-based eval definitions, completions API | PARTIAL | PARTIAL |
| LangSmith | Full LLM ops: tracing + eval + monitoring | Cloud trace store + eval runners | NO | YES |
| Phoenix (Arize) | ML observability, embedding drift | OpenTelemetry traces + eval functions | YES | YES |
| PromptFoo | Prompt regression testing, red-teaming | Config-driven comparisons across providers | YES | PARTIAL |
📐
Recommended stack: Use DeepEval for CI/CD test gates (fast, pytest-native), RAGAS for periodic benchmark runs against your labeled evaluation set, and Phoenix or LangSmith for production trace monitoring. You don't need to pick one — they complement each other.
08

Retrieval Metrics Deep-Dive

// THE MATHEMATICS OF FINDING THE RIGHT CHUNK

Before evaluating the generator, validate the retriever in isolation. These metrics come from classical Information Retrieval and have precise mathematical definitions. Use the BEIR benchmark for zero-shot evaluation of your embedding model.

Precision@k — What fraction of top-k retrieved docs are relevant?
Precision@k = |relevant_docs ∩ retrieved_top_k| / k
Use when the cost of irrelevant context is high (LLM gets confused by noise). A retriever with P@5=0.6 means 3 of 5 retrieved chunks are relevant.
Recall@k — What fraction of all relevant docs are in top-k?
Recall@k = |relevant_docs ∩ retrieved_top_k| / |all_relevant_docs|
Use when missing relevant context is catastrophic (can't answer without it). For QA, recall is typically more important than precision — retrieve more, let the LLM filter.
Mean Reciprocal Rank — How high is the FIRST relevant doc?
MRR = (1/|Q|) × Σᵢ (1 / rankᵢ)
rankᵢ = position of first relevant document for query i. MRR penalizes systems that bury the answer at position 4 or 5. Critical for "lost-in-the-middle" detection.
NDCG@k — Normalized Discounted Cumulative Gain (graded relevance)
DCG@k = Σᵢ₌₁ᵏ relevanceᵢ / log₂(i + 1)
NDCG@k = DCG@k / IDCG@k
IDCG = ideal DCG (perfect ranking). NDCG supports graded relevance (0=irrelevant, 1=somewhat, 2=highly relevant). The log₂ denominator discounts lower positions — getting relevant docs to rank 1 matters more than rank 5.
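The four definitions above fit in a few lines of plain Python. This is a minimal sketch for sanity-checking a retriever on a handful of queries; the document IDs, relevance grades, and function names are made up for illustration.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of 1/rank of the first relevant document; a query with no hit contributes 0."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0


def ndcg_at_k(retrieved: list[str], grades: dict[str, int], k: int) -> float:
    """NDCG with graded relevance; documents missing from `grades` count as 0."""
    dcg = sum(grades.get(doc, 0) / math.log2(i + 1)
              for i, doc in enumerate(retrieved[:k], start=1))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


retrieved = ["d3", "d1", "d7", "d2", "d9"]   # ranked retriever output
grades = {"d1": 2, "d2": 1, "d4": 2}          # graded relevance labels
relevant = set(grades)

print("P@5   :", precision_at_k(retrieved, relevant, 5))
print("R@5   :", round(recall_at_k(retrieved, relevant, 5), 3))
print("MRR   :", mean_reciprocal_rank([(retrieved, relevant)]))
print("NDCG@5:", round(ndcg_at_k(retrieved, grades, 5), 3))
```

Here the first relevant document sits at rank 2 and one relevant document (d4) is never retrieved, so precision, recall, and NDCG all fall well below 1 for different reasons.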

Retriever Benchmarking with BEIR

Python — BEIR Zero-shot Eval
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load a BEIR dataset (e.g., FIQA for financial QA, NFCorpus for medical)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip"
data_path = util.download_and_unzip(url, "datasets")
# GenericDataLoader is instantiated with the data folder, then loads one split
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Swap in your embedding model
model = DRES(models.SentenceBERT("BAAI/bge-large-en-v1.5"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim", k_values=[1, 3, 5, 10])

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

print("NDCG@10:", ndcg["NDCG@10"])     # 0.412 (BGE-large on FIQA)
print("Recall@5:", recall["Recall@5"])  # 0.589
09

Generation Metrics Deep-Dive

// BEYOND BLEU — SEMANTIC SIMILARITY AND NLI

Generation quality metrics fall into two families: lexical (token overlap — fast, cheap, poor correlation with quality) and semantic (embedding/model-based — expensive, strong correlation). Use semantic metrics for production evaluation.

| Metric | Method | Requires GT? | Correlation w/ Human | Cost |
| --- | --- | --- | --- | --- |
| BLEU | n-gram precision vs reference | YES | Low (QA) | Free |
| ROUGE-L | Longest common subsequence | YES | Medium (summ.) | Free |
| BERTScore | Contextual embedding similarity (F1) | YES | High | Low (local model) |
| Semantic Similarity | Cosine sim of sentence embeddings | YES | High | Low (local model) |
| NLI Entailment | NLI model: does context entail answer? | NO | High | Low (local model) |
| LLM-as-Judge | GPT-4/Claude rates on rubric | OPTIONAL | Very High | High ($) |
BERTScore — token-level precision, recall, F1 using BERT embeddings
P_BERT = (1/|ŷ|) × Σ_{ŷₜ ∈ ŷ} max_{yₜ ∈ y} cosine(ŷₜ, yₜ)
R_BERT = (1/|y|) × Σ_{yₜ ∈ y} max_{ŷₜ ∈ ŷ} cosine(yₜ, ŷₜ)
F_BERT = 2 × (P_BERT × R_BERT) / (P_BERT + R_BERT)
ŷ = generated tokens, y = reference tokens. Each token is matched to its closest semantic counterpart in the other sequence. F_BERT ≥ 0.85 on DeBERTa typically indicates high quality.
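The greedy matching in these formulas is easier to see on toy data. The sketch below applies the P/R/F definitions to hand-written 2-D vectors standing in for token embeddings. The vectors and the `greedy_bertscore` name are assumptions for illustration; the real metric uses contextual embeddings from a transformer and, optionally, idf weighting and baseline rescaling.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_bertscore(cand_vecs: list[list[float]],
                     ref_vecs: list[list[float]]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) using greedy max-cosine token matching."""
    # Precision: each generated token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its closest generated token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy "embeddings": the candidate covers two of the three reference tokens exactly.
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]
reference_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p, r, f1 = greedy_bertscore(candidate_tokens, reference_tokens)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Precision is perfect because every generated token has an exact counterpart, while recall drops because the third reference token is only partially covered. Reporting F1 alone would hide that asymmetry.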
Python — Generation Metrics Suite
from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util as st_util
from transformers import pipeline

# -- BERTScore --
predictions = ["The return window is 30 days from purchase."]
references = ["Items can be returned within 30 days of the purchase date."]

P, R, F1 = bertscore(predictions, references, model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean():.4f}")  # e.g. 0.9341

# -- Semantic Similarity --
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
emb_pred = embedder.encode(predictions, convert_to_tensor=True)
emb_ref = embedder.encode(references, convert_to_tensor=True)
sim = st_util.cos_sim(emb_pred, emb_ref)
print(f"Semantic Similarity: {sim.item():.4f}")  # e.g. 0.9627

# -- NLI Entailment (reference-free faithfulness) --
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-small")
context = "Our return policy allows returns within 30 days of purchase."
answer = "You have 30 days to return any item."

# Pass premise and hypothesis as a text pair so the tokenizer inserts the
# separator itself, instead of embedding a literal "[SEP]" in one string.
result = nli([{"text": context, "text_pair": answer}])[0]
print(result)  # e.g. {'label': 'entailment', 'score': 0.9812}; label casing depends on the model config

# entailment    = answer is grounded in context
# contradiction = answer contradicts context (hallucination)
# neutral       = answer is neither supported nor contradicted
10

LLM-as-Judge

// USING AI TO EVALUATE AI — RESPONSIBLY

LLM-as-judge is currently the highest-quality scalable evaluation method. A strong judge model (GPT-4o, Claude 3.5 Sonnet) evaluates responses against a structured rubric. The quality of your evaluation is determined by the quality of your rubric — vague rubrics produce noisy, unreliable scores.

LLM Judge Failure Modes Know These
  • Verbosity bias: Longer answers score higher regardless of quality
  • Self-preference: GPT-4 prefers GPT-4 outputs; Claude prefers Claude outputs
  • Position bias: In pairwise evals, Option A scores higher if listed first
  • Sycophancy: Agrees with user opinion expressed in the prompt
  • Format sensitivity: Score changes based on bullet vs. prose presentation
Mitigations Best Practices
  • Chain-of-thought: Ask judge to reason step-by-step before scoring
  • Calibration examples: Include 2–3 scored examples (few-shot) in system prompt
  • Score swapping: Run pairwise eval twice, swap A/B positions, take mean
  • Use different judge than generator — avoid self-evaluation
  • Validate against human labels — measure judge-human correlation (target ≥ 0.75 Spearman)
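The position-swapping mitigation above can be sketched independently of any particular judge. In the example below, `Judge` is any callable that returns the probability that the first-listed answer is better. The `biased_stub_judge` is a hypothetical stand-in for an LLM call that deliberately favors whichever answer is listed first, so the effect of swapping is visible.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns a value in
# [0, 1]: its preference that the FIRST-listed answer is the better one.
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Preference for answer A, averaged over both presentation orders."""
    forward = judge(question, answer_a, answer_b)          # A listed first
    backward = 1.0 - judge(question, answer_b, answer_a)   # B listed first, flipped to A's view
    return (forward + backward) / 2.0


def biased_stub_judge(question: str, first: str, second: str) -> float:
    """Hypothetical judge: prefers longer answers, plus a +0.2 bonus for the first slot."""
    if len(first) > len(second):
        base = 0.7
    elif len(first) < len(second):
        base = 0.3
    else:
        base = 0.5
    return min(1.0, base + 0.2)


q = "What is the return window?"
a, b = "30 days.", "One month"[:8]   # two answers of equal length

print("single pass :", biased_stub_judge(q, a, b))             # inflated by position bias
print("swapped mean:", debiased_preference(biased_stub_judge, q, a, b))
```

A constant first-slot bonus cancels exactly when the two orders are averaged. Real judges are noisier, so in practice it can also help to flag pairs where the two orders disagree and send them to human review.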
Python — Production-Grade LLM Judge
import json

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# High-quality rubric with chain-of-thought scoring
JUDGE_SYSTEM_PROMPT = """You are an expert evaluator for a customer support RAG system.
Evaluate the assistant's response on FAITHFULNESS: the degree to which every
factual claim in the response is directly supported by the provided context.

Scoring rubric:
5 - Every claim is explicitly supported by the context. No extrapolation.
4 - Nearly all claims supported; one minor inference beyond the context.
3 - Most claims supported; one unsupported factual claim present.
2 - Several claims not grounded in context; noticeable hallucination.
1 - Response contains significant fabricated information.

Calibration examples:
Context: "Shipping takes 3-5 business days."
Response: "Your order will arrive in 3 to 5 business days."
Score: 5 - Direct restatement, fully grounded.

Context: "Shipping takes 3-5 business days."
Response: "Your order ships from our Chicago warehouse in 3-5 days."
Score: 2 - 'Chicago warehouse' is fabricated.

You MUST respond with valid JSON only: {"reasoning": "...", "score": N}"""


def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"""
Question: {question}
Context: {context}
Response to evaluate: {answer}

Evaluate faithfulness. Respond with JSON only."""},
        ],
        temperature=0.0,  # minimizes sampling variance, which matters for reproducibility
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Returns: {"reasoning": "The answer states '30 days' which matches...", "score": 5}
11

Evaluation Dataset Construction

// GARBAGE IN, GARBAGE OUT — YOUR EVAL SET IS YOUR SPEC

Your evaluation dataset defines what "quality" means for your system. A weak eval set gives you false confidence. A strong eval set is the most valuable artifact in your LLM development process — treat it like production code: version it, review it, maintain it.

Synthetic Generation (RAGAS TestsetGenerator)

Bootstrap your eval set from your own document corpus. RAGAS generates question/answer pairs from your documents using an LLM, covering simple factual, multi-context reasoning, and conditional question types. Rapid start — validate samples before trusting scores.

Real User Queries (Production Mining)

Sample real queries from production logs. These represent actual user intent distribution — far more valuable than synthetic. Label a random sample weekly. 200 labeled examples is a practical minimum for statistically meaningful metrics.

Python — Synthetic Eval Set via RAGAS
# Uses the ragas 0.1-style testset API (evolutions); newer releases restructure
# this module, so check your installed version.
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator = TestsetGenerator.with_openai(
    generator_llm="gpt-4o",
    critic_llm="gpt-4o",  # critic filters low-quality questions
    embeddings="text-embedding-3-small",
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=200,
    distributions={
        simple: 0.5,          # 50% direct factual questions
        reasoning: 0.25,      # 25% multi-hop reasoning
        multi_context: 0.25,  # 25% require multiple chunks
    },
    with_debugging_logs=True,
)

df = testset.to_pandas()
df.to_csv("eval_dataset_v1.csv", index=False)
# Schema: question | contexts | ground_truth | evolution_type | metadata
# IMPORTANT: Human review minimum 10% of samples before using as ground truth
Eval set contamination: Never tune prompts or hyperparameters against your primary eval set — that's the test set, not the validation set. Use a three-way split: dev set for iteration, validation set for go/no-go decisions on each PR, test set for quarterly benchmarks and external reporting only. Contamination makes your metrics meaningless.
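One way to keep the three splits from leaking into each other is to assign each example by hashing a stable ID, so the assignment never changes as the dataset grows. The sketch below assumes every example has a unique string ID. The 60/20/20 proportions and the `assign_split` name are illustrative choices, not a recommendation from any framework.

```python
import hashlib


def assign_split(example_id: str, dev: float = 0.6, validation: float = 0.2) -> str:
    """Deterministically map an example ID to 'dev', 'validation', or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < dev:
        return "dev"
    if bucket < dev + validation:
        return "validation"
    return "test"


example_ids = [f"q-{i:04d}" for i in range(1000)]
splits = {eid: assign_split(eid) for eid in example_ids}

counts = {name: sum(1 for s in splits.values() if s == name)
          for name in ("dev", "validation", "test")}
print(counts)
```

Because the split depends only on the ID, adding new examples later never moves an existing example from the test set into the dev set, which is the failure a random reshuffle would risk.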
12

Regression Testing

// PROVING IMPROVEMENT MATHEMATICALLY

A higher score on a new version doesn't prove improvement unless it's statistically significant. Random variation in LLM outputs and in the LLM judge means small deltas are noise. Use statistical tests to distinguish signal from variance.

Paired t-test — Is the score difference significant?
t = (x̄_diff) / (s_diff / √n) where x̄_diff = mean(scoreᵢ_new − scoreᵢ_old)
Use paired t-test (same queries, two versions) for continuous metrics (faithfulness, relevance). p < 0.05 with n ≥ 50 is a reasonable threshold. Use Wilcoxon signed-rank test for non-normal distributions.
McNemar's Test — Pass/fail metric comparisons (binary)
χ² = (b − c)² / (b + c) where b = v1 pass & v2 fail count, c = v1 fail & v2 pass count
Use for binary metrics (passed threshold / failed threshold). Compares two classifiers on the same test set. χ² > 3.84 → p < 0.05 for 1 degree of freedom.
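The McNemar statistic needs only the two discordant counts, so it is straightforward to compute by hand. The sketch below follows the formula above (without a continuity correction) and compares against the 3.84 critical value. The pass/fail vectors are made-up illustration data, and when b + c is small an exact binomial test is usually the safer choice.

```python
CRITICAL_CHI2_P05 = 3.84  # chi-square critical value for p < 0.05, 1 degree of freedom


def mcnemar_chi2(passed_old: list[bool], passed_new: list[bool]) -> tuple[int, int, float]:
    """Return (b, c, chi2) for paired pass/fail results on the same queries."""
    if len(passed_old) != len(passed_new):
        raise ValueError("Both versions must be scored on the same queries")
    b = sum(1 for old, new in zip(passed_old, passed_new) if old and not new)  # v1 pass, v2 fail
    c = sum(1 for old, new in zip(passed_old, passed_new) if not old and new)  # v1 fail, v2 pass
    chi2 = (b - c) ** 2 / (b + c) if (b + c) else 0.0
    return b, c, chi2


# 50 queries: 30 pass in both versions, 2 newly broken, 12 newly fixed, 6 fail in both.
old_results = [True] * 30 + [True] * 2 + [False] * 12 + [False] * 6
new_results = [True] * 30 + [False] * 2 + [True] * 12 + [False] * 6

b, c, chi2 = mcnemar_chi2(old_results, new_results)
significant = chi2 > CRITICAL_CHI2_P05
print(f"b={b} c={c} chi2={chi2:.2f} significant={significant}")
```

Queries that pass or fail in both versions do not enter the statistic at all; only the ones that changed outcome carry evidence about which version is better.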
Python — Statistical Regression Test
from scipy import stats
import numpy as np
import pandas as pd


def check_regression(
    scores_baseline: list[float],
    scores_new: list[float],
    metric_name: str,
    alpha: float = 0.05,
    regression_threshold: float = 0.03,   # alert if delta < -3%
    improvement_threshold: float = 0.03,  # celebrate if delta > +3%
) -> dict:
    # Paired t-test (same questions, both versions)
    t_stat, p_value = stats.ttest_rel(scores_new, scores_baseline)
    delta = np.mean(scores_new) - np.mean(scores_baseline)
    significant = p_value < alpha

    result = {
        "metric": metric_name,
        "baseline": np.mean(scores_baseline),
        "new": np.mean(scores_new),
        "delta": delta,
        "p_value": p_value,
        "significant": significant,
        "verdict": "UNKNOWN",
    }

    if significant and delta < -regression_threshold:
        result["verdict"] = "❌ REGRESSION — block merge"
    elif significant and delta > improvement_threshold:
        result["verdict"] = "✅ IMPROVEMENT — safe to ship"
    elif not significant:
        result["verdict"] = "⚪ NO SIGNIFICANT CHANGE"
    else:
        result["verdict"] = "🟡 MARGINAL — review manually"
    return result

# Example output:
# metric: faithfulness | baseline: 0.841 | new: 0.891 | delta: +0.050
# p_value: 0.0031 | significant: True | verdict: ✅ IMPROVEMENT
13

CI/CD Integration

// EVAL GATES THAT BLOCK REGRESSIONS FROM SHIPPING

Eval as a CI gate means every pull request that touches prompts, retriever configuration, embedding models, or chunking logic must pass a defined eval threshold before it can merge. This transforms eval from an ad-hoc activity into a quality guarantee.

YAML — GitHub Actions Eval Pipeline
name: LLM Eval Gate

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/retriever/**'
      - 'config/rag_config.yaml'
      - 'requirements.txt'

jobs:
  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    env:
      # Set at job level so both the DeepEval and RAGAS judge calls can authenticate
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }

      - name: Install dependencies
        run: pip install -r requirements-eval.txt

      # Fast smoke tests: run on every PR (< 5 min, 30 queries)
      - name: DeepEval smoke test
        run: deepeval test run tests/eval/smoke/ --exit-on-first-failure

      # Full regression suite: runs only if smoke passes (< 20 min, 200 queries)
      - name: Full RAGAS regression evaluation
        run: |
          python scripts/run_ragas_eval.py \
            --dataset eval_data/validation_set_v3.json \
            --baseline-version ${{ github.base_ref }} \
            --output results/ragas_${{ github.sha }}.json

      # Exits non-zero if any metric regresses > threshold, which blocks the merge
      - name: Check regression thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results/ragas_${{ github.sha }}.json \
            --thresholds config/eval_thresholds.yaml

      - name: Comment results on PR
        uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results/ragas_${{ github.sha }}.json', 'utf8'));
            const body = formatResultsTable(results); // helper function you define in this script
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body,
            });
YAML — Eval Thresholds Config
# config/eval_thresholds.yaml

# HARD BLOCK: metric must not drop below absolute floor
absolute_floors:
  faithfulness: 0.80        # never ship below 80% faithfulness
  answer_relevancy: 0.75
  context_precision: 0.70
  context_recall: 0.70
  toxicity_rate: 0.02       # max 2% toxic outputs

# REGRESSION BLOCK: metric must not drop more than N% vs baseline
regression_tolerances:
  faithfulness: 0.03        # 3% regression tolerance
  answer_relevancy: 0.04
  context_precision: 0.05
  context_recall: 0.05
  p50_latency_ms: 200       # 200ms latency regression tolerance
  cost_per_query_usd: 0.002 # $0.002 cost regression tolerance

# NOTIFICATION ONLY (warn but don't block)
soft_warnings:
  ndcg_at_5: 0.01           # 1% retrieval degradation triggers Slack alert
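The pipeline above calls a scripts/check_thresholds.py that is not shown. The following is a minimal sketch of the logic such a script might contain, not the actual script. It assumes the thresholds have already been loaded into dictionaries, covers only the four higher-is-better quality metrics, and treats each tolerance as an absolute drop in score points; toxicity, latency, and cost run in the opposite direction and would need their own comparison.

```python
ABSOLUTE_FLOORS = {
    "faithfulness": 0.80,
    "answer_relevancy": 0.75,
    "context_precision": 0.70,
    "context_recall": 0.70,
}
REGRESSION_TOLERANCES = {
    "faithfulness": 0.03,
    "answer_relevancy": 0.04,
    "context_precision": 0.05,
    "context_recall": 0.05,
}


def check_thresholds(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for metric, floor in ABSOLUTE_FLOORS.items():
        if current[metric] < floor:
            failures.append(f"{metric} = {current[metric]:.2f} is below the floor of {floor:.2f}")
    for metric, tolerance in REGRESSION_TOLERANCES.items():
        drop = baseline[metric] - current[metric]
        if drop > tolerance:
            failures.append(f"{metric} dropped {drop:.2f} vs baseline (tolerance {tolerance:.2f})")
    return failures


baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.85,
                   "context_precision": 0.80, "context_recall": 0.78}
current_scores = {"faithfulness": 0.86, "answer_relevancy": 0.86,
                  "context_precision": 0.69, "context_recall": 0.77}

failures = check_thresholds(current_scores, baseline_scores)
for line in failures:
    print("FAIL:", line)
# In CI, a script like this would exit non-zero when `failures` is non-empty.
```

Note that faithfulness fails here even though 0.86 is above its floor: the two checks are independent, so a metric can clear the floor and still be blocked for regressing against the baseline.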
14

Dashboards & Alerting

// PRODUCTION QUALITY MONITORING

Offline eval catches regressions before release. Online monitoring catches degradation after release — distribution shift, new query types, seasonal patterns, upstream model changes. Both are required for a production LLM system.

Key Production Metrics to Track
  • Rolling faithfulness score (1-hour window) — primary quality signal
  • Null-answer rate — "I don't know" responses = retrieval failure
  • User negative feedback rate — thumbs-down as proxy signal
  • P95 latency per component — embed, retrieve, generate separately
  • Token cost per query — prompt + completion token spend
  • Context utilization — % of retrieved context referenced in answer
Alerting Thresholds
  • P0 — Immediate page: Faithfulness < 0.65 over 30-min window
  • P1 — Slack alert: Faithfulness drops 15% vs 7-day rolling baseline
  • P2 — Daily digest: Cost per query increases >20% vs prior week
  • Weekly review: RAGAS full eval run against validation set, trend chart
  • Monthly: Human eval sample of 50 queries, calibrate automated metrics
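The P0 rule above (faithfulness below 0.65 over a 30-minute window) can be expressed as a small rolling-window check. The sketch below is one possible shape for it, assuming scores arrive in timestamp order. The class name and the `min_samples` guard are illustrative assumptions, added so that a handful of low scores on quiet traffic does not page anyone.

```python
from collections import deque


class RollingFaithfulnessMonitor:
    """Keeps faithfulness scores from the last `window_seconds` and flags a P0 condition."""

    def __init__(self, window_seconds: float = 1800.0, page_below: float = 0.65):
        self.window_seconds = window_seconds
        self.page_below = page_below
        self._samples: deque[tuple[float, float]] = deque()  # (timestamp, score)

    def record(self, timestamp: float, score: float) -> None:
        """Add a score and drop samples that have aged out of the window."""
        self._samples.append((timestamp, score))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def mean(self) -> float:
        if not self._samples:
            return 0.0
        return sum(score for _, score in self._samples) / len(self._samples)

    def should_page(self, min_samples: int = 20) -> bool:
        """True when the window holds enough samples and their mean is below the threshold."""
        return len(self._samples) >= min_samples and self.mean() < self.page_below


monitor = RollingFaithfulnessMonitor()
for t in range(30):
    monitor.record(timestamp=float(t), score=0.60)   # a burst of poorly grounded answers
print("page during incident:", monitor.should_page())

for t in range(2000, 2030):
    monitor.record(timestamp=float(t), score=0.90)   # quality recovers later on
print("page after recovery :", monitor.should_page())
```

The same structure extends to the P1 rule by comparing the window mean against a stored 7-day baseline instead of a fixed threshold.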
Python — Phoenix OpenTelemetry Instrumentation
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start Phoenix server (or connect to hosted Arize Phoenix)
session = px.launch_app()  # starts at http://localhost:6006

# Register OpenTelemetry tracer: instruments all LLM calls automatically
tracer_provider = register(
    project_name="product-qa-rag-prod",
    endpoint="http://localhost:4317",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From now on, every LLM call is automatically traced.
# No code changes to your application required.
#
# Phoenix captures:
# - Full prompt + completion text
# - Retrieved chunks with scores
# - Latency per span (embed / retrieve / generate)
# - Token counts and estimated cost
# - User feedback (if you add feedback spans)

# Run eval on traced data: evaluate a production sample
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)

eval_model = OpenAIModel(model="gpt-4o")  # judge model; each evaluator requires one
ds = px.Client().get_spans_dataframe(project_name="product-qa-rag-prod")
# run_evals expects the columns the evaluator templates use (input, output, reference);
# rename or reshape the span columns first if they differ.

hallucination_eval, qa_eval, relevance_eval = run_evals(
    dataframe=ds,
    evaluators=[
        HallucinationEvaluator(eval_model),
        QAEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
    ],
    provide_explanation=True,
)
15

Eval Maturity Model

// WHERE ARE YOU, AND WHAT TO BUILD NEXT

Eval infrastructure matures progressively. Organizations that try to build everything at once ship nothing. Start with manual evaluation, add offline automation, then layer production monitoring. Each stage enables the next.

| Stage | Name | Capabilities | Primary Tooling | KPI |
| --- | --- | --- | --- | --- |
| 01 | Ad-hoc | Manual review of outputs. "Looks good to me." No reproducibility. No tracking. | Spreadsheet, notebooks | Vibes |
| 02 | Benchmarked | Labeled eval set exists. Can run RAGAS offline. Track scores per version manually. | RAGAS, CSV, notebooks | RAGAS score per version |
| 03 | Automated | Eval runs in CI on every PR. Merge blocked on regression. Scores tracked in DB. | DeepEval + GH Actions + TruLens | Zero regressions shipped |
| 04 | Monitored | Production trace monitoring. Online eval on sampled traffic. Automated alerts. | Phoenix, LangSmith, Arize | MTTD < 30 min for quality drops |
| 05 | Self-improving | Automated failure analysis. Eval data mined from production disagreements. Feedback loop to training data curation. | Custom + RLHF pipeline | Weekly eval score trend > 0 |
Reference implementations & further reading: github.com/explodinggradients/ragas · github.com/truera/trulens · github.com/confident-ai/deepeval · github.com/Arize-ai/phoenix · BEIR benchmark: github.com/beir-cellar/beir · MTEB (embedding model leaderboard): huggingface.co/spaces/mteb/leaderboard · HELM: crfm.stanford.edu/helm · Garak (LLM safety eval): github.com/leondz/garak