LLM Evaluation Field Handbook · Updated July 2026

Prove Your RAG Is Getting Better

A rigorous engineering guide to LLM evaluation — covering RAGAS, TruLens, DeepEval, the mathematics behind retrieval and generation metrics, the 2026 shift toward agentic and tool-use benchmarks, and how to wire it all into CI/CD so quality regressions never ship.

"You can't improve what you can't measure. In LLM systems, this is a mathematical statement, not a platitude."

RAGAS · TruLens · DeepEval · LLM-as-Judge · RAG Eval · Agentic Benchmarks · NIST AI RMF

Why Evaluation Matters

// THE PROBLEM WITH VIBES-BASED DEVELOPMENT

LLM systems fail silently. A prompt change that improves one query type degrades five others. A new retriever version gets better recall but retrieves noisier chunks. Without systematic evaluation, you are flying blind — "it seems better" is not a shipping criterion.

The evaluation gap is the defining engineering challenge of LLM development. Traditional software has deterministic outputs you can assert against. LLMs produce probabilistic, open-ended text — requiring a new discipline that combines information retrieval theory, NLP metrics, and statistical testing.

Offline Evaluation

Run against a held-out labeled dataset before deployment. Catches regressions before users see them. Gives you a reproducible benchmark. Fast feedback loop in CI.

Online Evaluation

Monitor real production traffic with automated quality signals. Detects distribution shift, emerging failure modes, and per-user degradation that offline sets can't anticipate.

Human Evaluation

The ground truth. Use for calibrating automated metrics, adjudicating edge cases, and building preference datasets. Expensive — use strategically, not as a primary loop.

The evaluation triad: Every LLM eval system must track three things simultaneously — quality (does it answer correctly?), safety (does it avoid harmful outputs?), and cost/latency (is it fast and cheap enough?). Optimizing one without tracking the others produces systems that degrade on the ignored dimensions.

Evaluation Taxonomy

// WHAT ARE YOU ACTUALLY MEASURING

Not all eval metrics are created equal. The choice of metric depends on what component you're evaluating, what failure mode you're guarding against, and whether you have ground-truth labels available.

Metric Type	Requires Ground Truth?	What It Measures	Typical Tools
Reference-based	YES	Similarity of generated output to a gold-standard answer (BLEU, ROUGE, BERTScore, Exact Match)	NLTK, HuggingFace evaluate
Reference-free	NO	Internal quality signals: faithfulness, coherence, toxicity — evaluated without a gold answer	RAGAS, TruLens, DeepEval
LLM-as-Judge	OPTIONAL	A strong LLM (GPT-4, Claude) scores the response on defined rubrics. Flexible, scalable, expensive	OpenAI Evals, LangSmith, RAGAS
Retrieval-specific	YES	Precision, Recall, NDCG, MRR — measures whether the right chunks were retrieved	BEIR, RAGAS, custom IR pipelines
Task-specific	YES	Custom metrics for your domain: F1 on entity extraction, accuracy on MCQ, slot-fill rate for agents	DeepEval, custom, HELM
Safety / Red-team	NO	Jailbreak resistance, hallucination rate on known-false claims, PII leakage, bias metrics	Garak, PromptBench, AI Fairness 360
Trajectory-based (agentic)	OPTIONAL	Whether a multi-step agent took the correct reasoning path to a result — not just whether the final answer was right. Goal completion vs. process correctness are scored separately	τ-bench, GAIA, BFCL, Inspect AI

The BLEU trap: BLEU (machine translation) and ROUGE (summarization) date from the early 2000s. They measure n-gram overlap, which correlates poorly with quality for modern open-ended LLM outputs: a response that is semantically identical but phrased differently can score near zero, while a wrong answer that copies the reference wording scores high. Do not use BLEU/ROUGE as primary metrics for RAG QA; use BERTScore, semantic similarity, or LLM-as-judge instead.
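
To make the overlap problem concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU (without the brevity penalty or the geometric mean over n). The example sentences are made up for illustration; whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram precision of candidate against a single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_ngrams.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    return overlap / total


# Hypothetical sentences, chosen only to illustrate the failure mode
reference = "refunds are accepted within 30 days of purchase"
paraphrase = "you can send items back for a month after buying"   # correct, reworded
wrong_answer = "refunds are accepted within 90 days of purchase"  # incorrect, near-copy

print(ngram_precision(paraphrase, reference, 1))    # 0.0
print(ngram_precision(wrong_answer, reference, 1))  # 0.875
```

The correct paraphrase shares no tokens with the reference and scores 0.0, while the factually wrong near-copy scores 0.875 on unigrams.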

RAG Evaluation Primer

// THE FOUR FAILURE MODES OF A RAG PIPELINE

Retrieval-Augmented Generation has two independently-failing components — the retriever and the generator. A bug in either produces wrong answers but in very different ways. You need metrics that isolate each component, plus end-to-end metrics that evaluate the full pipeline.

Input (User Query) → Retriever (Chunk Selection) → Context (Retrieved Chunks) → Generator (LLM Response) → Output (Final Answer)

Retriever Failure Modes Upstream

Low precision: Retrieved chunks are irrelevant noise — LLM hallucinates to fill gaps
Low recall: Relevant chunks are missing — LLM can't answer even with perfect generation
Wrong ordering: The most relevant chunk is ranked low or buried mid-context, where long-context models attend to it least (the lost-in-the-middle effect)
Chunk granularity: Chunks too large (dilution) or too small (missing context)

Generator Failure Modes Downstream

Hallucination: Claims facts not present in retrieved context
Context ignorance: Ignores retrieved context, answers from parametric memory
Unfaithfulness: Contradicts or misrepresents what the context says
Verbosity / conciseness: Answers are padded or missing key detail

The Core Evaluation Triangle

Faithfulness

% of claims in answer that are grounded in retrieved context

Answer Relevance

How completely and precisely the answer addresses the query

Context Precision

What fraction of retrieved chunks are actually relevant

Context Recall

What fraction of needed info was present in retrieved context

RAGAS

// RETRIEVAL AUGMENTED GENERATION ASSESSMENT

RAGAS (Retrieval Augmented Generation Assessment) is one of the most widely adopted frameworks for reference-free RAG evaluation. It uses an LLM evaluator to score your RAG pipeline across four core metrics, most of which need no manually labeled gold answers, which makes it practical to run at scale.

Reference-free by design: RAGAS's key insight is that you can evaluate RAG quality using only three inputs — the question, the retrieved context, and the generated answer. No ground-truth required for Faithfulness and Answer Relevance. Context Recall does require a ground-truth answer as proxy for "all necessary information."

The Four RAGAS Metrics

Metric 1 — Faithfulness

Faithfulness = |verified_claims| / |total_claims_in_answer|

Decompose the answer into atomic claims. An LLM judge verifies each claim against the retrieved context. Score = fraction of claims verified, in the range [0, 1]. As a rule of thumb, a score below 0.8 signals enough ungrounded content to warrant investigation.
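
As a worked instance of the formula, the sketch below computes faithfulness from per-claim verdicts. The claims and verdicts are hand-written stand-ins for what an LLM judge would return; the function name is illustrative and not part of the RAGAS API.

```python
def faithfulness_from_verdicts(claim_verdicts: dict[str, bool]) -> float:
    """Fraction of answer claims that the judge marked as supported by context."""
    if not claim_verdicts:
        raise ValueError("the answer produced no claims to verify")
    supported = sum(1 for is_supported in claim_verdicts.values() if is_supported)
    return supported / len(claim_verdicts)


# Hypothetical verdicts for a three-claim answer
verdicts = {
    "Returns are accepted within 30 days": True,
    "Items must be unused": True,
    "Refunds are issued as store credit": False,  # not stated in the context
}

score = faithfulness_from_verdicts(verdicts)
print(f"faithfulness = {score:.2f}")  # faithfulness = 0.67
```

Two of three claims are grounded, so the score is 2/3. The unsupported third claim is exactly what the metric is designed to surface.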

Metric 2 — Answer Relevance

AR = mean( cosine_similarity( embed(q), embed(q̂ᵢ) ) ) for i in 1..n

Generate n synthetic questions from the answer. Embed the original query and each synthetic question. Mean cosine similarity = how well the answer "covers" the query. Low score = off-topic or incomplete.
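
The averaging step can be sketched with toy vectors. The two-dimensional embeddings below are invented for illustration; in practice they would come from an embedding model, and the synthetic questions from an LLM.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevance(query_vec: list[float], synthetic_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the query and each synthetic question."""
    sims = [cosine_similarity(query_vec, vec) for vec in synthetic_vecs]
    return sum(sims) / len(sims)


# Toy embeddings (assumed): one on-topic, one unrelated, one partially related
query_vec = [1.0, 0.0]
synthetic_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

score = answer_relevance(query_vec, synthetic_vecs)
print(f"answer relevance = {score:.3f}")  # answer relevance = 0.569
```

The unrelated synthetic question drags the mean down, which is how an answer that wanders off-topic ends up with a low score.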

Metric 3 — Context Precision

CP@k = Σᵢ₌₁ᵏ (Precisionᵢ × relevantᵢ) / |relevant_chunks_in_top_k|

Weighted precision that rewards relevant chunks appearing at higher positions. Penalizes retrievers that bury the relevant chunk at position k. This is Average Precision from the IR literature, computed over the top-k list.
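
A minimal sketch of the formula, assuming relevance has already been judged per chunk as 0 or 1 in rank order. The function name is illustrative, not the RAGAS API.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Average of Precision@i over the positions i that hold a relevant chunk."""
    hits = 0
    weighted_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / position  # Precision@position
    return weighted_sum / hits if hits else 0.0


# Same two relevant chunks, different ordering
print(round(context_precision_at_k([1, 0, 1]), 4))  # 0.8333
print(round(context_precision_at_k([0, 1, 1]), 4))  # 0.5833
```

Both rankings contain two relevant chunks out of three, but the one that places a relevant chunk first scores higher, which is the ordering sensitivity described above.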

Metric 4 — Context Recall

CR = |ground_truth_sentences_attributed_to_context| / |ground_truth_sentences|

Decompose the ground-truth answer into sentences. LLM judge checks if each sentence can be attributed to at least one retrieved chunk. Requires a ground-truth answer.

Installation & Quick Start

Python — RAGAS Quick Start

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness,    # requires ground truth
    answer_similarity,     # semantic similarity to ground truth
)
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings   # judge LLM and embedding wrappers

# Build your evaluation dataset
# Each row: question, answer, contexts (list of chunks), ground_truth
# The trailing "..." entries stand for further rows; replace them with real data before running
eval_data = {
    "question":     ["What is the return policy?", ...],
    "answer":       ["Returns accepted within 30 days...", ...],
    "contexts":     [["Our return policy states...", "Items must be..."], ...],
    "ground_truth": ["Returns are accepted within 30 days of purchase", ...],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation (the judge model is set explicitly via llm=)
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=ChatOpenAI(model="gpt-4o"),          # or Claude, Gemini
    embeddings=OpenAIEmbeddings(),
    raise_exceptions=False,
    show_progress=True,
)

# Results as pandas DataFrame
df = result.to_pandas()
print(result)
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_precision': 0.82, 'context_recall': 0.79}

RAGAS Score Interpretation

Metric	≥ 0.9 (Excellent)	0.7–0.9 (Acceptable)	< 0.7 (Needs Work)	Primary Root Cause
Faithfulness	Minimal hallucination	Some ungrounded claims	High hallucination rate	Weak system prompt, too-short context, aggressive summarization
Answer Relevancy	On-topic, complete	Partially addressed	Off-topic / vague	Poor retrieval, bad prompt structure, over-verbose LLM
Context Precision	Retriever very precise	Some noise in context	Too much irrelevant noise	Low similarity threshold, bad chunking, wrong embedding model
Context Recall	All info retrieved	Most info present	Key info missing	top-k too small, chunks too large, suboptimal embedding

TruLens

// FEEDBACK FUNCTIONS & THE RAG TRIAD

TruLens (by TruEra) takes a different architectural approach from RAGAS — it instruments your LLM application at the function level, logging every LLM call, and attaches feedback functions that score interactions. This makes it ideal for continuous monitoring of production systems.

TruLens RAG Triad Core Model

TruLens maps its feedback functions explicitly to three relationships in a RAG pipeline:

Answer Relevance: Answer ↔ Question — does the answer address the question?
Context Relevance: Context ↔ Question — does the retrieved context match what was asked?
Groundedness: Answer ↔ Context — is the answer grounded in (not hallucinated from) the context?

Continuous Recording Production

TruLens wraps your chain/app in a TruChain or TruLlama recorder. Every call is stored in a local SQLite (or cloud) database with full input/output traces, latency, token cost, and computed feedback scores. Inspect in the TruLens dashboard or query programmatically.

Python — TruLens RAG Triad Setup

from trulens.core import TruSession, Feedback
from trulens.providers.openai import OpenAI as TruOpenAI
from trulens.apps.langchain import TruChain
import numpy as np

session = TruSession()          # stores results in local SQLite

provider = TruOpenAI(model_engine="gpt-4o")  # LLM-as-judge provider

# ── Define the three feedback functions ──────────────────────

# Selector for the retrieved chunks; select_context needs the chain it should inspect
context = TruChain.select_context(your_langchain_rag_chain)

# 1. Context Relevance: each chunk vs the query
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()           # question
    .on(context)          # each retrieved chunk
    .aggregate(np.mean)   # mean over all chunks
)

# 2. Groundedness: answer claims vs context
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())  # all chunks as list
    .on_output()            # final answer
)

# 3. Answer Relevance: answer vs question
f_answer_relevance = (
    Feedback(provider.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

# ── Wrap your LangChain RAG chain ────────────────────────────
tru_recorder = TruChain(
    your_langchain_rag_chain,       # your existing chain
    app_name="product-qa-rag",
    app_version="v2.3.1",          # version tag for regression tracking
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

# ── Run evaluation ───────────────────────────────────────────
with tru_recorder as recording:
    for question in eval_questions:
        response = your_langchain_rag_chain.invoke({"query": question})

# View results
session.get_leaderboard()       # compare across app_version tags
session.run_dashboard()         # launch local Streamlit dashboard

TruLens Leaderboard — Version Comparison

The TruLens leaderboard pattern is the key mechanism for proving improvement. Tag each experiment with a version string and compare across runs — the leaderboard shows all three RAG triad scores plus latency and cost per version.

Python — Programmatic Regression Check

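# Compare v2.3.1 against the v2.3.0 baseline
# Assumption: the leaderboard is indexed by app name and version, so flatten it first;
# exact column names can differ between TruLens releases
leaderboard = session.get_leaderboard().reset_index()
leaderboard = leaderboard[leaderboard["app_name"] == "product-qa-rag"]
print(leaderboard[["app_version", "Groundedness", "Answer Relevance", "Context Relevance", "latency", "total_cost"]])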

# app_version  Groundedness  Answer Relevance  Context Relevance  latency  total_cost
# v2.3.0       0.74          0.82              0.71               1.23s    $0.0042
# v2.3.1       0.91          0.88              0.85               1.31s    $0.0045  ← improved

# Fail CI if any metric drops more than 0.05 (absolute) below the baseline
baseline = leaderboard[leaderboard["app_version"] == "v2.3.0"].iloc[0]
current  = leaderboard[leaderboard["app_version"] == "v2.3.1"].iloc[0]

threshold = 0.05
for metric in ["Groundedness", "Answer Relevance", "Context Relevance"]:
    delta = current[metric] - baseline[metric]
    if delta < -threshold:
        raise AssertionError(f"{metric} regressed by {delta:.2%} — blocking merge")

DeepEval

// PYTEST FOR LLM SYSTEMS

DeepEval takes a testing-first philosophy — it integrates directly with pytest and provides a rich library of LLM-specific metrics as assert-able test cases. Think of it as "unit tests for your prompts." It covers RAG metrics, conversational metrics, agentic task metrics, and safety checks in a single framework.

G-Eval Flexible

Define custom evaluation criteria in plain English. The LLM judge generates an evaluation chain-of-thought, then produces a score from 0 to 10, which DeepEval normalizes to the 0–1 range that metric thresholds use. Best for domain-specific quality attributes without coding a custom metric.

DAG Metrics Agentic

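Build a metric as a Directed Acyclic Graph of LLM-judged decision nodes, so multi-criteria scoring follows a fixed decision tree rather than a single free-form rubric. For multi-step agent runs, pair it with DeepEval's task-completion and tool-correctness metrics to cover task completion rate, trajectory correctness, and efficiency.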

Conversational Multi-turn

Role-consistency, knowledge retention across turns, and conversation completeness for chatbot applications. Goes beyond single-turn RAG evaluation.

Python — DeepEval Pytest Integration

# test_rag_quality.py  — run with: deepeval test run test_rag_quality.py

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    HallucinationMetric,
    ToxicityMetric,
    GEval,
)


# load_test_cases() and rag_pipeline are project-specific helpers, not part of DeepEval
@pytest.mark.parametrize("test_case", load_test_cases("eval_dataset.json"))
def test_rag_pipeline(test_case):
    # Run your RAG pipeline on the test case
    result = rag_pipeline.query(test_case.input)

    llm_test_case = LLMTestCase(
        input=test_case.input,
        actual_output=result.answer,
        expected_output=test_case.expected,        # ground truth (optional)
        retrieval_context=result.retrieved_chunks,
        context=test_case.reference_documents,
    )

    assert_test(llm_test_case, [
        FaithfulnessMetric(threshold=0.85, model="gpt-4o"),
        AnswerRelevancyMetric(threshold=0.80),
        ContextualPrecisionMetric(threshold=0.75),
        ContextualRecallMetric(threshold=0.75),
        HallucinationMetric(threshold=0.15),     # hallucination RATE — lower is better
        ToxicityMetric(threshold=0.05),
    ])


# Custom G-Eval — domain-specific quality criterion
conciseness_metric = GEval(
    name="Conciseness",
    model="gpt-4o",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    evaluation_steps=[
        "Check that the answer does not contain unnecessary padding or repetition",
        "Verify the answer directly addresses the question without preamble",
        "Confirm key information is present without verbosity",
    ],
    threshold=0.7,
)


def test_conciseness():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output=rag_pipeline.query("What is the capital of France?").answer,
    )
    assert_test(test_case, [conciseness_metric])

DeepEval CLI: deepeval test run integrates with CI and outputs structured JSON results. Use deepeval login to sync results to the Confident AI cloud dashboard for trend tracking. A metric that falls below its threshold fails the test and the CI job with it; the --exit-on-first-failure flag, used in the CI pipeline later in this handbook, stops the run at the first failing test case, which keeps PR gates fast.

HELM, Promptfoo, LangSmith & the 2026 Field

// CHOOSING THE RIGHT TOOL FOR THE JOB

The eval tooling category matured fast between 2024 and 2026 — from two academic references (RAGAS, TruLens) to a full stack of open-source libraries plus a commercial observability tier. No single framework wins every dimension; most mature teams run two or three of these in parallel, one per lifecycle stage.

Framework	Best For	Approach	Self-hosted?	RAG Native?
RAGAS	RAG evaluation, fast offline benchmarking	Reference-free LLM-as-judge	YES	YES
TruLens	Production monitoring, RAG Triad tracing	Feedback functions + OpenTelemetry recording	YES	YES
DeepEval	Test-driven development, CI integration, agent/chatbot eval	Pytest plugin, 50+ metrics	YES	YES
Promptfoo	Prompt regression testing, CLI-driven red-teaming, cross-model matrix testing	Config/YAML-driven comparisons across providers	YES	PARTIAL
Braintrust	Eval-driven development loop, prompt playground + scoring in one workflow	Hosted eval + logging platform, SDK-first	NO	YES
LangSmith	Full LLM ops: tracing + eval + monitoring, dataset curation	Cloud trace store + eval runners + experiments	NO	YES
Phoenix (Arize)	ML observability, embedding drift, OSS production tracing	OpenTelemetry traces + eval functions	YES	YES
W&B Weave	Experiment tracking teams extending into LLM eval	Weave ops tracing + scorer functions	NO	PARTIAL
HELM	Multi-model benchmarking, academic comparison	Task-based standardized scenarios	YES	PARTIAL

Recommended stack (2026 practitioner consensus): RAGAS for fast offline experimentation while tuning chunking/embeddings, DeepEval as the pytest-native CI gate on a curated golden set, and TruLens or Phoenix for production trace monitoring. Set production floors around faithfulness ≥ 0.75, answer relevancy ≥ 0.80, context precision ≥ 0.70, context recall ≥ 0.80. The floors are deliberately asymmetric: context precision gets the lowest one because some retrieval noise is more tolerable than missing information, and teams for whom an ungrounded claim is costly should raise the faithfulness floor (the DeepEval gate in this handbook uses 0.85). These tools are complementary, not competing: a faithfulness regression caught in DeepEval maps directly onto a groundedness dip in TruLens.
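
The floors above can be encoded as a small gate. This is a sketch under assumptions: the function and constant names are hypothetical, and the sample scores are the illustrative values from the RAGAS quick start output earlier in this handbook.

```python
# Floors taken from the recommendation above; tune them for your own system
PRODUCTION_FLOORS = {
    "faithfulness": 0.75,
    "answer_relevancy": 0.80,
    "context_precision": 0.70,
    "context_recall": 0.80,
}


def failing_metrics(
    scores: dict[str, float],
    floors: dict[str, float] = PRODUCTION_FLOORS,
) -> dict[str, tuple]:
    """Return {metric: (observed, floor)} for every metric below its floor or missing."""
    failures = {}
    for name, floor in floors.items():
        value = scores.get(name)
        if value is None or value < floor:
            failures[name] = (value, floor)
    return failures


scores = {
    "faithfulness": 0.87,
    "answer_relevancy": 0.91,
    "context_precision": 0.82,
    "context_recall": 0.79,
}
print(failing_metrics(scores))  # {'context_recall': (0.79, 0.8)}
```

A missing metric is treated as a failure on purpose, so a crashed scorer cannot silently pass the gate.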

TruLens post-acquisition note: Snowflake acquired TruEra (TruLens's creator) in 2024. TruLens remains open-source and self-hostable, but its release cadence has slowed relative to DeepEval and Promptfoo. It is still a strong pick for teams that want evaluation and OpenTelemetry tracing unified in one workflow, particularly in regulated industries with an existing Snowflake footprint.

The shared blind spot: RAGAS, TruLens, and DeepEval all operate at the inference layer — they can catch a response that contradicts its retrieved context. None of them can catch a response that faithfully repeats retrieved context that was itself wrong or stale (a stale metric definition in your index, an outdated policy doc). Faithfulness scores stay high while the answer is confidently incorrect. Catching that class of failure requires context/data-layer governance, not inference-layer eval — a separate discipline from everything else in this handbook.

Retrieval Metrics Deep-Dive

// THE MATHEMATICS OF FINDING THE RIGHT CHUNK

Before evaluating the generator, validate the retriever in isolation. These metrics come from classical Information Retrieval and have precise mathematical definitions. Use the BEIR benchmark for zero-shot evaluation of your embedding model.

Precision@k — What fraction of top-k retrieved docs are relevant?

Precision@k = |relevant_docs ∩ retrieved_top_k| / k

Use when the cost of irrelevant context is high (LLM gets confused by noise). A retriever with P@5=0.6 means 3 of 5 retrieved chunks are relevant.

Recall@k — What fraction of all relevant docs are in top-k?

Recall@k = |relevant_docs ∩ retrieved_top_k| / |all_relevant_docs|

Use when missing relevant context is catastrophic (can't answer without it). For QA, recall is typically more important than precision — retrieve more, let the LLM filter.
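
Both definitions fit in a few lines. The sketch below assumes document IDs are unique within the ranked list; the IDs themselves are made up.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(relevant)


# Hypothetical ranking: 3 of the 5 retrieved chunks are relevant, 1 relevant chunk is missed
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d4", "d8"}

print(precision_at_k(retrieved, relevant, 5))  # 0.6
print(recall_at_k(retrieved, relevant, 5))     # 0.75
```

The same ranking yields P@5 = 0.6 and Recall@5 = 0.75, so the two metrics can disagree about how good a retriever is, which is why both are reported.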

Mean Reciprocal Rank — How high is the FIRST relevant doc?

MRR = (1/|Q|) × Σᵢ (1 / rankᵢ)

rankᵢ = position of the first relevant document for query i. MRR penalizes systems that bury the answer at position 4 or 5. Useful for spotting rankings that push the first relevant chunk down the list, where lost-in-the-middle effects become more likely.
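
A short sketch of MRR over binary relevance flags, one list per query in rank order. Queries with no relevant result contribute 0, which is a common convention assumed here.

```python
def mean_reciprocal_rank(ranked_relevance: list[list[int]]) -> float:
    """Mean over queries of 1 / rank of the first relevant result (0 if none)."""
    total = 0.0
    for flags in ranked_relevance:
        for rank, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)


# Three toy queries: hit at rank 1, hit at rank 3, no hit
runs = [
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
mrr = mean_reciprocal_rank(runs)
print(f"MRR = {mrr:.4f}")  # MRR = 0.4444
```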

NDCG@k — Normalized Discounted Cumulative Gain (graded relevance)

DCG@k = Σᵢ₌₁ᵏ relevanceᵢ / log₂(i + 1)
NDCG@k = DCG@k / IDCG@k

IDCG = ideal DCG (perfect ranking). NDCG supports graded relevance (0=irrelevant, 1=somewhat, 2=highly relevant). The log₂ denominator discounts lower positions — getting relevant docs to rank 1 matters more than rank 5.
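
The two formulas can be sketched as follows. One simplifying assumption is labeled in the code: the ideal ranking is built by re-sorting the grades of the retrieved list itself, whereas a full evaluation would use the grades of all judged documents for the query.

```python
import math


def dcg_at_k(relevances: list[int], k: int) -> float:
    """Discounted cumulative gain over the first k graded results."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: list[int], k: int) -> float:
    """DCG normalized by the DCG of the best possible ordering."""
    # Assumption: ideal ordering is derived from the retrieved grades only
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Grades: 2 = highly relevant, 1 = somewhat, 0 = irrelevant
print(dcg_at_k([2, 0, 1], 3))             # 2.5
print(round(ndcg_at_k([2, 0, 1], 3), 4))  # 0.9502
print(ndcg_at_k([2, 1, 0], 3))            # 1.0
```

Swapping the rank-2 and rank-3 results costs about five points of NDCG here, while the perfectly ordered list scores exactly 1.0.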

Retriever Benchmarking with BEIR

Python — BEIR Zero-shot Eval

from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval import models
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Load a BEIR dataset (e.g., FIQA for financial QA, NFCorpus for medical)
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Swap in your embedding model
model = DRES(models.SentenceBERT("BAAI/bge-large-en-v1.5"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim", k_values=[1, 3, 5, 10])
results = retriever.retrieve(corpus, queries)

ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print("NDCG@10:", ndcg["NDCG@10"])    # 0.412 (BGE-large on FIQA)
print("Recall@5:", recall["Recall@5"])  # 0.589

Generation Metrics Deep-Dive

// BEYOND BLEU — SEMANTIC SIMILARITY AND NLI

Generation quality metrics fall into two families: lexical (token overlap — fast, cheap, poor correlation with quality) and semantic (embedding/model-based — expensive, strong correlation). Use semantic metrics for production evaluation.

Metric	Method	Requires GT?	Correlation w/ Human	Cost
BLEU	n-gram precision vs reference	YES	Low (QA)	Free
ROUGE-L	Longest common subsequence	YES	Medium (summ.)	Free
BERTScore	Contextual embedding similarity (F1)	YES	High	Low (local model)
Semantic Similarity	Cosine sim of sentence embeddings	YES	High	Low (local model)
NLI Entailment	NLI model: does context entail answer?	NO	High	Low (local model)
LLM-as-Judge	GPT-4/Claude rates on rubric	OPTIONAL	Very High	High ($)

BERTScore — token-level precision, recall, F1 using BERT embeddings

P_BERT = (1/|ŷ|) × Σ_{ŷₜ ∈ ŷ} max_{yₜ ∈ y} cosine(ŷₜ, yₜ)
R_BERT = (1/|y|) × Σ_{yₜ ∈ y} max_{ŷₜ ∈ ŷ} cosine(yₜ, ŷₜ)
F_BERT = 2 × (P_BERT × R_BERT) / (P_BERT + R_BERT)

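ŷ = generated tokens, y = reference tokens. Each token is matched to its closest semantic counterpart in the other sequence. F_BERT ≥ 0.85 on DeBERTa typically indicates high quality, but raw BERTScore values depend on the model and cluster in a narrow band, so calibrate the cutoff on your own labeled pairs.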
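
The greedy matching behind these three formulas can be sketched with toy token vectors. The two-dimensional vectors are invented for illustration; the real metric uses contextual embeddings from a transformer and optional IDF weighting, which this sketch omits.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_bertscore(
    candidate: list[list[float]],
    reference: list[list[float]],
) -> tuple[float, float, float]:
    """Return (precision, recall, F1) using best-match cosine per token."""
    # Precision: each generated token is matched to its closest reference token
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its closest generated token
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy embeddings (assumed): the reference has one token the candidate only partly covers
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

p, r, f1 = greedy_bertscore(candidate, reference)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=1.000 R=0.902 F1=0.949
```

Every generated token has an exact counterpart, so precision is 1.0; recall drops because one reference token is only partially matched.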

Python — Generation Metrics Suite

from bert_score import score as bertscore
from sentence_transformers import SentenceTransformer, util as st_util
from transformers import pipeline

# ── BERTScore ─────────────────────────────────────────────────
predictions = ["The return window is 30 days from purchase."]
references  = ["Items can be returned within 30 days of the purchase date."]

P, R, F1 = bertscore(predictions, references, model_type="microsoft/deberta-xlarge-mnli")
print(f"BERTScore F1: {F1.mean():.4f}")   # → 0.9341

# ── Semantic Similarity ───────────────────────────────────────
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
emb_pred = embedder.encode(predictions, convert_to_tensor=True)
emb_ref  = embedder.encode(references,  convert_to_tensor=True)
sim = st_util.cos_sim(emb_pred, emb_ref)
print(f"Semantic Similarity: {sim.item():.4f}")   # → 0.9627

# ── NLI Entailment (reference-free faithfulness) ───────────────
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-small")
context  = "Our return policy allows returns within 30 days of purchase."
answer   = "You have 30 days to return any item."

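# Pass premise and hypothesis as a text pair so the tokenizer inserts the separator itself
result = nli({"text": context, "text_pair": answer})
print(result)   # e.g. a label of 'entailment' with its score; label casing depends on the model config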
# ENTAILMENT = answer is grounded in context  ✓
# CONTRADICTION = answer contradicts context  ✗ (hallucination)
# NEUTRAL = answer is neither supported nor contradicted

LLM-as-Judge

// USING AI TO EVALUATE AI — RESPONSIBLY

LLM-as-judge is currently the highest-quality scalable evaluation method. A strong judge model (Claude Opus, GPT-5-class, or Gemini 3 Pro) evaluates responses against a structured rubric. The quality of your evaluation is determined by the quality of your rubric — vague rubrics produce noisy, unreliable scores.

LLM Judge Failure Modes Know These

Verbosity bias: Longer answers score higher regardless of quality
Self-preference: GPT-4 prefers GPT-4 outputs; Claude prefers Claude outputs
Position bias: In pairwise evals, Option A scores higher if listed first
Sycophancy: Agrees with user opinion expressed in the prompt
Format sensitivity: Score changes based on bullet vs. prose presentation

Mitigations Best Practices

Chain-of-thought: Ask judge to reason step-by-step before scoring
Calibration examples: Include 2–3 scored examples (few-shot) in system prompt
Position swapping: Run each pairwise eval twice with the A/B order swapped, then take the mean
Cross-family judge: Use a judge from a different model family than the generator, since same-family judges tend to be too forgiving of their own family's outputs
Human validation: Measure judge-human correlation against human labels (target ≥ 0.75 Spearman)
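
The position-swap mitigation can be sketched as a thin wrapper around any pairwise judge. The judge signature (returning the probability that the first-listed answer is better) and the toy biased judge are assumptions made for illustration, not a real provider API.

```python
from typing import Callable

# Assumed signature: judge(question, first_answer, second_answer) -> P(first is better)
PairwiseJudge = Callable[[str, str, str], float]


def debiased_preference(
    judge: PairwiseJudge,
    question: str,
    answer_1: str,
    answer_2: str,
) -> float:
    """Preference for answer_1, averaged over both presentation orders."""
    forward = judge(question, answer_1, answer_2)         # answer_1 listed first
    backward = 1.0 - judge(question, answer_2, answer_1)  # answer_1 listed second
    return (forward + backward) / 2.0


def position_biased_judge(question: str, first: str, second: str) -> float:
    """Toy stand-in for an LLM judge that always favors whatever is listed first."""
    return 0.7


pref = debiased_preference(position_biased_judge, "Return window?", "30 days", "one month")
print(round(pref, 2))  # 0.5
```

A judge whose only signal is position ends up at 0.5 after swapping, which is the desired outcome: pure position bias cancels out, and any remaining margin reflects a preference that survives both orderings.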

Python — Production-Grade LLM Judge

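import json

from openai import OpenAI

openai_client = OpenAI()   # reads OPENAI_API_KEY from the environment

# High-quality rubric with chain-of-thought scoring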
JUDGE_SYSTEM_PROMPT = """You are an expert evaluator for a customer support RAG system.
Evaluate the assistant's response on FAITHFULNESS — the degree to which
every factual claim in the response is directly supported by the provided context.

Scoring rubric:
  5 — Every claim is explicitly supported by the context. No extrapolation.
  4 — Nearly all claims supported; one minor inference beyond the context.
  3 — Most claims supported; one unsupported factual claim present.
  2 — Several claims not grounded in context; noticeable hallucination.
  1 — Response contains significant fabricated information.

Calibration examples:
  Context: "Shipping takes 3-5 business days."
  Response: "Your order will arrive in 3 to 5 business days."
  Score: 5 — Direct restatement, fully grounded.

  Context: "Shipping takes 3-5 business days."
  Response: "Your order ships from our Chicago warehouse in 3-5 days."
  Score: 3 — 'Chicago warehouse' is fabricated (one unsupported claim).

You MUST respond with valid JSON only: {"reasoning": "...", "score": N}"""

def judge_faithfulness(question: str, context: str, answer: str) -> dict:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": f"""
Question: {question}
Context: {context}
Response to evaluate: {answer}

Evaluate faithfulness. Respond with JSON only."""},
        ],
        temperature=0.0,   # reduces run-to-run variance; outputs are still not guaranteed identical
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Returns: {"reasoning": "The answer states '30 days' which matches...", "score": 5}

Evaluation Dataset Construction

// GARBAGE IN, GARBAGE OUT — YOUR EVAL SET IS YOUR SPEC

Your evaluation dataset defines what "quality" means for your system. A weak eval set gives you false confidence. A strong eval set is the most valuable artifact in your LLM development process — treat it like production code: version it, review it, maintain it.

Synthetic Generation (RAGAS TestsetGenerator)

Bootstrap your eval set from your own document corpus. RAGAS generates question/answer pairs from your documents using an LLM, covering simple factual, multi-context reasoning, and conditional question types. Rapid start — validate samples before trusting scores.

Real User Queries (Production Mining)

Sample real queries from production logs. These represent actual user intent distribution — far more valuable than synthetic. Label a random sample weekly. 200 labeled examples is a practical minimum for statistically meaningful metrics.

Python — Synthetic Eval Set via RAGAS

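# NOTE: this snippet uses the ragas 0.1.x testset API; later releases reorganized testset generation
from ragas.testset import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context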
from langchain_community.document_loaders import DirectoryLoader

# Load your knowledge base documents
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

generator = TestsetGenerator.with_openai(
    generator_llm="gpt-4o",
    critic_llm="gpt-4o",          # critic filters low-quality questions
    embeddings="text-embedding-3-small",
)

testset = generator.generate_with_langchain_docs(
    documents,
    test_size=200,
    distributions={
        simple:        0.5,    # 50% direct factual questions
        reasoning:     0.25,   # 25% multi-hop reasoning
        multi_context: 0.25,   # 25% require multiple chunks
    },
    with_debugging_logs=True,
)

df = testset.to_pandas()
df.to_csv("eval_dataset_v1.csv", index=False)

# Schema: question | contexts | ground_truth | evolution_type | metadata
# IMPORTANT: Human review minimum 10% of samples before using as ground truth

Eval set contamination: Never tune prompts or hyperparameters against your primary eval set — that's the test set, not the validation set. Use a three-way split: dev set for iteration, validation set for go/no-go decisions on each PR, test set for quarterly benchmarks and external reporting only. Contamination makes your metrics meaningless.
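
One way to keep the three splits stable as the dataset grows is to assign each example by hashing its ID, so an example never migrates between splits. This is a sketch; the 60/20/20 proportions and the ID format are hypothetical choices, not values from this handbook.

```python
import hashlib


def assign_split(example_id: str, dev: float = 0.6, validation: float = 0.2) -> str:
    """Deterministically map an example ID to 'dev', 'validation', or 'test'."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < dev:
        return "dev"
    if bucket < dev + validation:
        return "validation"
    return "test"


# Hypothetical IDs; the assignment depends only on the ID, not on dataset order
example_ids = [f"query-{i}" for i in range(1000)]
splits = {example_id: assign_split(example_id) for example_id in example_ids}

counts = {name: list(splits.values()).count(name) for name in ("dev", "validation", "test")}
print(counts)
```

Because the assignment is a pure function of the ID, adding new examples later never reshuffles old ones into the test split, which is one of the easier ways contamination creeps in.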

Regression Testing

// PROVING IMPROVEMENT MATHEMATICALLY

A higher score on a new version doesn't prove improvement unless it's statistically significant. Random variation in LLM outputs and in the LLM judge means small deltas are noise. Use statistical tests to distinguish signal from variance.

Paired t-test — Is the score difference significant?

t = (x̄_diff) / (s_diff / √n) where x̄_diff = mean(scoreᵢ_new − scoreᵢ_old)

Use paired t-test (same queries, two versions) for continuous metrics (faithfulness, relevance). p < 0.05 with n ≥ 50 is a reasonable threshold. Use Wilcoxon signed-rank test for non-normal distributions.

McNemar's Test — Pass/fail metric comparisons (binary)

χ² = (b − c)² / (b + c) where b = v1 pass & v2 fail count, c = v1 fail & v2 pass count

Use for binary metrics (passed threshold / failed threshold). Compares two classifiers on the same test set. χ² > 3.84 → p < 0.05 for 1 degree of freedom.
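
A minimal sketch of the statistic as defined above, without the continuity correction some references apply. The pass/fail lists are fabricated toy data chosen to make the arithmetic easy to follow.

```python
def mcnemar_chi2(baseline_pass: list[bool], new_pass: list[bool]) -> float:
    """McNemar chi-square from paired pass/fail outcomes on the same queries."""
    # b: passed on baseline, failed on new version; c: failed on baseline, passed on new
    b = sum(1 for old, new in zip(baseline_pass, new_pass) if old and not new)
    c = sum(1 for old, new in zip(baseline_pass, new_pass) if not old and new)
    if b + c == 0:
        return 0.0  # the two versions never disagree
    return (b - c) ** 2 / (b + c)


# Toy data: 3 newly broken, 15 newly fixed, 42 unchanged
baseline_pass = [True] * 3 + [False] * 15 + [True] * 30 + [False] * 12
new_pass = [False] * 3 + [True] * 15 + [True] * 30 + [False] * 12

chi2 = mcnemar_chi2(baseline_pass, new_pass)
print(chi2, chi2 > 3.84)  # 8.0 True
```

Only the discordant pairs (b and c) enter the statistic; the 42 queries where both versions agree carry no information about which version is better. When b + c is small, an exact binomial test on the discordant pairs is the safer choice.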

Python — Statistical Regression Test

from scipy import stats
import numpy as np
import pandas as pd

def check_regression(
    scores_baseline: list[float],
    scores_new: list[float],
    metric_name: str,
    alpha: float = 0.05,
    regression_threshold: float = 0.03,   # flag if the mean drops by more than 0.03 (absolute)
    improvement_threshold: float = 0.03,  # flag if the mean rises by more than 0.03 (absolute)
) -> dict:
    # Paired t-test (same questions, both versions)
    t_stat, p_value = stats.ttest_rel(scores_new, scores_baseline)
    delta = np.mean(scores_new) - np.mean(scores_baseline)
    significant = p_value < alpha

    result = {
        "metric":      metric_name,
        "baseline":    np.mean(scores_baseline),
        "new":         np.mean(scores_new),
        "delta":       delta,
        "p_value":     p_value,
        "significant": significant,
        "verdict":     "UNKNOWN",
    }

    if significant and delta < -regression_threshold:
        result["verdict"] = "❌ REGRESSION — block merge"
    elif significant and delta > improvement_threshold:
        result["verdict"] = "✅ IMPROVEMENT — safe to ship"
    elif not significant:
        result["verdict"] = "⚪ NO SIGNIFICANT CHANGE"
    else:
        result["verdict"] = "🟡 MARGINAL — review manually"

    return result

# Example output:
# metric: faithfulness | baseline: 0.841 | new: 0.891 | delta: +0.050
# p_value: 0.0031 | significant: True | verdict: ✅ IMPROVEMENT

CI/CD Integration

// EVAL GATES THAT BLOCK REGRESSIONS FROM SHIPPING

Eval as a CI gate means every pull request that touches prompts, retriever configuration, embedding models, or chunking logic must pass a defined eval threshold before it can merge. This transforms eval from an ad-hoc activity into a quality guarantee.

YAML — GitHub Actions Eval Pipeline

name: LLM Eval Gate

on:
  pull_request:
    paths:
      - 'src/prompts/**'
      - 'src/retriever/**'
      - 'config/rag_config.yaml'
      - 'requirements.txt'

jobs:
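  eval:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    permissions:
      contents: read          # needed by actions/checkout
      pull-requests: write    # lets the final step post the results comment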

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with: { python-version: '3.11' }

      - name: Install dependencies
        run: pip install -r requirements-eval.txt

      # Fast smoke tests — run on every PR (< 5 min, 30 queries)
      - name: DeepEval smoke test
        run: deepeval test run tests/eval/smoke/ --exit-on-first-failure
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

      # Full regression suite — run only if smoke passes (< 20 min, 200 queries)
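      - name: Full RAGAS regression evaluation
        run: |
          python scripts/run_ragas_eval.py \
            --dataset eval_data/validation_set_v3.json \
            --baseline-version ${{ github.base_ref }} \
            --output results/ragas_${{ github.sha }}.json
        env:
          # env is scoped per step, so the judge-model key must be repeated here
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}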

      - name: Check regression thresholds
        run: |
          python scripts/check_thresholds.py \
            --results results/ragas_${{ github.sha }}.json \
            --thresholds config/eval_thresholds.yaml
        # Exits non-zero if any metric regresses > threshold → blocks merge

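      - name: Comment results on PR
        uses: actions/github-script@v7
        if: always()
        with:
          script: |
            const fs = require('fs');
            const path = 'results/ragas_${{ github.sha }}.json';
            if (!fs.existsSync(path)) return;  // an earlier step failed, nothing to report
            const results = JSON.parse(fs.readFileSync(path, 'utf8'));
            // Assumes results is a flat { metric: score } object; adapt to your own schema
            const rows = Object.entries(results).map(([k, v]) => `| ${k} | ${v} |`);
            const body = ['| Metric | Score |', '|---|---|', ...rows].join('\n');
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body,
            });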

YAML — Eval Thresholds Config

# config/eval_thresholds.yaml
# HARD BLOCK: metric must not drop below absolute floor
absolute_floors:
  faithfulness:        0.80    # never ship below 80% faithfulness
  answer_relevancy:    0.75
  context_precision:   0.70
  context_recall:      0.70

# HARD BLOCK: lower-is-better metric must not rise above absolute ceiling
absolute_ceilings:
  toxicity_rate:       0.02   # max 2% toxic outputs

# REGRESSION BLOCK: metric must not worsen by more than this absolute amount vs baseline
# (quality scores: maximum drop; latency and cost: maximum increase)
regression_tolerances:
  faithfulness:        0.03   # 0.03-point (3 percentage point) drop tolerated
  answer_relevancy:    0.04
  context_precision:   0.05
  context_recall:      0.05
  p50_latency_ms:      200    # 200 ms latency increase tolerated
  cost_per_query_usd:  0.002  # $0.002 cost increase tolerated

# NOTIFICATION ONLY (warn but don't block)
soft_warnings:
  ndcg_at_5:           0.01   # 1% retrieval degradation triggers Slack alert
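The workflow calls `scripts/check_thresholds.py` without showing it. The sketch below is one plausible shape for its core logic, not the handbook's actual script: the function name `check_thresholds`, its arguments, and the sample numbers are all assumptions. It takes plain dictionaries (so loading the YAML and JSON files is left out) and treats any metric named in `lower_is_better` as a ceiling rather than a floor.

```python
def check_thresholds(new, baseline, limits, tolerances, lower_is_better=frozenset()):
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    # Absolute limits: a floor for quality scores, a ceiling for lower-is-better metrics
    for metric, limit in limits.items():
        value = new[metric]
        if metric in lower_is_better:
            if value > limit:
                failures.append(f"{metric}: {value:.3f} is above the ceiling {limit:.3f}")
        elif value < limit:
            failures.append(f"{metric}: {value:.3f} is below the floor {limit:.3f}")
    # Regression limits: how much worse than the baseline a metric may get
    for metric, tolerance in tolerances.items():
        change = new[metric] - baseline[metric]
        worsening = change if metric in lower_is_better else -change
        if worsening > tolerance:
            failures.append(f"{metric}: worsened by {worsening:.3f}, tolerance is {tolerance:.3f}")
    return failures


# Illustrative numbers only
new_scores = {"faithfulness": 0.78, "answer_relevancy": 0.80, "toxicity_rate": 0.01, "p50_latency_ms": 1500}
base_scores = {"faithfulness": 0.84, "answer_relevancy": 0.81, "toxicity_rate": 0.01, "p50_latency_ms": 1250}
for failure in check_thresholds(
    new_scores,
    base_scores,
    limits={"faithfulness": 0.80, "answer_relevancy": 0.75, "toxicity_rate": 0.02},
    tolerances={"faithfulness": 0.03, "answer_relevancy": 0.04, "p50_latency_ms": 200},
    lower_is_better={"toxicity_rate", "p50_latency_ms"},
):
    print(failure)
```

In CI, the wrapping script would exit non-zero whenever the returned list is non-empty, which is what turns the check into a merge gate.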

Dashboards & Alerting

// PRODUCTION QUALITY MONITORING

Offline eval catches regressions before release. Online monitoring catches degradation after release — distribution shift, new query types, seasonal patterns, upstream model changes. Both are required for a production LLM system.

Key Production Metrics to Track

Rolling faithfulness score (1-hour window) — primary quality signal
Null-answer rate — share of "I don't know" responses; a rising rate usually points to retrieval failure
User negative feedback rate — thumbs-down as proxy signal
P95 latency per component — embed, retrieve, generate separately
Token cost per query — prompt + completion token spend
Context utilization — % of retrieved context referenced in answer

Alerting Thresholds

P0 — Immediate page: Faithfulness < 0.65 over 30-min window
P1 — Slack alert: Faithfulness drops 15% vs 7-day rolling baseline
P2 — Daily digest: Cost per query increases >20% vs prior week
Weekly review: RAGAS full eval run against validation set, trend chart
Monthly: Human eval sample of 50 queries, calibrate automated metrics
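The P0 and P1 rules above can be sketched as a rolling window plus a small classifier. This is a standard-library illustration, not a production alerting system: the names `RollingMean` and `alert_level` are hypothetical, and it reads "drops 15%" as a relative drop against the 7-day baseline, which is an assumption.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class RollingMean:
    """Mean of the scores whose timestamps fall inside a sliding time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._points: deque = deque()  # (timestamp, score), oldest first

    def add(self, ts: datetime, score: float) -> None:
        self._points.append((ts, score))

    def mean(self, now: datetime) -> Optional[float]:
        cutoff = now - self.window
        # Drop points that have aged out of the window
        while self._points and self._points[0][0] < cutoff:
            self._points.popleft()
        if not self._points:
            return None
        return sum(score for _, score in self._points) / len(self._points)


def alert_level(window_mean, baseline_mean, p0_floor=0.65, p1_relative_drop=0.15):
    """Return 'P0', 'P1', or None for a faithfulness reading."""
    if window_mean is None:
        return None  # no traffic in the window; handle separately
    if window_mean < p0_floor:
        return "P0"
    if baseline_mean is not None and window_mean < baseline_mean * (1 - p1_relative_drop):
        return "P1"
    return None


# Illustrative usage: a 30-minute window fed with per-request faithfulness scores
start = datetime(2026, 7, 1, 12, 0)
recent = RollingMean(timedelta(minutes=30))
recent.add(start, 0.5)
recent.add(start + timedelta(minutes=10), 0.7)
recent.add(start + timedelta(minutes=40), 0.9)
print(alert_level(recent.mean(start + timedelta(minutes=40)), baseline_mean=0.85))
```

Checking the absolute floor before the relative drop means a severe outage pages immediately instead of being reported as a softer P1.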

Python — Phoenix OpenTelemetry Instrumentation

import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openinference.instrumentation.langchain import LangChainInstrumentor

# Start Phoenix server (or connect to hosted Arize Phoenix)
session = px.launch_app()   # starts at http://localhost:6006

# Register OpenTelemetry tracer — instruments all LLM calls automatically
tracer_provider = register(
    project_name="product-qa-rag-prod",
    endpoint="http://localhost:4317",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# From now on, every LLM call is automatically traced.
# No code changes to your application required.
#
# Phoenix captures:
#   - Full prompt + completion text
#   - Retrieved chunks with scores
#   - Latency per span (embed / retrieve / generate)
#   - Token counts and estimated cost
#   - User feedback (if you add feedback spans)

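# Run eval on traced data: evaluate a production sample.
# Helper names follow the arize-phoenix evals API; verify against your installed version.
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

client = px.Client()
judge = OpenAIModel(model="gpt-4o")  # example judge model; each evaluator requires one

# Hallucination and QA score one row per query (columns: input, output, reference)
queries_df = get_qa_with_reference(client, project_name="product-qa-rag-prod")
hallucination_eval, qa_eval = run_evals(
    dataframe=queries_df,
    evaluators=[HallucinationEvaluator(judge), QAEvaluator(judge)],
    provide_explanation=True,
)

# Relevance scores one row per retrieved document, so it runs on its own dataframe
docs_df = get_retrieved_documents(client, project_name="product-qa-rag-prod")
(relevance_eval,) = run_evals(dataframe=docs_df, evaluators=[RelevanceEvaluator(judge)], provide_explanation=True)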

Agentic & Tool-Use Evaluation

// 2026 — WHEN THE OUTPUT IS A TRAJECTORY, NOT A STRING

Through 2025 and into 2026, the center of gravity in LLM evaluation shifted from single-turn RAG QA to multi-step, tool-using agents — coding agents, browser agents, and autonomous loops running for hours without a human in the middle. This breaks the evaluation model from earlier sections: the object being scored is no longer a string, it's a trajectory — a sequence of tool calls, intermediate reasoning, and state changes. An agent can reach the right final answer through an unreliable path, which is a production liability even when the output looks correct.

Goal Completion (Outcome)

Did the agent achieve the stated objective? Scored as binary pass/fail, or as a graded score, against a verifiable end-state: "all tests pass," "ticket closed with correct resolution," "booking confirmed with correct itinerary." This is the metric most public leaderboards report.

Trajectory Accuracy (Process)

Did the agent take a sound reasoning and tool-call path to get there? Measures unnecessary tool calls, policy violations mid-task, and recovery behavior after an error. An agent that stumbles into the right answer via a fragile path will fail the next similar task.

The Benchmark Landscape

Benchmark	Domain	What It Measures	Notes
GAIA	General assistant tasks	466 real-world questions requiring reasoning, multimodality, web browsing, and tool use	From Meta/HuggingFace/AutoGPT authors; scores vary 45–92% across leaderboards depending on scaffolding allowed
τ-bench (tau-bench)	Multi-turn service agents	375+ multi-turn tasks across retail/airline domains with policy-compliance checks; pass^k reliability (success on all k repeated trials of a task)	Princeton/Sierra; industry reference for conversational agent reliability
SWE-bench (Pro / Live)	Software engineering	Real GitHub issues resolved end-to-end with execution-based testing	Contamination-resistant variants (Pro, Live) emerged from 2025 onward as the original saturated
BFCL	Function / tool calling	Berkeley Function-Calling Leaderboard — accuracy of structured tool-call generation across languages	De facto standard for scoring raw function-calling reliability
Terminal-Bench / OSWorld	Computer & GUI use	Command-line and desktop GUI automation tasks completed end-to-end	Closest proxy for "can this agent actually operate a computer"
MCP-Atlas	Tool coordination	Coordination accuracy across multiple MCP servers in a single task	Emerged alongside MCP's adoption as the connector standard
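τ-bench's reliability metric is pass^k ("pass hat k"): the probability that an agent succeeds on all k independent attempts at the same task. It is stricter than the pass@k used in code generation, which asks whether at least one of k attempts succeeds. With n trials per task and c successes, the per-task estimate is C(c, k) / C(n, k), averaged over tasks. The sketch below assumes you already have a success count per task; the function name `pass_hat_k` and the sample counts are illustrative.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k randomly chosen trials of a task all succeed."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count must be between 0 and n_trials")
        # comb(c, k) is 0 when c < k, so a task with too few successes contributes nothing
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


# Illustrative: three tasks, four trials each, with 4, 2, and 0 successes
for k in (1, 2, 4):
    print(k, round(pass_hat_k([4, 2, 0], n_trials=4, k=k), 3))
```

pass^1 is the ordinary success rate; the speed at which the number falls as k grows is a direct read on how consistent the agent is.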

⚠

The lab-to-production gap is real and large. 2026 field data on enterprise agentic deployments shows roughly a 37-point gap between published benchmark scores and real-world task success, with up to 50× cost variance between agents scoring similarly on the same benchmark. Treat public leaderboard numbers as an upper bound on what your production configuration — your prompts, your tools, your context — will actually achieve. Always run your own held-out task suite before trusting a benchmark score.

Scoring an Agent Trajectory

Python — Trajectory + Outcome Scoring Pattern

# Dual-metric pattern: score the destination AND the path taken.
# Frameworks like Inspect AI, DeepEval's agentic metrics, and custom
# harnesses (Hermes-style loops) converge on this shape in 2026.

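def score_agent_run(trajectory: list[dict], expected_goal: str) -> dict:
    # trajectory = ordered list of {tool, args, result, reasoning, tokens, cost} steps.
    # verify_goal_state, count_redundant_tool_calls, check_policy_compliance, and
    # score_error_recovery are placeholders for your own harness-specific checks;
    # the two counting helpers are assumed to return integer counts.
    if not trajectory:
        raise ValueError("trajectory is empty: nothing to score")

    # 1. Goal completion: deterministic check against the verifiable end state
    goal_completion = verify_goal_state(trajectory[-1]["result"], expected_goal)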

    # 2. Trajectory accuracy — did it take a defensible path?
    redundant_calls   = count_redundant_tool_calls(trajectory)
    policy_violations = check_policy_compliance(trajectory)   # e.g. forbidden tools used
    recovery_quality  = score_error_recovery(trajectory)      # graceful retry vs. thrashing

    # 3. Efficiency: turns, tokens, and $ cost to reach the goal (add wall-clock if your steps log timestamps)
    efficiency = {
        "turns": len(trajectory),
        "tokens": sum(s["tokens"] for s in trajectory),
        "cost_usd": sum(s["cost"] for s in trajectory),
    }

    return {
        "goal_completion": goal_completion,          # bool — the leaderboard number
        "trajectory_ok": redundant_calls == 0 and policy_violations == 0,
        "recovery_quality": recovery_quality,        # 0-1 — did it self-correct cleanly?
        "efficiency": efficiency,
        # A run can be goal_completion=True and trajectory_ok=False —
        # that's the case that hides in aggregate pass-rate dashboards.
    }
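The helpers in the pattern above are left undefined. As one hypothetical example, `count_redundant_tool_calls` could treat a step as redundant when the same tool is called again with identical arguments, and a small aggregate can surface the "right answer, bad path" runs that a plain pass rate hides. Both functions below are illustrative assumptions, including the name `lucky_pass_rate`; repeated identical calls are sometimes legitimate (polling, retry after a transient error), so a real harness would need an allowlist.

```python
import json


def count_redundant_tool_calls(trajectory: list[dict]) -> int:
    """Count steps that repeat an earlier (tool, args) pair exactly."""
    seen = set()
    redundant = 0
    for step in trajectory:
        # Serialize args with sorted keys so dict ordering does not matter
        key = (step["tool"], json.dumps(step.get("args", {}), sort_keys=True, default=str))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant


def lucky_pass_rate(run_scores: list[dict]) -> float:
    """Among runs that reached the goal, the share whose trajectory was not OK."""
    passed = [run for run in run_scores if run["goal_completion"]]
    if not passed:
        return 0.0
    return sum(1 for run in passed if not run["trajectory_ok"]) / len(passed)


# Illustrative trajectory: the second search repeats the first one exactly
steps = [
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "search", "args": {"q": "refund policy"}},
    {"tool": "read", "args": {"id": 1}},
    {"tool": "search", "args": {"q": "refund window"}},
]
print(count_redundant_tool_calls(steps))
```

Reporting the lucky-pass share next to the headline pass rate keeps fragile successes visible instead of averaging them away.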

▸

Don't let the agent grade itself. The same self-assessment problem from Section 06 applies with higher stakes here — an implementer agent is a structurally poor judge of its own multi-step work. Pair every agentic loop with a separate verifier: a deterministic gate (test suite exit code, compiler, policy checker) wherever one exists, and a different-model-family reviewer only where a deterministic gate is impossible. See the Loop Engineering handbook for the full trigger → execute → verify → memory pattern this feeds into.
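A deterministic gate can be as simple as running a command and trusting only its exit code. The sketch below assumes the verifier is a command-line check such as a test suite; the function names `deterministic_gate` and `accept_agent_work` are hypothetical, and the commands shown are stand-ins that simply exit with a chosen code.

```python
import subprocess
import sys


def deterministic_gate(command: list[str], timeout_s: float = 600.0) -> dict:
    """Run a verifier command; the run passes only if it exits with code 0."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"passed": False, "exit_code": None, "reason": "timeout"}
    except FileNotFoundError:
        return {"passed": False, "exit_code": None, "reason": "verifier not found"}
    return {"passed": proc.returncode == 0, "exit_code": proc.returncode, "reason": "exit code"}


def accept_agent_work(agent_claims_success: bool, verifier_command: list[str]) -> bool:
    """The agent's own claim is recorded elsewhere if useful, but never decides acceptance."""
    return deterministic_gate(verifier_command)["passed"]


# Stand-in verifiers: one exits 0, one exits 3 (in practice, your test suite or policy checker)
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
print(accept_agent_work(True, passing), accept_agent_work(True, failing))
```

Treating a timeout or a missing verifier as a failure keeps the gate fail-closed: the agent's work is accepted only when the check actually ran and passed.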

Eval Maturity Model

// WHERE ARE YOU, AND WHAT TO BUILD NEXT

Eval infrastructure matures progressively. Organizations that try to build everything at once ship nothing. Start with manual evaluation, add offline automation, then layer production monitoring. Each stage enables the next.

Stage	Name	Capabilities	Primary Tooling	KPI
01	Ad-hoc	Manual review of outputs. "Looks good to me." No reproducibility. No tracking.	Spreadsheet, notebooks	Vibes
02	Benchmarked	Labeled eval set exists. Can run RAGAS offline. Track scores per version manually.	RAGAS, CSV, notebooks	RAGAS score per version
03	Automated	Eval runs in CI on every PR. Merge blocked on regression. Scores tracked in DB.	DeepEval + GH Actions + TruLens	Zero regressions shipped
04	Monitored	Production trace monitoring. Online eval on sampled traffic. Automated alerts.	Phoenix, LangSmith, Arize	MTTD < 30 min for quality drops
05	Self-improving	Automated failure analysis. Eval data mined from production disagreements. Feedback loop to training data curation.	Custom + RLHF pipeline	Weekly eval score trend > 0

▸

Reference implementations & further reading: github.com/explodinggradients/ragas · github.com/truera/trulens · github.com/confident-ai/deepeval · github.com/promptfoo/promptfoo · github.com/Arize-ai/phoenix · BEIR benchmark: github.com/beir-cellar/beir · MTEB (embedding model leaderboard): huggingface.co/spaces/mteb/leaderboard · HELM: crfm.stanford.edu/helm · Garak (LLM safety eval): github.com/leondz/garak · τ-bench: github.com/sierra-research/tau-bench · GAIA leaderboard: huggingface.co/spaces/gaia-benchmark/leaderboard · Inspect AI (agent eval harness): github.com/UKGovernmentBEIS/inspect_ai