Field Handbook · Production AI Systems
Harness
Engineering
"Prompt engineering was 2023. Harness engineering is 2026."
The complete operational guide to building reliable AI systems — covering tool orchestration, multi-layer guardrails, error recovery, feedback loops, observability, and human oversight. Goes beyond individual model calls to the production infrastructure that makes AI agents dependable.
Tool Orchestration
Multi-Agent
Observability
Guardrails
Error Recovery
Human-in-Loop
Harness engineering is the discipline of designing the execution environment, tooling infrastructure, and orchestration logic that surrounds a language model. It is everything that is not the model itself — context assembly, tool dispatch, memory management, guardrails, retry logic, state tracking, output validation, and observability.
The harness converts a probabilistic text generator into a reliable, auditable, goal-directed system. Without a harness, an LLM is a demo. With a production-grade harness, the same model becomes infrastructure.
- Model calls are stateless — no continuity across turns
- Errors surface as hallucinated "success" responses
- Tools invoked with no permission scoping
- No retry, no fallback, no circuit breaking
- Unlimited token burn — no cost guardrails
- Invisible — no tracing, no audit, no replay
- Human oversight requires manual monitoring
- Structured state handoffs between context windows
- Deterministic validation of every output
- Tools scoped to minimum required permissions
- Retry with backoff, fallback strategies, circuit breakers
- Token budgets, rate limits, cost alerting
- Full distributed trace: every decision auditable
- Policy-defined human approval gates
▸
Key insight from 2025 benchmarks: Improving the harness on the same model consistently outperforms switching to a more capable model on real production workloads. The scaffolding is the system — the model is just a component. The same Opus 4.7 running inside different harnesses produces dramatically different reliability profiles.
Three converging forces have made harness engineering the dominant discipline for production AI teams in 2026: models are reasoning-capable by default (reducing prompt ROI), agents are now taking consequential real-world actions (raising the cost of failure), and multi-agent topologies mean errors compound across parallel execution paths.
Prompt ROI Collapse Driver
Stanford HAI (late 2025): marginal returns on prompt optimization have flattened as frontier models reason by default. The leverage has moved entirely to execution infrastructure — context assembly, tool selection, and output validation.
Consequential Actions Driver
Agents now write to databases, push to production branches, send emails, submit PRs, and provision cloud resources. Errors are no longer correctable by re-reading the chat. A harness-less agent with shell access is an uncontrolled blast radius.
Multi-Agent Compounding Driver
Ten agents running in parallel each making small errors creates cascading failures that are nearly impossible to debug post-hoc. The harness provides the isolation, state boundaries, and inter-agent contracts that prevent compound failure.
⚠
OWASP LLM06:2025 — Excessive Agency: The most dangerous harness anti-pattern is over-provisioning. An agent given filesystem write access, network egress, and process execution rights — when it only needs to run tests — is an amplified attack surface. OWASP classifies unnecessary permissions as a top-tier LLM risk. The harness is the enforcement layer that limits what an agent can actually do versus what it thinks it can do.
Every production AI harness — regardless of the model or framework — is composed of three concentric layers. Most teams in 2025 only built Layer 1. Teams shipping reliable production AI in 2026 engineer all three deliberately.
💬
Layer 1: Model Interface
Prompt construction, context assembly, response parsing, token management
⚙️
Layer 2: Runtime Environment
Tool definitions, memory stores, input validation, output guardrails, context window management
🌐
Layer 3: Orchestration
Agent loops, task decomposition, conditional branching, human approval gates, parallel execution, state handoffs
Agent Execution Loop (Plan–Execute–Verify)
→
Plan
Decompose + Select Tools
→
Guardrail
Input Validation
→
→
→
Feedback
Pass / Fix / Escalate
Tools give an agent capabilities beyond text generation — web search, code execution, database queries, file operations, API calls. The harness is responsible for defining the tool registry, enforcing permission scopes, handling tool failures, and composing multi-tool workflows into coherent agent actions.
▸
Start with 3–5 well-defined tools. A lean, well-scoped tool registry outperforms a broad, loosely defined one. Each tool should have: a clear natural-language description for model selection, typed input/output schema, explicit permission scope, timeout budget, and a documented fallback behavior.
Tool Registry Design
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Any
import anthropic
class PermissionLevel(Enum):
READ_ONLY = "read_only" # no side effects
WRITE = "write" # modifies state, reversible
DESTRUCTIVE = "destructive" # requires human approval
@dataclass
class HarnessTool:
name: str
description: str
schema: dict # JSON Schema for inputs
handler: Callable
permission: PermissionLevel
timeout_s: int = 30
max_retries: int = 3
# Tool registry — define capabilities explicitly
TOOL_REGISTRY: dict[str, HarnessTool] = {
"read_file": HarnessTool(
name="read_file",
description="Read the contents of a file. Only files under ./src/ are accessible.",
schema={
"type": "object",
"properties": {
"path": {"type": "string", "pattern": "^\\./src/"}
},
"required": ["path"]
},
handler=handle_read_file,
permission=PermissionLevel.READ_ONLY,
),
"run_tests": HarnessTool(
name="run_tests",
description="Execute the test suite. Returns exit code, stdout, and stderr.",
schema={"type": "object", "properties": {
"test_path": {"type": "string", "default": "tests/"}
}},
handler=handle_run_tests,
permission=PermissionLevel.READ_ONLY,
timeout_s=120,
),
"write_file": HarnessTool(
name="write_file",
description="Write content to a file. Restricted to ./src/. Creates backup before writing.",
schema={
"type": "object",
"properties": {
"path": {"type": "string", "pattern": "^\\./src/"},
"content": {"type": "string"}
},
"required": ["path", "content"]
},
handler=handle_write_file,
permission=PermissionLevel.WRITE,
),
"delete_file": HarnessTool(
name="delete_file",
description="Delete a file. REQUIRES human approval before execution.",
schema={"type": "object", "properties": {
"path": {"type": "string"}
}},
handler=handle_delete_file,
permission=PermissionLevel.DESTRUCTIVE, # always gate to human
),
}
Multi-Tool Workflow: Write–Test–Fix Loop
import anthropic, json, time
from harness import TOOL_REGISTRY, PermissionLevel, require_human_approval
client = anthropic.Anthropic()
def run_agent_loop(task: str, max_iterations: int = 20) -> str:
"""
Bounded Plan–Execute–Verify loop with tool dispatch.
Exits on: task completion, max iterations, or unrecoverable error.
"""
messages = [{"role": "user", "content": task}]
tool_defs = [t_to_anthropic_schema(t) for t in TOOL_REGISTRY.values()]
iteration = 0
while iteration < max_iterations:
iteration += 1
# ── Model call ──────────────────────────────────────────
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=8192,
tools=tool_defs,
messages=messages,
)
# ── Stop conditions ─────────────────────────────────────
if response.stop_reason == "end_turn":
return extract_text(response) # task complete
# ── Tool use ────────────────────────────────────────────
tool_results = []
for block in response.content:
if block.type != "tool_use":
continue
tool = TOOL_REGISTRY.get(block.name)
if not tool:
tool_results.append(tool_error(block.id, f"Unknown tool: {block.name}"))
continue
# ── Permission gate ─────────────────────────────────────
if tool.permission == PermissionLevel.DESTRUCTIVE:
approved = require_human_approval(block.name, block.input)
if not approved:
tool_results.append(tool_error(block.id, "Action rejected by operator"))
continue
# ── Execute with timeout + retry ────────────────────────
result = execute_with_retry(tool, block.input)
tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})
messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": tool_results})
raise HarnessError(f"Max iterations ({max_iterations}) reached without completion")
def execute_with_retry(tool, inputs, attempt=0):
try:
result = run_with_timeout(tool.handler, inputs, tool.timeout_s)
return json.dumps(result)
except TimeoutError:
return json.dumps({"error": "timeout", "tool": tool.name, "timeout_s": tool.timeout_s})
except Exception as e:
if attempt < tool.max_retries:
time.sleep(2 ** attempt) # exponential backoff
return execute_with_retry(tool, inputs, attempt + 1)
return json.dumps({"error": str(e), "final_attempt": True})
Guardrails are the deterministic rules that prevent agents from taking harmful, unauthorized, or out-of-scope actions. Unlike model-level safety (stochastic), harness guardrails are deterministic and composable. They operate at five distinct intercept points: before input reaches the model, during tool invocation, on output before delivery, at the session boundary, and at the infrastructure level.
🔍
Input Rails
Prompt injection detection, content classification, scope validation
⛓️
Dialog Rails
Conversation flow control, topic boundaries, CoT steering
🔧
Execution Rails
Tool allow/deny lists, parameter validation, sandbox boundaries
📤
Output Rails
Schema validation, linting, test gate, PII scrubbing
💰
Cost Rails
Token budgets, rate limits, per-session cost caps
📁
Data Rails
File path allowlists, secret detection, data classification gates
▸
NVIDIA NeMo Guardrails pattern: Define rails using the Colang DSL at each intercept layer. The execution rail layer specifically governs what tools the LLM can invoke and what their inputs/outputs may contain — the reference for behavioral-level enforcement when static allow/deny lists are insufficient. More constraints yield more reliability, not less.
Guardrail Implementation: CLAUDE.md / AGENTS.md
# CLAUDE.md — Project Harness Configuration
# Loaded automatically at session start by Claude Code
## Project Context
This is a TypeScript Node.js API service. Tests use Vitest. Deployment via GitHub Actions.
## Allowed Actions
- Read and modify files under ./src/ and ./tests/
- Run: npm test, npm run lint, npm run typecheck
- Create new files following the naming convention: kebab-case.ts
## Prohibited Actions
- NEVER modify .env, .env.production, or any secrets file
- NEVER run npm publish, git push, or deployment scripts
- NEVER install new packages without confirming with the user first
- NEVER delete files — move to ./trash/ directory instead
- NEVER expose API keys, tokens, or credentials in any output
## Required Verification Steps
After any code change, you MUST:
1. Run npm run typecheck — zero errors required
2. Run npm test — all tests must pass
3. Run npm run lint — zero warnings for new code
## Output Format
- Always explain the change made and why
- List files modified
- Show test results summary
- Flag any remaining TODOs
## Escalation Triggers
Pause and ask the user before proceeding if:
- Any test failure that you cannot fix in 2 attempts
- A change affects more than 5 files
- You encounter an architectural decision not covered here
Output Validation Gate
import subprocess, re
from dataclasses import dataclass
@dataclass
class ValidationResult:
passed: bool
errors: list[str]
warnings: list[str]
def validate_code_output(code: str, file_path: str) -> ValidationResult:
"""
Multi-layer output validation before accepting agent-written code.
Each layer is deterministic — not reliant on the model.
"""
errors, warnings = [], []
# Layer 1: Secret detection (never commit secrets)
SECRET_PATTERNS = [
r'sk-[A-Za-z0-9]{32,}', # OpenAI / Anthropic keys
r'AKIA[A-Z0-9]{16}', # AWS access key
r'ghp_[A-Za-z0-9]{36}', # GitHub PAT
r'password\s*=\s*["\'][^"\']{8,}', # hardcoded password
]
for pattern in SECRET_PATTERNS:
if re.search(pattern, code, re.IGNORECASE):
errors.append(f"SECRET_DETECTED: pattern '{pattern}' found in output")
# Layer 2: Static analysis (TypeScript example)
if file_path.endswith('.ts'):
r = subprocess.run(['npx', 'tsc', '--noEmit', file_path], capture_output=True, text=True)
if r.returncode != 0:
errors.append(f"TYPE_ERROR: {r.stdout[:500]}")
# Layer 3: Lint
r = subprocess.run(['npx', 'eslint', file_path, '--format=json'], capture_output=True, text=True)
if r.returncode != 0:
warnings.append(f"LINT_WARNING: {r.stdout[:300]}")
# Layer 4: Unit tests (if applicable)
r = subprocess.run(['npm', 'test', '--', '--run'], capture_output=True, text=True, timeout=60)
if r.returncode != 0:
errors.append(f"TEST_FAILURE:\n{r.stdout[-800:]}")
return ValidationResult(passed=len(errors)==0, errors=errors, warnings=warnings)
Agent failures fall into three categories: transient (network timeout, rate limit), recoverable (test failure, type error, tool validation error), and unrecoverable (stuck in a loop, contradiction in task requirements, permission denied). Each category demands a different recovery strategy. The harness must distinguish between them and respond appropriately rather than retrying uniformly.
| Error Type | Examples | Recovery Strategy | Max Attempts |
| Transient |
API timeout, 429 rate limit, network flap |
Exponential backoff with jitter |
3 with backoff |
| Tool Failure |
Tool returns error, invalid output schema |
Return structured error to model; allow re-plan |
Model decides |
| Validation Failure |
Tests fail, typecheck fails, lint errors |
Pass error output back to model as feedback |
2–3 fix attempts |
| Loop Detection |
Same tool called 3x with same inputs |
Break loop, surface to human checkpoint |
1 detection → escalate |
| Budget Exceeded |
Token limit, cost cap, iteration cap hit |
Checkpoint state, pause, notify operator |
Hard stop |
| Unrecoverable |
Conflicting requirements, missing credentials |
Escalate to human with diagnosis |
Immediate |
from collections import Counter, defaultdict
from datetime import datetime, timedelta
class HarnessCircuitBreaker:
"""
Detects and breaks pathological execution patterns before they
exhaust tokens, loop infinitely, or cause runaway tool calls.
"""
def __init__(self):
self.tool_call_history = [] # (tool_name, input_hash, timestamp)
self.iteration_count = 0
self.fix_attempts = defaultdict(int) # tool -> consecutive failures
def record_tool_call(self, tool_name: str, inputs: dict) -> None:
self.tool_call_history.append((tool_name, hash(str(inputs)), datetime.now()))
def check_for_loops(self) -> tuple[bool, str]:
"""Detect identical tool+input pairs in recent history."""
recent = self.tool_call_history[-6:] # last 6 calls
call_counts = Counter((name, h) for name, h, _ in recent)
for (tool, h), count in call_counts.items():
if count >= 3:
return True, f"Loop detected: '{tool}' called {count}x with identical inputs"
return False, ""
def record_fix_failure(self, tool: str) -> bool:
"""Returns True if fix attempts exhausted — escalate to human."""
self.fix_attempts[tool] += 1
return self.fix_attempts[tool] >= 3
def reset_fix_counter(self, tool: str) -> None:
self.fix_attempts[tool] = 0 # success — reset counter
# ─── Recovery message injected back to model ─────────────────────────────
RECOVERY_PROMPT_TEMPLATE = """
The previous action failed. Here is the structured error:
ERROR TYPE: {error_type}
ERROR DETAIL:
{error_detail}
Attempt {attempt} of {max_attempts}.
{"This is your FINAL attempt. If you cannot fix this, respond with ESCALATE: ." if attempt >= max_attempts else ""}
Diagnose the root cause from the error output above and try a different approach.
"""
Feedback loops are the mechanism by which agent output drives the next agent action. The simplest and most effective feedback loop for coding agents is the write–test–fix cycle: the agent writes code, the harness runs tests, failure output is fed back as the next input. This tight loop converts test failures into self-correction signals without human involvement.
01
Write
Agent modifies code
02
Validate
Lint + typecheck (sync)
04
Evaluate
Pass / Fail / Partial
05
Feed Back
Inject error context
06
Fix or Escalate
Retry or → Human
Structured Feedback Injection
def build_feedback_context(validation: ValidationResult, attempt: int) -> str:
"""
Transforms raw validation output into structured feedback the model
can act on. Critical: include the exact error, not a summary of it.
Models fix errors they can see; they hallucinate fixes for errors they can't.
"""
lines = [f"=== VALIDATION FEEDBACK (attempt {attempt}) ==="]
if validation.errors:
lines.append("\n🔴 ERRORS (must fix before proceeding):")
for err in validation.errors:
lines.append(f" • {err}")
if validation.warnings:
lines.append("\n🟡 WARNINGS (fix if possible):")
for w in validation.warnings:
lines.append(f" • {w}")
lines.append("\nRequired action: Fix ALL errors above. Re-run validation after changes.")
lines.append(f"Attempts remaining: {3 - attempt}")
return "\n".join(lines)
# ─── Offline eval loop (run after session, improve harness) ──────────────
def run_offline_eval(harness_config: dict, eval_suite: list[dict]) -> dict:
"""
Replay agent tasks against a golden eval suite to measure harness quality.
Key metrics: task completion rate, iteration count, token cost, error rate.
Run in CI to catch harness regressions before production deployment.
"""
results = {"passed": 0, "failed": 0, "total_tokens": 0, "avg_iterations": 0}
for case in eval_suite:
outcome = run_agent_loop(case["task"], max_iterations=case.get("max_iter", 15))
passed = evaluate_outcome(outcome, case["expected"])
results["passed" if passed else "failed"] += 1
return results
✓
Write–test–fix loop quality: Agent frameworks reporting 80–90%+ task completion on SWE-bench benchmarks all implement a tight write–test–fix feedback loop. The loop is more valuable than model size. A Claude Sonnet in a well-instrumented feedback harness consistently outperforms Opus running without one on multi-step coding tasks.
You cannot improve what you cannot observe. AI harness observability requires three signal types: distributed traces (for step-by-step action reconstruction), structured metrics (for quantitative performance tracking), and cost attribution (for token and compute accountability). Unlike traditional software, AI observability must also capture reasoning transparency — the decisions the model made and why.
Distributed Traces
Capture every model call, tool invocation, and validation step as a span in a trace. OpenTelemetry is the standard. Each span should record: model, input tokens, output tokens, duration, tool name, inputs, outputs, and the parent span that triggered it.
Session Metrics
Track per-session: total iterations, tool calls by type, error/retry counts, total token cost, task completion status, human intervention count, and time-to-completion. These drive harness tuning over time.
Cost Attribution
Log input tokens, output tokens, model tier, and compute duration for every call. Attribute cost to task, session, user, and team. Cost spikes are the earliest signal of runaway loops or harness misconfiguration.
Structured Session Logging
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode
import json, time
tracer = trace.get_tracer("ai.harness")
class HarnessObserver:
"""
Wraps every harness action in an OTel span.
Integrates with Jaeger, Honeycomb, Datadog, or any OTel backend.
"""
def trace_model_call(self, session_id: str, messages: list, response) -> None:
with tracer.start_as_current_span("llm.call") as span:
span.set_attribute("session.id", session_id)
span.set_attribute("llm.model", response.model)
span.set_attribute("llm.input_tokens", response.usage.input_tokens)
span.set_attribute("llm.output_tokens", response.usage.output_tokens)
span.set_attribute("llm.stop_reason", response.stop_reason)
span.set_attribute("llm.tool_use", response.stop_reason == "tool_use")
# Cost attribution: input=$3/M, output=$15/M for Sonnet
cost_usd = (response.usage.input_tokens * 3 +
response.usage.output_tokens * 15) / 1_000_000
span.set_attribute("llm.cost_usd", cost_usd)
def trace_tool_call(self, tool_name: str, inputs: dict, result: str,
duration_ms: float, error: str = None) -> None:
with tracer.start_as_current_span(f"tool.{tool_name}") as span:
span.set_attribute("tool.name", tool_name)
span.set_attribute("tool.input_size", len(json.dumps(inputs)))
span.set_attribute("tool.duration_ms", duration_ms)
span.set_attribute("tool.success", error is None)
if error:
span.set_status(Status(StatusCode.ERROR, error))
span.record_exception(Exception(error))
def emit_session_summary(self, session: dict) -> None:
"""Structured log line — queryable in any log aggregation platform."""
log_line = {
"event": "harness.session.complete",
"session_id": session["id"],
"task": session["task"][:120],
"completed": session["completed"],
"iterations": session["iterations"],
"tool_calls": session["tool_calls"],
"errors": session["error_count"],
"hitl_events": session["human_interventions"],
"total_tokens": session["input_tokens"] + session["output_tokens"],
"cost_usd": round(session["cost_usd"], 4),
"duration_s": session["duration_s"],
"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
print(json.dumps(log_line)) # structured → Splunk / Datadog / CloudWatch
Key Metrics Dashboard: What to Track
| Metric | Target | Alert Threshold | Diagnosis |
| Task Completion Rate | >85% | <70% | Harness guardrails too restrictive, or model+task mismatch |
| Avg Iterations / Task | <8 | >15 | Feedback loop not informative; tool errors not being surfaced properly |
| Human Intervention Rate | <15% | >40% | Escalation thresholds miscalibrated or task scope too ambiguous |
| Tool Error Rate | <5% | >20% | Tool schema unclear; model misusing tools; infrastructure instability |
| Cost / Task (USD) | Baseline ±20% | >3× baseline | Loop detection failure; context not being compacted; model over-calling tools |
| P99 Latency | <60s / iteration | >120s | Tool timeouts; upstream API degradation; context window size issues |
Human-in-the-loop (HITL) is not a safety afterthought — it is an architectural primitive. The harness defines where and when human judgment is injected into the agent loop. The goal is not maximum oversight but right-placed oversight: humans at the decisions where autonomy risk is highest, automation everywhere else.
Mandatory HITL Gates Always
- Destructive operations (delete, drop table, terminate instance)
- External communications (send email, post to Slack, submit PR)
- Credential or secret access beyond scoped tools
- Changes to CI/CD pipelines or deployment configurations
- Any action affecting more than N files (configurable)
- Loop detection trigger — model stuck in retry cycle
- Unrecoverable errors after max fix attempts
Configurable HITL Thresholds Tunable
- File change scope: ask if modifying >5 files
- Cost gate: pause if session exceeds $X
- Architectural decisions outside CLAUDE.md scope
- Package installation / dependency changes
- Test coverage drops below threshold
- New external network connections introduced
- Off-hours execution of sensitive operations
HITL Approval Flow Implementation
from enum import Enum
import asyncio
class ApprovalOutcome(Enum):
APPROVED = "approved"
REJECTED = "rejected"
MODIFIED = "modified" # human edits the proposed action
ESCALATED = "escalated" # routed to senior approver
class HITLGate:
"""
Pauses agent execution, presents action + context to a human operator,
and resumes (or terminates) based on their decision.
Supports async approvals via Slack, web UI, or CLI.
"""
def __init__(self, notification_backend):
self.backend = notification_backend # Slack, PagerDuty, web webhook
self.pending = {} # request_id → asyncio.Future
async def request_approval(self,
action_type: str,
proposed_action: dict,
context: str,
timeout_s: int = 300) -> ApprovalOutcome:
request_id = generate_id()
future = asyncio.get_event_loop().create_future()
self.pending[request_id] = future
# Notify operator (non-blocking)
await self.backend.notify({
"request_id": request_id,
"action_type": action_type,
"proposed_action": proposed_action,
"context": context[:500],
"approve_url": f"https://harness.internal/approve/{request_id}",
"reject_url": f"https://harness.internal/reject/{request_id}",
})
try:
# Block agent loop until human responds or timeout
result = await asyncio.wait_for(future, timeout=timeout_s)
return result
except asyncio.TimeoutError:
# Timeout policy: default deny for destructive, allow for safe
return ApprovalOutcome.REJECTED # conservative default
finally:
self.pending.pop(request_id, None)
def resolve(self, request_id: str, outcome: ApprovalOutcome) -> None:
"""Called by the approval webhook when human responds."""
if request_id in self.pending:
self.pending[request_id].set_result(outcome)
# ─── Example: Slack notification for HITL approval ───────────────────────
SLACK_HITL_BLOCK = {
"blocks": [
{"type": "header", "text": {"type": "plain_text", "text": "🤖 Agent Action Requires Approval"}},
{"type": "section", "fields": [
{"type": "mrkdwn", "text": "*Action:*\nDelete database migration file"},
{"type": "mrkdwn", "text": "*Risk Level:*\n🔴 HIGH — irreversible"},
]},
{"type": "actions", "elements": [
{"type": "button", "text": {"type": "plain_text", "text": "✅ Approve"}, "style": "primary"},
{"type": "button", "text": {"type": "plain_text", "text": "❌ Reject"}, "style": "danger"},
]},
]
}
Claude Code is Anthropic's terminal-based agentic coding CLI with a built-in harness layer. It exposes three primary harness extension points: hooks (lifecycle callbacks at pre/post-tool execution), CLAUDE.md (session-scoped behavioral constraints), and MCP server integrations (tool capability extensions). Used together, these allow full harness engineering without writing custom agent scaffolding.
// .claude/settings.json — project-level harness configuration
{
"model": "claude-sonnet-4-6",
"env": {
"MAX_THINKING_TOKENS": "10000", // cap extended thinking
"CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "50" // compact at 50% context, not 95%
},
"hooks": {
// PRE-TOOL: validate all bash commands before execution
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [{
"type": "command",
"command": "./harness/validate-command.sh"
// Non-zero exit = BLOCK the command + return error to model
}]
}
],
// POST-TOOL: run linting after every file write
"PostToolUse": [
{
"matcher": "Write",
"hooks": [{
"type": "command",
"command": "./harness/post-write.sh"
// Runs: lint → typecheck → output to model as feedback
}]
}
],
// POST-TURN: log every session turn for observability
"PostTurn": [{
"type": "command",
"command": "./harness/log-turn.sh"
}]
},
"permissions": {
"allow": [
"Bash(npm run *)", // npm scripts: yes
"Bash(git status)", // git read: yes
"Bash(git diff *)",
"Read(src/**)", // src read: yes
"Write(src/**)" // src write: yes
],
"deny": [
"Bash(git push *)", // no remote push
"Bash(rm -rf *)", // no mass delete
"Bash(npm publish)", // no publish
"Bash(curl * | bash)", // no remote execution
"Read(.env*)" // no secrets access
]
}
}
#!/bin/bash
# harness/validate-command.sh
# Claude Code passes tool input via stdin as JSON.
# Exit 0 = allow. Exit 1 = BLOCK. Exit 2 = BLOCK + return message to model.
INPUT=$(cat) # Read JSON from stdin: {"command": "...", "description": "..."}
CMD=$(echo $INPUT | jq -r .command)
# Block patterns — never allow these regardless of context
BLOCKED=(
"rm -rf /"
"dd if=/"
"mkfs"
"chmod 777"
":(){:|:&};:" # fork bomb
"curl.*|.*bash"
"wget.*|.*sh"
)
for pattern in "${BLOCKED[@]}"; do
if echo "$CMD" | grep -qE "$pattern"; then
echo "BLOCKED: Command matches prohibited pattern: $pattern" >&2
exit 2 # Exit 2 = message returned to model
fi
done
# Log all commands for audit trail
echo "$(date -u +"%Y-%m-%dT%H:%M:%SZ") CMD: $CMD" >> logs/harness-audit.log
exit 0 # Allow command to proceed
⚠
MCP server risk: Don't enable all MCP servers at once. Each MCP server expands the agent's tool surface. Run npx ecc-agentshield scan (AgentShield, 2026) to audit your MCP config, hooks, and CLAUDE.md for injection risks, permission over-grants, and secret exposure patterns before production use.
GitHub Copilot and OpenAI Codex CLI each provide harness configuration primitives that parallel Claude Code's CLAUDE.md/hooks model. Copilot uses repository instruction files and organization-level policy. Codex uses AGENTS.md and sandbox permissions. Understanding the harness surface of each tool lets you apply the same disciplined engineering regardless of which agent is running.
AGENTS.md — Codex Harness Config
# AGENTS.md — OpenAI Codex Harness Configuration
# Equivalent to CLAUDE.md for the Codex CLI environment
## Project
Python FastAPI service. Tests: pytest. Linting: ruff. Type checking: mypy.
## Commands
# These are the ONLY commands the agent should run unsupervised
- Run tests: pytest tests/ -v --tb=short
- Lint: ruff check . --fix
- Type check: mypy src/ --strict
- Format: ruff format .
## Sandbox Permissions (network isolation)
# Codex runs in a network-isolated sandbox by default
network: disabled # prevent exfiltration, force offline operation
## Required Workflow
For EVERY code change, in this order:
1. Make targeted, minimal changes
2. Run mypy — zero new errors
3. Run ruff — auto-fix, then zero remaining issues
4. Run pytest — all existing tests must pass
5. Add tests for new behavior (minimum 1 test per new function)
## Boundaries
- Modify ONLY files under src/ and tests/
- Do NOT modify pyproject.toml or requirements.txt without asking
- Do NOT create new API endpoints without showing the design first
- Do NOT use subprocess in application code
## On Failure
If any check fails after 2 attempts: output the exact error and ask for guidance.
Do NOT try a third fix attempt independently.
Orchestration frameworks provide composable harness primitives — prompt templates, tool definitions, memory abstractions, and agent loop logic — so teams don't build the execution layer from scratch. Choice of framework shapes harness architecture. Pick based on production requirements, not demo simplicity.
| Framework | Model | Best For | Harness Maturity | Production Uses |
| LangGraph |
Graph / stateful |
Complex multi-step workflows with branching and cycles |
HIGH |
Klarna, Replit, Elastic |
| LangChain |
Chain / pipeline |
Rapid prototyping, broad integrations (100k+ stars) |
MEDIUM |
Widely deployed, vary in rigor |
| CrewAI |
Multi-agent crews |
Role-based agent teams with task delegation |
MEDIUM |
Enterprise automation pilots |
| AWS Bedrock AgentCore |
Runtime + gateway |
Full production stack: runtime, memory, identity, observability |
HIGH |
AWS-native enterprise |
| Custom + Anthropic SDK |
Direct API |
Maximum control, minimum abstraction overhead |
MAX (manual) |
High-reliability production systems |
▸
LangGraph v1.0 (October 2025): Graph-based stateful orchestration for complex agent workflows. Nodes are agent steps; edges are conditional routing logic; state is a typed dict persisted across the graph. First-class support for human-in-the-loop checkpoints, streaming, and tool-use error recovery. The reference framework for production multi-agent harnesses as of 2026.
The Model Context Protocol (MCP) is the emerging standard for tool interoperability — any MCP-compliant client (Claude Code, Copilot, LangChain) can consume any MCP-compliant tool server without bespoke integration. The Agent-to-Agent (A2A) protocol extends this to multi-agent topologies: orchestrator agents delegate to specialist subagents through a standardized calling convention.
Model Context Protocol (MCP) Standard
Defines the interface between model clients and tool servers. Tool servers expose: capabilities manifest, tool schemas, and a call endpoint. Clients invoke tools by name with validated inputs. The harness governs which MCP servers are loaded per session and what their permission scopes are.
Supported by: Claude Code, Copilot Agent HQ, LangChain, LangGraph, Cursor, and 200+ community servers.
Agent-to-Agent (A2A) 2026
Standardized protocol for orchestrator→subagent delegation. An orchestrator agent can spawn a specialist subagent (code reviewer, security scanner, documentation writer) as a structured API call. A2A ensures: typed task contracts, result schemas, error propagation, and cost attribution across agent boundaries.
Reference: IEEE CAI 2026 Tutorial on Engineering Trustworthy Multi-Agent Systems.
MCP Harness Configuration Example
// Only include servers needed for the current project.
// Each server = expanded attack surface. Audit with ecc-agentshield.
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "./src"],
// Scoped to ./src only — not the full filesystem
"description": "Read/write access to project source files only"
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "${GITHUB_TOKEN}"
// Never hardcode tokens — use env var references
}
},
"postgres": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-postgres",
"postgresql://readonly_user@localhost/mydb"],
// READ-ONLY database user — never give agents write access to prod DB
"description": "Read-only access to development database schema and data"
}
// Intentionally excluded: shell execution, network access, cloud control plane
}
}
Harness engineering maturity maps to the CISA ZTA model structure: four stages across five dimensions. Teams rarely advance all dimensions simultaneously. Start with guardrails and feedback loops (highest safety ROI), then add observability and HITL rigor, then optimize orchestration.
Stage 01
Ad Hoc
- No config files
- No permission scoping
- No feedback loop
- Manual testing only
- No cost visibility
- All-or-nothing iteration
Stage 02
Structured
- CLAUDE.md / AGENTS.md present
- Basic allow/deny rules
- Test feedback loop
- Manual HITL checkpoints
- Basic cost logging
- Iteration cap set
Stage 03
Production
- Multi-layer guardrails
- Structured error recovery
- OTel traces + metrics
- Automated HITL gates
- Offline eval suite (CI)
- MCP scoped per project
Stage 04
Optimal
- Policy-as-code guardrails
- Dynamic HITL thresholds
- Full A2A + orchestration
- Continuous harness eval
- Cost attribution per team
- Security scan in CI/CD
- Unbounded iteration — no max_iterations cap; agent burns tokens until OOM or API limit
- Silent errors — tool returns error as empty string; model assumes success
- Broad file access — agent has write access to entire repo including secrets, CI config
- No state checkpoints — 20-iteration task has no recovery point on failure
- Prompt-only guardrails — "don't touch .env" in the prompt; not enforced deterministically
- No loop detection — same tool called 8× with identical inputs before token exhaustion
- All MCPs enabled — every integration loaded regardless of current task
- Human gate theater — HITL approval always auto-approves due to timeout policy
- Bounded loops — hard iteration cap; structured handoff artifact on max-iter
- Structured tool errors — every error has type, detail, and suggested recovery hint
- Minimal tool surface — 3–5 scoped tools per session, not a full tool registry
- Explicit checkpointing — write progress artifact to disk every N iterations
- Deterministic guardrails — hook scripts enforce rules at OS level, not prompt level
- Loop detection — circuit breaker triggers HITL on 3× same tool+input
- Task-scoped MCP — only load servers required for the current task category
- Tiered HITL — auto-approve low-risk, manual-approve destructive, reject on timeout
⚠
Prompt injection via tools: When an agent reads external content (files, web pages, emails, database rows) and that content contains instructions ("Ignore previous instructions and..."), indirect prompt injection attacks can hijack the agent loop. Defense: treat all external tool output as untrusted data, not as instructions. Sanitize tool outputs before injection into the conversation; use a dedicated trusted/untrusted content zone in the context window.
Whether you're starting from a bare API call or retrofitting an existing agent, the path to production harness engineering follows the same six steps. Each step independently improves reliability. The order matters: guardrails before observability, feedback loop before HITL tuning.
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 1 — ASSET INVENTORY (Day 1–2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ List every tool the agent currently uses or could use
✦ Classify each tool: read-only / write / destructive
✦ Identify data the agent can touch (files, DBs, APIs, credentials)
✦ Map current failure modes: what breaks, how often, impact
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 2 — CONFIGURATION FILE (Day 3–4)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Create CLAUDE.md / AGENTS.md with:
- Explicit allowed operations
- Explicit prohibited operations
- Required verification steps after any change
- Escalation triggers (when to stop and ask)
✦ Set iteration cap (start: 15 for coding, 30 for research)
✦ Define context compaction strategy (compact at 50%, not 95%)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 3 — FEEDBACK LOOP (Week 1) ROI: HIGHEST
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Implement write-test-fix cycle:
- Agent writes → tests run automatically → failures injected as next input
✦ Add output validation: secret detection + static analysis + lint
✦ Structure error messages: type + detail + recovery hint
✦ Set max fix attempts (2–3) before escalating to human
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 4 — GUARDRAILS (Week 2)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Add pre-tool hook: validate commands before execution
✦ Scope file/directory access to minimum needed
✦ Add loop detection (3× same tool+input = escalate)
✦ Set token budget + cost alert threshold
✦ Define HITL gates for all destructive operations
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 5 — OBSERVABILITY (Week 3)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Instrument all model calls with OTel spans
✦ Log: every tool call, duration, error rate, cost
✦ Emit structured session summary on completion
✦ Set up dashboard: completion rate, avg iterations, cost/task
✦ Add post-turn hook for audit trail
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
STEP 6 — EVALUATE AND ITERATE (Ongoing)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✦ Build an offline eval suite (10–20 representative tasks)
✦ Run eval in CI on every harness config change
✦ Track: completion rate, cost/task, iteration count, error rate
✦ Review HITL logs weekly — tune thresholds based on outcomes
✦ Scan harness config for security issues (ecc-agentshield)
✦ Update CLAUDE.md/AGENTS.md as project evolves
Common Failure Modes Avoid
- Starting with orchestration before feedback loop is solid
- Treating CLAUDE.md as documentation, not a contract
- Giving the same agent all MCP servers for all tasks
- HITL gates that always auto-approve due to short timeouts
- No eval suite — can't tell if harness changes help or hurt
- Measuring only task success rate, not cost efficiency
Success Indicators Target
- Task completion rate >85% without human intervention
- Agent self-corrects test failures in <3 iterations
- No secret or PII ever appears in agent output
- Cost per task stable within ±20% across model upgrades
- Full trace replay possible for any failed session
- Harness config checked into source control, reviewed in PR
▸
Reference implementations and resources: NIST AI Agent Standards Initiative (Feb 2026) · OWASP LLM Top 10 (2025) — especially LLM01 Prompt Injection and LLM06 Excessive Agency · IEEE CAI 2026 — Engineering Trustworthy Multi-Agent Systems · everything-claude-code (Anthropic Hackathon Winner, 140k⭐) — battle-tested hooks, skills, MCP configs · ecc-agentshield — harness security scanner · AWS AgentCore Samples — full production harness reference across multiple frameworks