LangSmith Handbook
A production-oriented reference for LangSmith tracing, datasets, evaluation, feedback loops, and prompt management across LangChain and framework-agnostic Python applications.
Table of Contents
This release includes the complete handbook: setup, tracing, datasets, evaluation, metadata, feedback workflows, and prompt management. The structure stays aligned with the reference handbook while the examples stay current with the modern LangSmith SDK.
- Module 1: Setup & Core Client
- Module 2: Tracing & Observability
- Module 3: Datasets & Test Cases
- Module 4: Evaluation (Offline Testing)
- Module 5: Metadata, Tags, & User Feedback
- Module 6: The LangChain Prompt Hub
Code examples use the modern `langsmith` SDK throughout.

Module 1: Setup & Core Client
LangSmith tracing can be turned on with a minimal environment contract and then accessed programmatically through the Client. In practice, this means you can start with environment-only tracing for fast adoption, then graduate to the SDK for datasets, evaluation, automation, and production feedback workflows.
Required variables: LANGSMITH_TRACING, LANGSMITH_API_KEY, and LANGSMITH_PROJECT. These are the core switches that enable tracing, authenticate your process, and group runs under a project.
| Variable | Purpose | Example Value |
|---|---|---|
| `LANGSMITH_TRACING` | Turns tracing on for supported integrations and SDK helpers. | `true` |
| `LANGSMITH_API_KEY` | Authenticates requests to LangSmith. | `lsv2_pt_...` |
| `LANGSMITH_PROJECT` | Sets the logical project name shown in the UI. | `customer-support-prod` |
Set `LANGSMITH_WORKSPACE_ID` if your API key can access multiple workspaces, and set `LANGSMITH_ENDPOINT` if you use a self-hosted or regional deployment.

Recommended Local Setup Script
The snippet below is Python instead of shell so it can be pasted directly into a local bootstrap script, test harness, notebook, or demo app. In production, prefer a secrets manager or runtime environment injection rather than hardcoding values.
import os
def configure_langsmith_environment() -> None:
"""Set the minimum LangSmith environment required for tracing.
This is convenient for local demos and reproducible examples.
In production, use your platform's secret manager or environment injection.
"""
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "YOUR_LANGSMITH_API_KEY"
os.environ["LANGSMITH_PROJECT"] = "langsmith-handbook-dev"
# Optional when one API key can access more than one workspace.
# os.environ["LANGSMITH_WORKSPACE_ID"] = "YOUR_WORKSPACE_ID"
# Optional for EU, self-hosted, or custom LangSmith deployments.
# os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
def validate_langsmith_environment() -> None:
"""Fail early if a required variable is missing.
Explicit validation is useful in CI, worker boot, and local smoke tests.
"""
required_keys = (
"LANGSMITH_TRACING",
"LANGSMITH_API_KEY",
"LANGSMITH_PROJECT",
)
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
raise RuntimeError(f"Missing LangSmith configuration: {', '.join(missing)}")
if __name__ == "__main__":
configure_langsmith_environment()
validate_langsmith_environment()
print("LangSmith environment is configured correctly.")
What This Enables
- LangChain auto-tracing: supported LangChain executions will log to LangSmith once tracing is enabled.
- Vanilla Python instrumentation: the same project and credentials are reused by `@traceable`, wrappers, evaluation helpers, and the API client.
- Consistent project boundaries: all runs from the process are grouped under the configured project unless overridden dynamically.
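Because all three behaviors hinge on the same environment variables, it is worth being able to inspect the effective configuration at process start. A minimal sketch (the `"default"` fallback project name is an assumption about LangSmith's behavior when no project is set; confirm it for your deployment):

```python
import os


def effective_tracing_config() -> dict[str, object]:
    """Report the tracing configuration this process will actually use.

    Note: the "default" fallback project name is an assumption based on
    typical LangSmith behavior; verify it against your deployment.
    """
    return {
        "tracing_enabled": os.getenv("LANGSMITH_TRACING", "").lower() == "true",
        "project": os.getenv("LANGSMITH_PROJECT") or "default",
        "has_api_key": bool(os.getenv("LANGSMITH_API_KEY")),
    }


if __name__ == "__main__":
    print(effective_tracing_config())
```

Logging this dictionary (minus secrets) at worker boot makes "why is nothing showing up in LangSmith?" debugging much faster.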
Standard: initialize Client from the langsmith package when you need to create datasets, examples, feedback, evaluations, or interact with LangSmith programmatically. The client can read credentials from environment variables or be configured explicitly.
Environment-Driven Client
This is the preferred default for application code because it keeps secrets out of source and aligns with standard deployment workflows.
import os
from langsmith import Client
def build_langsmith_client() -> Client:
"""Create a LangSmith client using environment-based configuration.
The Client automatically reads LANGSMITH_API_KEY and related settings,
so this version works well in apps, jobs, and CI pipelines.
"""
if not os.getenv("LANGSMITH_API_KEY"):
raise RuntimeError("LANGSMITH_API_KEY must be set before creating Client().")
return Client()
if __name__ == "__main__":
client = build_langsmith_client()
print(f"LangSmith client initialized: {client.__class__.__name__}")
Explicit Client Configuration
Use explicit configuration when environment variables are not available, when you are bootstrapping tracing programmatically, or when you need to target a custom LangSmith endpoint.
from langsmith import Client
def build_explicit_langsmith_client() -> Client:
"""Create a LangSmith client with explicit API settings.
This pattern is useful when credentials come from a secret manager,
a deployment platform, or a custom configuration service.
"""
return Client(
api_key="YOUR_LANGSMITH_API_KEY",
api_url="https://api.smith.langchain.com",
)
if __name__ == "__main__":
client = build_explicit_langsmith_client()
print("Explicit LangSmith client created successfully.")
Minimal Programmatic API Example
The example below proves the client is more than a tracing helper. It is the entry point for platform automation and should be treated like a first-class SDK object in your MLOps codebase.
from typing import Iterable
from langsmith import Client
def list_project_names(limit: int = 5) -> list[str]:
"""Return a small sample of project names visible to the client.
Reading back projects is a simple smoke test that confirms the client can
authenticate and talk to the LangSmith API successfully.
"""
client = Client()
projects: Iterable = client.list_projects(limit=limit)
return [project.name for project in projects]
if __name__ == "__main__":
names = list_project_names(limit=5)
print("Visible LangSmith projects:")
for name in names:
print(f"- {name}")
Prefer environment-driven configuration, and use explicit Client(...) initialization only when you intentionally need to override the runtime environment.

Module 1 Summary
- Tracing baseline: set `LANGSMITH_TRACING`, `LANGSMITH_API_KEY`, and `LANGSMITH_PROJECT`.
- Programmatic access: use `from langsmith import Client` for datasets, evaluation, feedback, and platform automation.
- Deployment guidance: prefer environment variables in production; use explicit client parameters when you need custom endpoints or runtime-only secret retrieval.
Module 2: Tracing & Observability
LangSmith tracing works in two complementary modes. With LangChain, tracing can be almost automatic once the environment is configured. Without LangChain, the langsmith SDK gives you explicit instrumentation primitives such as @traceable, the trace context manager, and client wrappers for provider SDKs like OpenAI.
Standard behavior: once LANGSMITH_TRACING=true and your credentials are configured, LangChain LCEL chains automatically emit traces to LangSmith. No extra tracing code is required for the basic path.
import os
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
def run_langchain_autotraced_example() -> str:
"""Run a simple LCEL chain that is auto-traced by LangSmith.
As long as LANGSMITH_TRACING, LANGSMITH_API_KEY, and LANGSMITH_PROJECT are
configured in the environment, this invocation will appear in LangSmith.
"""
required_keys = (
"LANGSMITH_TRACING",
"LANGSMITH_API_KEY",
"LANGSMITH_PROJECT",
"OPENAI_API_KEY",
)
missing = [key for key in required_keys if not os.getenv(key)]
if missing:
raise RuntimeError(f"Missing configuration: {', '.join(missing)}")
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are a helpful assistant. Answer only from the given context.",
),
("user", "Question: {question}\nContext: {context}"),
]
)
model = ChatOpenAI(model="gpt-4.1-mini")
parser = StrOutputParser()
chain = prompt | model | parser
return chain.invoke(
{
"question": "What happened in this morning's meeting?",
"context": "The team finalized the migration plan and assigned owners.",
}
)
if __name__ == "__main__":
print(run_langchain_autotraced_example())
Recommended SDK primitive: use the @traceable decorator to mark custom functions as LangSmith runs. Choose a run_type such as chain, tool, or llm so the trace renders correctly in the LangSmith UI.
from openai import OpenAI
from langsmith import traceable
openai_client = OpenAI()
@traceable(run_type="tool", name="Build Context")
def build_context(question: str) -> str:
"""Return retrieval context as a traceable tool step.
Marking this as a tool keeps the trace tree semantically meaningful.
"""
if "meeting" in question.lower():
return "During the meeting, the migration was approved and owners were assigned."
return "No relevant context was found."
@traceable(run_type="llm", name="OpenAI Summarizer")
def call_llm(messages: list[dict[str, str]]) -> str:
"""Call the LLM and record the step as an LLM run.
Using run_type='llm' helps LangSmith render token, latency, and model data
appropriately for this node in the trace tree.
"""
response = openai_client.chat.completions.create(
model="gpt-4.1-mini",
temperature=0,
messages=messages,
)
return response.choices[0].message.content or ""
@traceable(run_type="chain", name="Summarize Question")
def summarize_question(question: str) -> str:
"""Compose a small pipeline using traceable child steps.
LangSmith automatically nests traceable calls made inside this function.
"""
context = build_context(question)
messages = [
{
"role": "system",
"content": "You are a helpful assistant. Answer only from the provided context.",
},
{
"role": "user",
"content": f"Question: {question}\nContext: {context}",
},
]
return call_llm(messages)
if __name__ == "__main__":
answer = summarize_question("Can you summarize the meeting?")
print(answer)
Recommended convention: give orchestration functions run_type="chain", helper retrieval or enrichment functions run_type="tool", and model calls run_type="llm". This makes the trace readable for humans and evaluators.

Best use case: if you already call the native OpenAI SDK directly, wrap the client once with langsmith.wrappers.wrap_openai. This preserves your existing code style while automatically tracing chat and responses API calls.
import openai
from langsmith.wrappers import wrap_openai
def run_wrapped_openai_example() -> str:
"""Trace a native OpenAI SDK call without switching to LangChain.
The wrapped client behaves like the normal OpenAI client, but LangSmith
automatically captures the request and response as a traced run.
"""
client = wrap_openai(openai.OpenAI())
messages = [
{"role": "system", "content": "You are a concise assistant."},
{
"role": "user",
"content": "List three reasons why observability matters in LLM systems.",
},
]
completion = client.chat.completions.create(
model="gpt-4o-mini",
messages=messages,
)
return completion.choices[0].message.content or ""
if __name__ == "__main__":
print(run_wrapped_openai_example())
import openai
from langsmith.wrappers import wrap_openai
def run_wrapped_openai_responses_api() -> str:
"""Trace the OpenAI Responses API through the same wrapped client.
This is useful when your application uses the newer OpenAI responses style
instead of chat.completions.
"""
client = wrap_openai(openai.OpenAI())
    response = client.responses.create(
        model="gpt-4o-mini",
        input=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is LangSmith used for?"},
        ],
    )
return response.output_text
if __name__ == "__main__":
print(run_wrapped_openai_responses_api())
Module 3: Datasets & Test Cases
Datasets are the backbone of repeatable evaluation in LangSmith. They let you store canonical inputs, expected outputs, metadata, and trace-linked examples so offline testing and regression analysis can run against a stable benchmark instead of ad hoc prompts.
Standard: create datasets programmatically so they can be versioned, recreated in CI, and managed as part of your evaluation pipeline. For most app testing, the default kv data type is appropriate.
from langsmith import Client
from langsmith.schemas import DataType
def create_langsmith_dataset() -> str:
"""Create a dataset for offline QA and return its ID.
Defining the schemas up front helps other engineers understand exactly what
each example in the dataset is expected to contain.
"""
client = Client()
dataset = client.create_dataset(
dataset_name="support-agent-regression-suite",
description="Regression test cases for customer support answer quality.",
data_type=DataType.kv,
inputs_schema={
"type": "object",
"properties": {
"question": {"type": "string"},
"context": {"type": "string"},
},
"required": ["question", "context"],
},
outputs_schema={
"type": "object",
"properties": {
"answer": {"type": "string"},
},
"required": ["answer"],
},
metadata={"owner": "ml-platform", "use_case": "offline-evaluation"},
)
return str(dataset.id)
if __name__ == "__main__":
print(create_langsmith_dataset())
Preferred bulk pattern: use create_examples(..., examples=[...]) with a single list of example objects. This is the current recommended API instead of older split-argument upload patterns.
from typing import Any
from langsmith import Client
def populate_dataset_examples() -> dict[str, Any]:
"""Populate a dataset with question and answer examples.
The examples list is the modern API because each row stays self-contained,
making the code easier to review and update over time.
"""
client = Client()
examples = [
{
"inputs": {
"question": "What happened in the morning migration meeting?",
"context": "The migration plan was approved and owners were assigned.",
},
"outputs": {
"answer": "The migration plan was approved and owners were assigned.",
},
"metadata": {"difficulty": "easy", "topic": "meetings"},
"splits": ["train"],
},
{
"inputs": {
"question": "Who owns the migration follow-up work?",
"context": "Ava owns rollout coordination and Liam owns validation.",
},
"outputs": {
"answer": "Ava owns rollout coordination and Liam owns validation.",
},
"metadata": {"difficulty": "medium", "topic": "ownership"},
"splits": ["test"],
},
]
return client.create_examples(
dataset_name="support-agent-regression-suite",
examples=examples,
)
if __name__ == "__main__":
response = populate_dataset_examples()
print(response)
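Because the dataset above declares inputs and outputs schemas, it can pay off to check rows locally before uploading them. The helper below is hypothetical (not part of the langsmith SDK); it simply mirrors the question/context/answer schema declared earlier in this module:

```python
def validate_example_rows(examples: list[dict]) -> list[str]:
    """Return human-readable problems found in example rows.

    Hypothetical pre-upload check, not an SDK function: it mirrors the
    inputs/outputs schema declared when the dataset was created.
    """
    problems: list[str] = []
    for index, example in enumerate(examples):
        inputs = example.get("inputs", {})
        outputs = example.get("outputs", {})
        # Every input field declared as required in the dataset schema.
        for field in ("question", "context"):
            if not isinstance(inputs.get(field), str):
                problems.append(f"example {index}: inputs.{field} must be a string")
        if not isinstance(outputs.get("answer"), str):
            problems.append(f"example {index}: outputs.answer must be a string")
    return problems


if __name__ == "__main__":
    rows = [{"inputs": {"question": "Q?", "context": "C"}, "outputs": {"answer": "A"}}]
    print(validate_example_rows(rows))
```

Running this before `create_examples` turns schema mismatches into a clear local error instead of a failed or silently malformed upload.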
Use split names such as train, test, or validation to organize how the dataset is consumed later by evaluators.

Production feedback loop: promote successful production runs into datasets so you can turn real traffic into regression test cases. This is one of the strongest LangSmith workflows because it connects observability directly to evaluation.
from langsmith import Client
def create_example_from_known_run(run_id: str) -> str:
"""Promote an existing production run into a dataset example.
This is useful when a real user interaction becomes a strong regression test
case that you want to preserve for future evaluation runs.
"""
client = Client()
run = client.read_run(run_id)
example = client.create_example_from_run(
run,
dataset_name="support-agent-regression-suite",
)
return str(example.id)
if __name__ == "__main__":
print(create_example_from_known_run("YOUR_RUN_ID"))
from typing import Iterable
from langsmith import Client
def backfill_examples_from_recent_production_runs(limit: int = 3) -> list[str]:
"""Create dataset examples from recent production runs.
This pattern is useful for curating a regression suite from high-value or
representative real-world traffic after manual review.
"""
client = Client()
recent_runs: Iterable = client.list_runs(
project_name="customer-support-prod",
limit=limit,
)
created_example_ids: list[str] = []
for run in recent_runs:
        # Only promote runs that completed without error and produced outputs.
        if getattr(run, "error", None) or not getattr(run, "outputs", None):
            continue
example = client.create_example_from_run(
run,
dataset_name="support-agent-regression-suite",
)
created_example_ids.append(str(example.id))
return created_example_ids
if __name__ == "__main__":
print(backfill_examples_from_recent_production_runs())
Module 3 Summary
- Create datasets programmatically: use `client.create_dataset(...)` so evaluation infrastructure can be reproduced reliably.
- Use bulk example creation: prefer `client.create_examples(..., examples=[...])` for clean, modern dataset population.
- Promote real traces into test cases: use `client.create_example_from_run(...)` to turn production observations into durable regression coverage.
Module 4: Evaluation (Offline Testing)
Tracing explains what happened in production. Evaluation tells you whether the system is getting better. In LangSmith, the usual pattern is to define a target application function, point it at a dataset, and run one or more evaluators that score quality dimensions such as correctness, helpfulness, retrieval quality, or policy compliance.
Core workflow: give LangSmith a target callable, a dataset or iterable of examples, and a list of evaluators. LangSmith will execute the target over the dataset, record the experiment, and store evaluator outputs alongside traces for later comparison.
from langsmith import traceable
from langsmith.evaluation import evaluate
@traceable(name="support_agent", run_type="chain")
def support_agent(inputs: dict) -> dict:
question = inputs["question"]
if "refund" in question.lower():
return {"answer": "Refunds can be requested within 30 days of purchase."}
return {"answer": "Please contact support@example.com for account help."}
def exact_match(outputs: dict, reference_outputs: dict) -> dict:
predicted = outputs.get("answer", "").strip().lower()
expected = reference_outputs.get("answer", "").strip().lower()
return {
"key": "exact_match",
"score": 1 if predicted == expected else 0,
}
if __name__ == "__main__":
experiment_results = evaluate(
target=support_agent,
data="support-agent-regression-suite",
evaluators=[exact_match],
experiment_prefix="support-agent-offline",
description="Baseline regression sweep for the current support workflow.",
max_concurrency=4,
)
print(experiment_results)
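The evaluator dicts in this module all share a key/score shape, so a quick local aggregate can be computed once results are collected. This sketch works on plain result dicts and is independent of the SDK's own experiment summaries:

```python
from collections import defaultdict


def summarize_scores(results: list[dict]) -> dict[str, float]:
    """Average evaluator scores by key.

    `results` is a plain list of evaluator outputs such as
    {"key": "exact_match", "score": 1} -- the shape returned by the
    evaluators in this module, not an SDK object.
    """
    totals: dict[str, list[float]] = defaultdict(list)
    for result in results:
        totals[result["key"]].append(float(result["score"]))
    return {key: sum(scores) / len(scores) for key, scores in totals.items()}


if __name__ == "__main__":
    sample = [
        {"key": "exact_match", "score": 1},
        {"key": "exact_match", "score": 0},
    ]
    print(summarize_scores(sample))  # {'exact_match': 0.5}
```

This is handy in CI, where a single averaged number per evaluator key can gate a merge before anyone opens the LangSmith UI.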
The target function should accept the example inputs dict and return a dict such as {"answer": ...} or {"answer": ..., "citations": ...}.

Use when you can define a rule: exact match, keyword presence, citation coverage, JSON schema validity, or refusal-policy checks are usually better handled by deterministic code before you reach for a judge model.
import re
def contains_order_number(outputs: dict) -> dict:
answer = outputs.get("answer", "")
has_order_number = bool(re.search(r"ORD-\d{6}", answer))
return {
"key": "contains_order_number",
"score": 1 if has_order_number else 0,
"comment": "Checks whether the answer includes a formatted order number.",
}
def mentions_refund_window(outputs: dict, reference_outputs: dict) -> dict:
answer = outputs.get("answer", "").lower()
expected_phrase = reference_outputs.get("required_phrase", "").lower()
return {
"key": "mentions_refund_window",
"score": 1 if expected_phrase and expected_phrase in answer else 0,
}
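The "JSON schema validity" check mentioned above fits the same deterministic style. A sketch that scores whether the answer parses as a JSON object with required keys (the `required_keys` names are illustrative, not from the source application):

```python
import json


def valid_json_answer(outputs: dict) -> dict:
    """Score 1 when the answer parses as a JSON object with required keys.

    The required key names here are illustrative; adapt them to the
    schema your application actually promises.
    """
    required_keys = ("answer", "citations")
    try:
        payload = json.loads(outputs.get("answer", ""))
    except json.JSONDecodeError:
        return {"key": "valid_json_answer", "score": 0, "comment": "Not valid JSON."}
    if not isinstance(payload, dict):
        return {"key": "valid_json_answer", "score": 0, "comment": "Not a JSON object."}
    missing = [key for key in required_keys if key not in payload]
    return {
        "key": "valid_json_answer",
        "score": 0 if missing else 1,
        "comment": f"Missing keys: {missing}" if missing else "All required keys present.",
    }
```

Like the other deterministic evaluators, this is a plain function, so it can be unit-tested on its own before being passed to `evaluate()`.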
Use when the quality bar is semantic: helpfulness, tone, groundedness, completeness, and rubric-based scoring often need language-model judgment. Keep the rubric explicit and return a structured score plus comment.
import json
from openai import OpenAI
from langsmith.wrappers import wrap_openai
judge_client = wrap_openai(OpenAI())
def helpfulness_judge(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
rubric = """
Score the assistant response from 0 to 1.
1.0 = fully correct and directly helpful
0.5 = partially helpful or incomplete
0.0 = incorrect, misleading, or irrelevant
Return strict JSON with keys: score, reasoning.
""".strip()
response = judge_client.responses.create(
model="gpt-4.1-mini",
input=[
{
"role": "system",
"content": rubric,
},
{
"role": "user",
"content": json.dumps(
{
"question": inputs.get("question"),
"assistant_answer": outputs.get("answer"),
"reference_answer": reference_outputs.get("answer"),
}
),
},
],
)
payload = json.loads(response.output_text)
return {
"key": "helpfulness_judge",
"score": float(payload["score"]),
"comment": payload["reasoning"],
}
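Even with a strict rubric, judge models occasionally return malformed JSON, and `json.loads` alone would then fail the evaluator. A defensive parse keeps one bad completion from crashing the whole experiment; the fallback score of 0.0 and the truncation length are policy assumptions, not SDK conventions:

```python
import json


def parse_judge_payload(raw_text: str) -> dict:
    """Parse judge output defensively, falling back to a zero score.

    The fallback behavior (score 0.0 with an explanatory comment) is a
    policy choice for this sketch; adjust it to your needs.
    """
    try:
        payload = json.loads(raw_text)
        return {
            "key": "helpfulness_judge",
            "score": float(payload["score"]),
            "comment": str(payload.get("reasoning", "")),
        }
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {
            "key": "helpfulness_judge",
            "score": 0.0,
            "comment": f"Unparseable judge output: {raw_text[:200]}",
        }
```

In the judge above, `json.loads(response.output_text)` could be replaced by `parse_judge_payload(response.output_text)` so a single malformed completion scores zero instead of raising.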
Module 4 Summary
- Use `evaluate()` for repeatable experiments: target plus dataset plus evaluators is the core offline testing loop.
- Prefer deterministic evaluators where possible: they are cheaper, clearer, and easier to debug.
- Add model-based judges only for semantic criteria: rubric-driven grading works best when you require nuanced quality assessment.
Module 5: Metadata, Tags, & User Feedback
Production observability becomes useful when traces carry business context. Metadata and tags let you slice runs by tenant, feature flag, model version, channel, or experiment branch. The feedback API then closes the loop by attaching human or automated judgments directly to a run or trace.
Use inside traced execution: fetch the current run tree, then enrich it with details that matter later during filtering or root-cause analysis. This is where you annotate traces with the exact context that product teams care about.
from langsmith import get_current_run_tree, traceable
@traceable(name="answer_support_question", run_type="chain", tags=["support"])
def answer_support_question(question: str, customer_tier: str, channel: str) -> dict:
run_tree = get_current_run_tree()
if run_tree is not None:
run_tree.metadata["customer_tier"] = customer_tier
run_tree.metadata["channel"] = channel
run_tree.metadata["workflow_version"] = "2025-03-router-a"
run_tree.tags.extend([
f"tier:{customer_tier}",
f"channel:{channel}",
"experience:support",
])
return {
"answer": f"Handled question '{question}' for {customer_tier} tier customer."
}
from langsmith import trace
def sync_customer_profile(customer_id: str) -> dict:
with trace(
"sync_customer_profile",
run_type="tool",
tags=["crm", "profile-sync"],
metadata={"customer_id": customer_id, "source": "nightly-job"},
):
return {"status": "ok", "customer_id": customer_id}
Use after a run completes: attach a score, value, and comment to the run that produced an answer. This lets you build dashboards around thumbs-up rates, agent defects, human-review outcomes, or safety audits.
from langsmith import Client
def record_user_feedback(
run_id: str,
trace_id: str,
was_helpful: bool,
comment: str | None = None,
) -> None:
client = Client()
client.create_feedback(
run_id=run_id,
trace_id=trace_id,
key="user_helpfulness",
score=1.0 if was_helpful else 0.0,
value={"label": "thumbs_up" if was_helpful else "thumbs_down"},
comment=comment,
)
from langsmith import Client, get_current_run_tree, traceable
client = Client()
@traceable(name="moderated_answer", run_type="chain")
def moderated_answer(question: str) -> dict:
run_tree = get_current_run_tree()
answer = {"answer": "Please reset your password from the account settings page."}
if run_tree is not None:
client.create_feedback(
run_id=run_tree.id,
trace_id=run_tree.trace_id,
key="policy_review",
score=1.0,
comment="Passed automated policy checks.",
value={"reviewer": "automated-guardrail"},
)
return answer
| Signal | Where To Store It | Typical Example |
|---|---|---|
| Customer tier | Metadata | {"customer_tier": "enterprise"} |
| Broad grouping | Tags | ["support", "web-chat"] |
| User judgment | Feedback | score=1.0, key="user_helpfulness" |
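To keep the conventions in this table consistent across services, the annotation shapes can be centralized in one place. A hypothetical helper that produces the same `tier:` and `channel:` tag forms used earlier in this module:

```python
def build_run_annotations(customer_tier: str, channel: str) -> dict:
    """Build metadata and tags following one shared convention.

    Hypothetical helper: centralizing this prevents drift such as
    "tier:Enterprise" vs "tier:enterprise" across services, which would
    otherwise fragment filtering in the LangSmith UI.
    """
    tier = customer_tier.strip().lower()
    chan = channel.strip().lower()
    return {
        "metadata": {"customer_tier": tier, "channel": chan},
        "tags": [f"tier:{tier}", f"channel:{chan}", "experience:support"],
    }


if __name__ == "__main__":
    print(build_run_annotations("Enterprise", "Web-Chat"))
```

The returned dict can feed both `run_tree.metadata` updates and `trace(..., tags=..., metadata=...)` calls, so every entry point annotates runs identically.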
Module 5 Summary
- Annotate runs with business context: metadata and tags make traces searchable and operationally useful.
- Use `get_current_run_tree()` inside traced code: it is the clean way to enrich the active run.
- Attach explicit feedback to runs: `client.create_feedback(...)` turns production judgments into durable quality signals.
Module 6: The LangChain Prompt Hub
The modern Prompt Hub is backed by LangSmith. You can pull published prompts directly into code, pin a specific prompt revision for reproducibility, and push private prompt templates from local development into your workspace. The most practical approach is to use the LangChain helper functions when you are already inside a LangChain application, and the LangSmith client directly when you want tighter SDK control.
Use for prompt reuse and pinning: pull a community prompt, your own private prompt, or a specific revision so evaluations and deployments stay reproducible.
import os
from langchain_classic import hub
public_prompt = hub.pull("efriis/my-first-prompt")
private_prompt = hub.pull(
"support-agent-router",
api_key=os.environ["LANGSMITH_API_KEY"],
)
pinned_prompt = hub.pull(
"acme/support-agent-router:YOUR_COMMIT_HASH",
api_key=os.environ["LANGSMITH_API_KEY"],
)
print(type(public_prompt))
print(type(private_prompt))
print(type(pinned_prompt))
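The identifier strings passed to hub.pull combine an optional owner, a prompt name, and an optional commit hash. A small hypothetical parser (not an SDK function) makes that convention explicit, which is useful for config validation or logging:

```python
def parse_prompt_identifier(identifier: str) -> dict:
    """Split a prompt identifier into owner, name, and commit hash.

    Hypothetical helper. Handles "owner/name:commit", "owner/name", and
    the bare "name" form used for your own private prompt repository.
    """
    commit = None
    if ":" in identifier:
        identifier, commit = identifier.rsplit(":", 1)
    owner = None
    if "/" in identifier:
        owner, name = identifier.split("/", 1)
    else:
        name = identifier
    return {"owner": owner, "name": name, "commit": commit}


if __name__ == "__main__":
    print(parse_prompt_identifier("acme/support-agent-router:YOUR_COMMIT_HASH"))
```

Rejecting identifiers with a missing commit at deploy time, for example, is a cheap way to enforce the revision-pinning practice described below.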
Identifiers take the form owner/prompt_name, owner/prompt_name:commit_hash, or just prompt_name when you are addressing your own private prompt repository.

Use when a local prompt becomes shared infrastructure: push a prompt template into the Hub so application code, evaluations, and teammates can all reference the same artifact.
import os
from langchain_core.prompts import ChatPromptTemplate
from langchain_classic import hub
router_prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are a support router. Classify tickets into billing, account, or technical support.",
),
("user", "{question}"),
]
)
hub_url = hub.push(
"support-agent-router",
router_prompt,
api_key=os.environ["LANGSMITH_API_KEY"],
new_repo_is_public=False,
new_repo_description="Internal router prompt for support ticket triage.",
readme="# Support Agent Router\n\nInternal prompt used by the production support workflow.",
tags=["support", "router", "internal"],
)
print(hub_url)
Use when you want SDK-native control: the LangSmith client exposes prompt APIs directly, including flags like include_model, tags, description, and repository visibility.
from langchain_core.prompts import ChatPromptTemplate
from langsmith import Client
client = Client()
prompt = client.pull_prompt("support-agent-router", include_model=False)
published_url = client.push_prompt(
"support-agent-router-v2",
object=ChatPromptTemplate.from_messages(
[
("system", "Answer support questions using policy-approved wording only."),
("user", "{question}"),
]
),
is_public=False,
description="Second-generation prompt with stricter policy wording.",
tags=["support", "policy", "draft"],
)
print(prompt)
print(published_url)
Module 6 Summary
- Pull prompts directly into application code: use the Hub for reuse, versioning, and reproducible experiments.
- Push shared prompt templates into private repositories: this turns prompt changes into managed artifacts instead of local string edits.
- Use LangChain helpers or the LangSmith client depending on context: both paths target the same Prompt Hub capabilities.