Weights & Biases (W&B) Enterprise Handbook
A production-ready guide to experiment tracking, media logging, sweeps, artifacts, framework integrations, and collaboration patterns that keep ML teams reproducible and out of spreadsheet chaos.
Module 1: The Foundation (Tracking Experiments)
Experiment tracking is the first layer of MLOps discipline. Without it, teams end up with model names like model_v4_final_really_final.pt and no reliable way to reproduce results. W&B gives every run an identity, a config, a metric history, and a home in the UI.
A run becomes reproducible when you can recover the hyperparameters, the code version, and the exact dataset or model artifact it used. W&B is valuable because it ties those pieces together.
The Setup
Never hardcode a W&B API key into source code. Treat it like any other credential. The secure pattern is to store the key in an environment variable and let wandb login or wandb.login() read it from the environment. That keeps secrets out of repos and CI logs.
# PowerShell
$env:WANDB_API_KEY = "your_api_key_here"
wandb login

# Bash
export WANDB_API_KEY="your_api_key_here"
wandb login
from __future__ import annotations

import os

import wandb


def login_to_wandb() -> None:
    api_key: str | None = os.getenv("WANDB_API_KEY")
    if not api_key:
        raise RuntimeError("WANDB_API_KEY is not set in the environment.")
    # Read the key from the environment instead of hardcoding secrets in source.
    wandb.login(key=api_key, relogin=True)


if __name__ == "__main__":
    login_to_wandb()
Init & Config
wandb.init() starts a tracked run. The config dictionary is critical because it turns hyperparameters into filterable metadata in the W&B UI. That means you can group runs by learning rate, batch size, optimizer, or architecture instead of reverse-engineering those choices from notes later.
from __future__ import annotations

import wandb

run = wandb.init(
    project="my-project",
    job_type="train",
    group="resnet50-baselines",
    tags=["baseline", "image-classification"],
    config={
        "learning_rate": 1e-3,
        "batch_size": 64,
        "epochs": 10,
        "optimizer": "adamw",
        "model_name": "resnet50",
        "dataset_name": "imagenette",
    },
)

# Access config values the same way everywhere in your training code.
learning_rate: float = wandb.config.learning_rate
batch_size: int = wandb.config.batch_size
print(f"Tracking run: {run.name}")
The Training Loop
The right pattern is to log metrics at meaningful checkpoints, not mindlessly on every inner-loop operation. A good mental model is that W&B is your experiment journal. You write entries often enough to reconstruct the story, but not so often that journaling becomes the workload.
from __future__ import annotations

import math
import random
from typing import Final

import wandb

TOTAL_EPOCHS: Final[int] = 5
LOG_EVERY_N_STEPS: Final[int] = 20


def train() -> None:
    run = wandb.init(
        project="my-project",
        config={
            "epochs": TOTAL_EPOCHS,
            "learning_rate": 1e-3,
            "model_name": "toy-classifier",
        },
    )
    try:
        global_step: int = 0
        for epoch in range(TOTAL_EPOCHS):
            running_loss: float = 0.0
            running_accuracy: float = 0.0
            for step in range(100):
                # Simulate a real training step. Replace with your model code.
                loss: float = max(0.05, math.exp(-global_step / 150) + random.uniform(0.0, 0.02))
                accuracy: float = min(0.99, 0.50 + global_step / 400 + random.uniform(0.0, 0.03))
                running_loss += loss
                running_accuracy += accuracy
                global_step += 1
                if global_step % LOG_EVERY_N_STEPS == 0:
                    # Log aggregated metrics instead of spamming the network every batch.
                    wandb.log(
                        {
                            "train/loss": running_loss / LOG_EVERY_N_STEPS,
                            "train/accuracy": running_accuracy / LOG_EVERY_N_STEPS,
                            "epoch": epoch,
                        },
                        step=global_step,
                    )
                    running_loss = 0.0
                    running_accuracy = 0.0
        wandb.log({"status": "completed", "final_step": global_step})
    finally:
        # Always close the run, even if the script errors or exits early.
        wandb.finish()


if __name__ == "__main__":
    train()
Module 2: Rich Media & Interactive Visualizations
W&B is not just for scalar loss curves. In production ML, the question is often not “did accuracy go up?” but “what kinds of examples fail, how do predictions drift, and what evidence do we have beyond a single number?” Media logging makes those answers visible.
Logging Media
Images, plots, and audio samples are especially useful when the output itself is the product. In a vision pipeline, logging one prediction image every 10 epochs is like checking real specimens in a lab instead of only reading aggregate metrics on a dashboard.
from __future__ import annotations

import numpy as np
import matplotlib.pyplot as plt

import wandb

run = wandb.init(project="media-demo", config={"epochs": 30})
try:
    sample_rate: int = 16000
    for epoch in range(wandb.config.epochs):
        if epoch % 10 == 0:
            # Example image payload.
            image_array = np.random.randint(0, 255, size=(128, 128, 3), dtype=np.uint8)

            # Example matplotlib figure payload.
            fig, ax = plt.subplots(figsize=(4, 3))
            ax.plot(np.linspace(0, 1, 50), np.random.rand(50))
            ax.set_title(f"Prediction confidence at epoch {epoch}")

            # Example audio payload.
            audio_waveform = np.random.uniform(-0.1, 0.1, size=(sample_rate,)).astype(np.float32)

            wandb.log(
                {
                    "examples/prediction_image": wandb.Image(image_array, caption=f"Epoch {epoch} sample"),
                    "charts/confidence_curve": wandb.Image(fig),
                    "audio/sample_output": wandb.Audio(audio_waveform, sample_rate=sample_rate, caption=f"Epoch {epoch}"),
                    "epoch": epoch,
                }
            )
            plt.close(fig)
finally:
    wandb.finish()
W&B Tables
wandb.Table() is one of the most useful features for model debugging because it captures row-level evidence. Instead of seeing “validation accuracy = 0.88,” you can inspect the exact predictions, labels, and confidence scores that created that number. It turns the UI into an interactive dataframe for failure analysis.
from __future__ import annotations

import wandb

run = wandb.init(project="table-demo")
try:
    prediction_table = wandb.Table(columns=["text", "actual_label", "predicted_label", "confidence"])
    rows = [
        ("refund still not received after 10 days", "billing", "billing", 0.97),
        ("the app crashes on launch on ios", "bug", "bug", 0.93),
        ("can you add dark mode to the dashboard", "feature", "billing", 0.41),
    ]
    for text, actual_label, predicted_label, confidence in rows:
        prediction_table.add_data(text, actual_label, predicted_label, confidence)
    wandb.log({"validation/prediction_table": prediction_table})
finally:
    wandb.finish()
Module 3: Hyperparameter Optimization (W&B Sweeps)
Sweeps separate parameter search orchestration from local execution. The W&B backend acts as the controller that proposes the next hyperparameter set, while your machine runs the training job as an agent. That division matters because it lets teams coordinate many experiments without custom orchestration glue.
Sweep Configuration
A sweep config defines what metric to optimize, how to search, and what parameter ranges are legal. Think of it as a search contract: you are telling the controller what it is allowed to try and how success should be measured.
sweep_config = {
    "method": "bayes",
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize",
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {
            "values": [32, 64, 128],
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5,
        },
        "optimizer": {
            "values": ["adamw", "sgd"],
        },
    },
}
The same contract can also be written as YAML and registered through the CLI. This variant describes a random search that minimizes val/loss:

method: random
metric:
  name: val/loss
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  batch_size:
    values: [32, 64, 128]
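The YAML route pairs with two CLI commands: one to register the sweep with the controller and one to start an agent on each machine. A minimal sketch; the sweep.yaml filename and the entity/project/sweep-ID path below are placeholders you would replace with your own:

```shell
# Register the sweep with the W&B controller; this prints a sweep ID.
wandb sweep sweep.yaml

# Start an agent that asks the controller for the next configuration.
# Run this on as many machines as you want to parallelize the search.
wandb agent my-entity/my-project/sweep_id
```

Each agent keeps pulling configurations until the controller stops proposing new ones or the agent reaches its run count.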
Execution
The local training function reads whatever parameters the W&B controller assigned to the current run. That means your training function stays standard, while the sweep controller decides which configuration should run next.
from __future__ import annotations

import random

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "val/accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 5e-4, 1e-3]},
        "batch_size": {"values": [32, 64]},
    },
}


def train_sweep() -> None:
    with wandb.init(project="sweep-demo") as run:
        config = wandb.config
        # Replace this with the real training loop using config.learning_rate, etc.
        score = 0.70 + random.uniform(0.0, 0.2)
        wandb.log(
            {
                "val/accuracy": score,
                "learning_rate": config.learning_rate,
                "batch_size": config.batch_size,
            }
        )


sweep_id = wandb.sweep(sweep=sweep_config, project="sweep-demo")
wandb.agent(sweep_id, function=train_sweep, count=10)
Module 4: Data & Model Versioning (Artifacts)
You should never upload a 5GB model weight file through wandb.log() as if it were just another metric. Artifacts exist to track datasets, models, and derived assets with lineage. They answer the question “which exact asset version fed this run?” instead of leaving that answer in a README or file path convention.
Logging a Dataset / Model
Artifacts are the right primitive for durable assets because they version files, preserve relationships, and show downstream usage. This is closer to package management than metric logging.
from __future__ import annotations

import wandb

run = wandb.init(project="artifact-demo", job_type="train")
try:
    dataset_artifact = wandb.Artifact(name="raw-data", type="dataset")
    dataset_artifact.add_dir("data/raw")
    run.log_artifact(dataset_artifact)

    model_artifact = wandb.Artifact(name="fraud-model", type="model")
    model_artifact.add_file("checkpoints/model.pt")
    model_artifact.metadata = {
        "framework": "pytorch",
        "stage": "candidate",
    }
    run.log_artifact(model_artifact)
finally:
    wandb.finish()
Consuming an Artifact
Downstream jobs such as evaluation, batch inference, or deployment should consume a named artifact version instead of a loose file path. That is how you make a deployment pipeline deterministic instead of depending on whichever file happened to be in a directory.
from __future__ import annotations

from pathlib import Path

import wandb

run = wandb.init(project="artifact-demo", job_type="deploy")
try:
    # ":latest" follows the newest version; pin an explicit version such as
    # "fraud-model:v3" when the deployment must be fully deterministic.
    artifact = run.use_artifact("fraud-model:latest")
    artifact_dir: str = artifact.download()
    model_path = Path(artifact_dir) / "model.pt"
    print(f"Downloaded model artifact to {model_path}")
finally:
    wandb.finish()
Module 5: Seamless Framework Integrations
One reason teams adopt W&B quickly is that it integrates with common training frameworks without requiring manual metric plumbing everywhere. The core benefit is leverage: you keep the framework's training abstractions while still getting centralized tracking and visualization.
Hugging Face
For Hugging Face workflows, W&B usually becomes a one-line integration by setting report_to="wandb". That is useful because it turns standard trainer outputs into shared experiment records with almost no extra code.
from __future__ import annotations

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    report_to="wandb",
)

# train_dataset and eval_dataset are assumed to be tokenized Dataset objects
# prepared earlier in the pipeline.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
PyTorch Lightning / Keras
The Lightning and Keras integrations work for the same reason: W&B plugs into the framework's callback or logger system, so training events are forwarded automatically instead of being logged by hand.
from lightning.pytorch.loggers import WandbLogger

wandb_logger = WandbLogger(project="lightning-project", log_model="all")
# Trainer(logger=wandb_logger, max_epochs=10)
from wandb.integration.keras import WandbCallback

# model.fit(x_train, y_train, callbacks=[WandbCallback()])
Module 6: Collaboration & Reports
Once runs are tracked consistently, the web UI becomes the team's shared control room. This is where W&B moves from a personal logger to a collaboration platform: dashboards, grouped runs, run comparison, tables, media, and reportable narratives all live in one place.
The UI Experience
The W&B UI is most powerful when your runs are grouped with intention. If every run is just dumped into a project with random names, the dashboard becomes clutter. If runs carry groups, tags, job types, and consistent metric names, the UI becomes queryable and comparable.
from __future__ import annotations

import wandb

run = wandb.init(
    project="ranking-system",
    group="ablation-learning-rate",
    job_type="train",
    tags=["ablation", "ranking", "resnet-encoder"],
    config={
        "learning_rate": 5e-4,
        "batch_size": 128,
        "encoder": "resnet50",
    },
)
try:
    # Define how metrics should be summarized in the dashboard.
    wandb.define_metric("epoch")
    wandb.define_metric("val/accuracy", summary="max")
    wandb.define_metric("val/loss", summary="min")
    wandb.log({"epoch": 1, "val/accuracy": 0.86, "val/loss": 0.42})
finally:
    wandb.finish()
In practice, group runs by the variable you are trying to compare, such as optimizer family, data slice, or backbone. That way the dashboard answers actual experiment questions instead of acting as a flat dump of run history.
W&B Reports
A W&B Report is essentially a live research memo: text plus embedded charts, tables, media, and comparisons pulled directly from tracked runs. The reason teams like reports is that they remove the stale-slide problem. When new runs finish, the embedded visuals can update without someone rebuilding screenshots manually.
A good report explains the hypothesis, links the relevant run groups, embeds the charts that support the conclusion, and records the next action for stakeholders.
- Use reports for decisions, not dumps. Frame the hypothesis, the comparison set, and the conclusion.
- Embed live panels. Prefer linked charts over screenshots so new results appear automatically.
- Share with stakeholders. Reports are ideal when product, research, and engineering need one living source of truth.
Module 7: Common Pitfalls & Anti-Patterns
W&B is useful, but it is easy to misuse. Most failure modes are not bugs in the SDK. They are workflow mistakes: logging too often, never closing runs, or treating artifacts like disposable uploads instead of reusable lineage objects.
1. Network Bottlenecks
Calling wandb.log() inside a very tight inner loop, especially every batch, can slow training and generate noisy plots. The fix is to aggregate metrics and log every N steps. The analogy is batching telemetry packets instead of opening a new network conversation for every tiny event.
from __future__ import annotations

import wandb

run = wandb.init(project="pitfalls-demo")
try:
    accumulated_loss: float = 0.0
    log_every: int = 50
    for step in range(1000):
        loss = 1.0 / (step + 1)
        accumulated_loss += loss
        if (step + 1) % log_every == 0:
            wandb.log({"train/loss": accumulated_loss / log_every, "step": step + 1}, step=step + 1)
            accumulated_loss = 0.0
finally:
    wandb.finish()
2. Orphaned Runs
Runs get orphaned when notebooks are interrupted or scripts crash before cleanup. That leaves runs stuck in a running or crashed state and makes the project messy. The fix is to always close the run, preferably in a finally block or context manager.
from __future__ import annotations

import wandb


def run_experiment() -> None:
    run = wandb.init(project="pitfalls-demo")
    try:
        wandb.log({"status": "started"})
        # Replace with real notebook or training logic.
        raise RuntimeError("simulated failure")
    except Exception as exc:
        wandb.log({"error_message": str(exc)})
        raise
    finally:
        wandb.finish()
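wandb.init() also works as a context manager: the run is finished automatically when the with block exits, even if an exception escapes. The guarantee is ordinary Python context management, which the stand-in below demonstrates without needing a live W&B session (FakeRun and fake_init are illustrative names, not wandb APIs):

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator


class FakeRun:
    """Stand-in for the run object returned by wandb.init()."""

    def __init__(self) -> None:
        self.finished = False

    def finish(self) -> None:
        self.finished = True


@contextmanager
def fake_init() -> Iterator[FakeRun]:
    run = FakeRun()
    try:
        yield run
    finally:
        # Mirrors wandb's behavior: finish() runs even if the body raises.
        run.finish()


run: FakeRun | None = None
try:
    with fake_init() as run:
        raise RuntimeError("simulated notebook interruption")
except RuntimeError:
    pass

print(run.finished)  # -> True: the run was closed despite the crash
```

In real code the equivalent is simply `with wandb.init(project="pitfalls-demo") as run:` with no manual wandb.finish() call.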
3. Artifact Bloat
Uploading the same massive dataset on every run wastes bandwidth and storage. The correct pattern is to publish the dataset artifact once, version it when it truly changes, and have all downstream runs reference that artifact. Think of it like referencing a package version instead of emailing the same ZIP file to every teammate.
from __future__ import annotations

import wandb


def log_dataset_once() -> None:
    with wandb.init(project="artifact-governance", job_type="data-publish") as run:
        artifact = wandb.Artifact(name="customer-churn-dataset", type="dataset")
        artifact.add_dir("data/customer_churn")
        run.log_artifact(artifact)


def reference_existing_dataset() -> None:
    with wandb.init(project="artifact-governance", job_type="train") as run:
        artifact = run.use_artifact("customer-churn-dataset:latest")
        dataset_dir = artifact.download()
        print(dataset_dir)
Reference Links
- W&B Docs: official W&B documentation
- SDK: Python SDK reference
- Sweeps: W&B Sweeps guide
- Artifacts: Artifacts guide
- Reports: Reports guide
- Integrations: framework integrations