Weights & Biases (W&B) Enterprise Handbook
A production-ready guide to experiment tracking, media logging, sweeps, artifacts, framework integrations, and collaboration patterns that keep ML teams reproducible and out of spreadsheet chaos.
Module 1: The Foundation (Tracking Experiments)
Experiment tracking is the first layer of MLOps discipline. Without it, teams end up with model names like model_v4_final_really_final.pt and no reliable way to reproduce results. W&B gives every run an identity, a config, a metric history, and a home in the UI.
A run becomes reproducible when you can recover the hyperparameters, the code version, and the exact dataset or model artifact it used. W&B is valuable because it ties those pieces together.
The Setup
Never hardcode a W&B API key into source code. Treat it like any other credential. The secure pattern is to store the key in an environment variable and let wandb login or wandb.login() read it from the environment. That keeps secrets out of repos and CI logs.
# PowerShell
$env:WANDB_API_KEY = "your_api_key_here"
wandb login

# Bash
export WANDB_API_KEY="your_api_key_here"
wandb login
from __future__ import annotations

import os

import wandb


def login_to_wandb() -> None:
    api_key: str | None = os.getenv("WANDB_API_KEY")
    if not api_key:
        raise RuntimeError("WANDB_API_KEY is not set in the environment.")
    # Read the key from the environment instead of hardcoding secrets in source.
    wandb.login(key=api_key, relogin=True)


if __name__ == "__main__":
    login_to_wandb()
Init & Config
wandb.init() starts a tracked run. The config dictionary is critical because it turns hyperparameters into filterable metadata in the W&B UI. That means you can group runs by learning rate, batch size, optimizer, or architecture instead of reverse-engineering those choices from notes later.
from __future__ import annotations

import wandb

run = wandb.init(
    project="my-project",
    job_type="train",
    group="resnet50-baselines",
    tags=["baseline", "image-classification"],
    config={
        "learning_rate": 1e-3,
        "batch_size": 64,
        "epochs": 10,
        "optimizer": "adamw",
        "model_name": "resnet50",
        "dataset_name": "imagenette",
    },
)

# Access config values the same way everywhere in your training code.
learning_rate: float = wandb.config.learning_rate
batch_size: int = wandb.config.batch_size
print(f"Tracking run: {run.name}")
The Training Loop
The right pattern is to log metrics at meaningful checkpoints, not mindlessly on every inner-loop operation. A good mental model is that W&B is your experiment journal. You write entries often enough to reconstruct the story, but not so often that journaling becomes the workload.
from __future__ import annotations

import math
import random
from typing import Final

import wandb

TOTAL_EPOCHS: Final[int] = 5
LOG_EVERY_N_STEPS: Final[int] = 20


def train() -> None:
    run = wandb.init(
        project="my-project",
        config={
            "epochs": TOTAL_EPOCHS,
            "learning_rate": 1e-3,
            "model_name": "toy-classifier",
        },
    )
    try:
        global_step: int = 0
        for epoch in range(TOTAL_EPOCHS):
            running_loss: float = 0.0
            running_accuracy: float = 0.0
            for step in range(100):
                # Simulate a real training step. Replace with your model code.
                loss: float = max(0.05, math.exp(-global_step / 150) + random.uniform(0.0, 0.02))
                accuracy: float = min(0.99, 0.50 + global_step / 400 + random.uniform(0.0, 0.03))
                running_loss += loss
                running_accuracy += accuracy
                global_step += 1
                if global_step % LOG_EVERY_N_STEPS == 0:
                    # Log aggregated metrics instead of spamming the network every batch.
                    wandb.log(
                        {
                            "train/loss": running_loss / LOG_EVERY_N_STEPS,
                            "train/accuracy": running_accuracy / LOG_EVERY_N_STEPS,
                            "epoch": epoch,
                        },
                        step=global_step,
                    )
                    running_loss = 0.0
                    running_accuracy = 0.0
        wandb.log({"status": "completed", "final_step": global_step})
    finally:
        # Always close the run, even if the script errors or exits early.
        wandb.finish()


if __name__ == "__main__":
    train()
Module 2: Rich Media & Interactive Visualizations
W&B is not just for scalar loss curves. In production ML, the question is often not “did accuracy go up?” but “what kinds of examples fail, how do predictions drift, and what evidence do we have beyond a single number?” Media logging makes those answers visible.
Logging Media
Images, plots, and audio samples are especially useful when the output itself is the product. In a vision pipeline, logging one prediction image every 10 epochs is like checking real specimens in a lab instead of only reading aggregate metrics on a dashboard.
from __future__ import annotations

import numpy as np
import matplotlib.pyplot as plt

import wandb

run = wandb.init(project="media-demo", config={"epochs": 30})
try:
    sample_rate: int = 16000
    for epoch in range(wandb.config.epochs):
        if epoch % 10 == 0:
            # Example image payload.
            image_array = np.random.randint(0, 255, size=(128, 128, 3), dtype=np.uint8)

            # Example matplotlib figure payload.
            fig, ax = plt.subplots(figsize=(4, 3))
            ax.plot(np.linspace(0, 1, 50), np.random.rand(50))
            ax.set_title(f"Prediction confidence at epoch {epoch}")

            # Example audio payload.
            audio_waveform = np.random.uniform(-0.1, 0.1, size=(sample_rate,)).astype(np.float32)

            wandb.log(
                {
                    "examples/prediction_image": wandb.Image(image_array, caption=f"Epoch {epoch} sample"),
                    "charts/confidence_curve": wandb.Image(fig),
                    "audio/sample_output": wandb.Audio(audio_waveform, sample_rate=sample_rate, caption=f"Epoch {epoch}"),
                    "epoch": epoch,
                }
            )
            plt.close(fig)
finally:
    wandb.finish()
W&B Tables
wandb.Table() is one of the most useful features for model debugging because it captures row-level evidence. Instead of seeing “validation accuracy = 0.88,” you can inspect the exact predictions, labels, and confidence scores that created that number. It turns the UI into an interactive dataframe for failure analysis.
from __future__ import annotations

import wandb

run = wandb.init(project="table-demo")
try:
    prediction_table = wandb.Table(columns=["text", "actual_label", "predicted_label", "confidence"])
    rows = [
        ("refund still not received after 10 days", "billing", "billing", 0.97),
        ("the app crashes on launch on ios", "bug", "bug", 0.93),
        ("can you add dark mode to the dashboard", "feature", "billing", 0.41),
    ]
    for text, actual_label, predicted_label, confidence in rows:
        prediction_table.add_data(text, actual_label, predicted_label, confidence)
    wandb.log({"validation/prediction_table": prediction_table})
finally:
    wandb.finish()
Module 3: Hyperparameter Optimization (W&B Sweeps)
Sweeps separate parameter search orchestration from local execution. The W&B backend acts as the controller that proposes the next hyperparameter set, while your machine runs the training job as an agent. That division matters because it lets teams coordinate many experiments without custom orchestration glue.
Sweep Configuration
A sweep config defines what metric to optimize, how to search, and what parameter ranges are legal. Think of it as a search contract: you are telling the controller what it is allowed to try and how success should be measured.
sweep_config = {
    "method": "bayes",
    "metric": {
        "name": "val/accuracy",
        "goal": "maximize",
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-2,
        },
        "batch_size": {
            "values": [32, 64, 128],
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5,
        },
        "optimizer": {
            "values": ["adamw", "sgd"],
        },
    },
}
The same contract can also be written as YAML and registered through the CLI. This variant describes a random search that minimizes val/loss:

method: random
metric:
  name: val/loss
  goal: minimize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001
    max: 0.01
  batch_size:
    values: [32, 64, 128]
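The YAML route pairs with two CLI commands: one to register the sweep with the controller and one to start an agent on each machine. A minimal sketch; the sweep.yaml filename and the entity/project/sweep-ID path below are placeholders you would replace with your own:

```shell
# Register the sweep with the W&B controller; this prints a sweep ID.
wandb sweep sweep.yaml

# Start an agent that asks the controller for the next configuration.
# Run this on as many machines as you want to parallelize the search.
wandb agent my-entity/my-project/sweep_id
```

Each agent keeps pulling configurations until the controller stops proposing new ones or the agent reaches its run count.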
Execution
The local training function reads whatever parameters the W&B controller assigned to the current run. That means your training function stays standard, while the sweep controller decides which configuration should run next.
from __future__ import annotations

import random

import wandb

sweep_config = {
    "method": "random",
    "metric": {"name": "val/accuracy", "goal": "maximize"},
    "parameters": {
        "learning_rate": {"values": [1e-4, 5e-4, 1e-3]},
        "batch_size": {"values": [32, 64]},
    },
}


def train_sweep() -> None:
    with wandb.init(project="sweep-demo") as run:
        config = wandb.config
        # Replace this with the real training loop using config.learning_rate, etc.
        score = 0.70 + random.uniform(0.0, 0.2)
        wandb.log(
            {
                "val/accuracy": score,
                "learning_rate": config.learning_rate,
                "batch_size": config.batch_size,
            }
        )


sweep_id = wandb.sweep(sweep=sweep_config, project="sweep-demo")
wandb.agent(sweep_id, function=train_sweep, count=10)
Module 4: Data & Model Versioning (Artifacts)
You should never upload a 5GB model weight file through wandb.log() as if it were just another metric. Artifacts exist to track datasets, models, and derived assets with lineage. They answer the question “which exact asset version fed this run?” instead of leaving that answer in a README or file path convention.
Logging a Dataset / Model
Artifacts are the right primitive for durable assets because they version files, preserve relationships, and show downstream usage. This is closer to package management than metric logging.
from __future__ import annotations

import wandb

run = wandb.init(project="artifact-demo", job_type="train")
try:
    dataset_artifact = wandb.Artifact(name="raw-data", type="dataset")
    dataset_artifact.add_dir("data/raw")
    run.log_artifact(dataset_artifact)

    model_artifact = wandb.Artifact(name="fraud-model", type="model")
    model_artifact.add_file("checkpoints/model.pt")
    model_artifact.metadata = {
        "framework": "pytorch",
        "stage": "candidate",
    }
    run.log_artifact(model_artifact)
finally:
    wandb.finish()
Consuming an Artifact
Downstream jobs such as evaluation, batch inference, or deployment should consume a named artifact version instead of a loose file path. That is how you make a deployment pipeline deterministic instead of depending on whichever file happened to be in a directory.
from __future__ import annotations

from pathlib import Path

import wandb

run = wandb.init(project="artifact-demo", job_type="deploy")
try:
    # ":latest" follows the newest version; pin an explicit version such as
    # "fraud-model:v3" when the deployment must be fully deterministic.
    artifact = run.use_artifact("fraud-model:latest")
    artifact_dir: str = artifact.download()
    model_path = Path(artifact_dir) / "model.pt"
    print(f"Downloaded model artifact to {model_path}")
finally:
    wandb.finish()
Module 5: Seamless Framework Integrations
One reason teams adopt W&B quickly is that it integrates with common training frameworks without requiring manual metric plumbing everywhere. The core benefit is leverage: you keep the framework's training abstractions while still getting centralized tracking and visualization.
Hugging Face
For Hugging Face workflows, W&B usually becomes a one-line integration by setting report_to="wandb". That is useful because it turns standard trainer outputs into shared experiment records with almost no extra code.
from __future__ import annotations

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_strategy="steps",
    logging_steps=50,
    report_to="wandb",
)

# train_dataset and eval_dataset are assumed to be tokenized Dataset objects
# prepared earlier in the pipeline.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)
trainer.train()
PyTorch Lightning / Keras
The Lightning and Keras integrations work for the same reason: W&B plugs into the framework's callback or logger system, so training events are forwarded automatically instead of being logged by hand.
from lightning.pytorch.loggers import WandbLogger

wandb_logger = WandbLogger(project="lightning-project", log_model="all")
# Trainer(logger=wandb_logger, max_epochs=10)
from wandb.integration.keras import WandbCallback

# model.fit(x_train, y_train, callbacks=[WandbCallback()])
Module 6: Collaboration & Reports
Once runs are tracked consistently, the web UI becomes the team's shared control room. This is where W&B moves from a personal logger to a collaboration platform: dashboards, grouped runs, run comparison, tables, media, and reportable narratives all live in one place.
The UI Experience
The W&B UI is most powerful when your runs are grouped with intention. If every run is just dumped into a project with random names, the dashboard becomes clutter. If runs carry groups, tags, job types, and consistent metric names, the UI becomes queryable and comparable.
from __future__ import annotations

import wandb

run = wandb.init(
    project="ranking-system",
    group="ablation-learning-rate",
    job_type="train",
    tags=["ablation", "ranking", "resnet-encoder"],
    config={
        "learning_rate": 5e-4,
        "batch_size": 128,
        "encoder": "resnet50",
    },
)
try:
    # Define how metrics should be summarized in the dashboard.
    wandb.define_metric("epoch")
    wandb.define_metric("val/accuracy", summary="max")
    wandb.define_metric("val/loss", summary="min")
    wandb.log({"epoch": 1, "val/accuracy": 0.86, "val/loss": 0.42})
finally:
    wandb.finish()
In practice, group runs by the variable you are trying to compare, such as optimizer family, data slice, or backbone. That way the dashboard answers actual experiment questions instead of acting as a flat dump of run history.
W&B Reports
A W&B Report is essentially a live research memo: text plus embedded charts, tables, media, and comparisons pulled directly from tracked runs. The reason teams like reports is that they remove the stale-slide problem. When new runs finish, the embedded visuals can update without someone rebuilding screenshots manually.
A good report explains the hypothesis, links the relevant run groups, embeds the charts that support the conclusion, and records the next action for stakeholders.
- Use reports for decisions, not dumps. Frame the hypothesis, the comparison set, and the conclusion.
- Embed live panels. Prefer linked charts over screenshots so new results appear automatically.
- Share with stakeholders. Reports are ideal when product, research, and engineering need one living source of truth.
Module 7: Common Pitfalls & Anti-Patterns
W&B is useful, but it is easy to misuse. Most failure modes are not bugs in the SDK. They are workflow mistakes: logging too often, never closing runs, or treating artifacts like disposable uploads instead of reusable lineage objects.
1. Network Bottlenecks
Calling wandb.log() inside a very tight inner loop, especially every batch, can slow training and generate noisy plots. The fix is to aggregate metrics and log every N steps. The analogy is batching telemetry packets instead of opening a new network conversation for every tiny event.
from __future__ import annotations

import wandb

run = wandb.init(project="pitfalls-demo")
try:
    accumulated_loss: float = 0.0
    log_every: int = 50
    for step in range(1000):
        loss = 1.0 / (step + 1)
        accumulated_loss += loss
        if (step + 1) % log_every == 0:
            wandb.log({"train/loss": accumulated_loss / log_every, "step": step + 1}, step=step + 1)
            accumulated_loss = 0.0
finally:
    wandb.finish()
2. Orphaned Runs
Runs get orphaned when notebooks are interrupted or scripts crash before cleanup. That leaves runs stuck in a running or crashed state and makes the project messy. The fix is to always close the run, preferably in a finally block or context manager.
from __future__ import annotations

import wandb


def run_experiment() -> None:
    run = wandb.init(project="pitfalls-demo")
    try:
        wandb.log({"status": "started"})
        # Replace with real notebook or training logic.
        raise RuntimeError("simulated failure")
    except Exception as exc:
        wandb.log({"error_message": str(exc)})
        raise
    finally:
        wandb.finish()
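wandb.init() also works as a context manager: the run is finished automatically when the with block exits, even if an exception escapes. The guarantee is ordinary Python context management, which the stand-in below demonstrates without needing a live W&B session (FakeRun and fake_init are illustrative names, not wandb APIs):

```python
from __future__ import annotations

from contextlib import contextmanager
from typing import Iterator


class FakeRun:
    """Stand-in for the run object returned by wandb.init()."""

    def __init__(self) -> None:
        self.finished = False

    def finish(self) -> None:
        self.finished = True


@contextmanager
def fake_init() -> Iterator[FakeRun]:
    run = FakeRun()
    try:
        yield run
    finally:
        # Mirrors wandb's behavior: finish() runs even if the body raises.
        run.finish()


run: FakeRun | None = None
try:
    with fake_init() as run:
        raise RuntimeError("simulated notebook interruption")
except RuntimeError:
    pass

print(run.finished)  # -> True: the run was closed despite the crash
```

In real code the equivalent is simply `with wandb.init(project="pitfalls-demo") as run:` with no manual wandb.finish() call.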
3. Artifact Bloat
Uploading the same massive dataset on every run wastes bandwidth and storage. The correct pattern is to publish the dataset artifact once, version it when it truly changes, and have all downstream runs reference that artifact. Think of it like referencing a package version instead of emailing the same ZIP file to every teammate.
from __future__ import annotations

import wandb


def log_dataset_once() -> None:
    with wandb.init(project="artifact-governance", job_type="data-publish") as run:
        artifact = wandb.Artifact(name="customer-churn-dataset", type="dataset")
        artifact.add_dir("data/customer_churn")
        run.log_artifact(artifact)


def reference_existing_dataset() -> None:
    with wandb.init(project="artifact-governance", job_type="train") as run:
        artifact = run.use_artifact("customer-churn-dataset:latest")
        dataset_dir = artifact.download()
        print(dataset_dir)
Reference Links
- W&B Docs: official W&B documentation
- SDK: Python SDK reference
- Sweeps: W&B Sweeps guide
- Artifacts: Artifacts guide
- Reports: Reports guide
- Integrations: framework integrations