Back to handbooks index

MLOps & Model Deployment Handbook

Production-grade patterns for versioning data, tracking experiments, deploying models, and monitoring inference — because MLOps is not just DevOps for ML. It handles fundamentally new axes of variability: data and model weights.

MLOps · DVC · MLflow · W&B · FastAPI · Docker · ONNX · March 2026
Core principle — Reproducibility: Every tool, snippet, and workflow in this handbook answers one question: "If the engineer who built this quits tomorrow, can the team reproduce the exact same model?" If the answer is no, the system is not production-ready.

Table of Contents

📦
Module 1: Versioning Data & Models
Why Git fails for large files. DVC alongside Git. S3 remotes for datasets and weights.
📊
Module 2: Experiment Tracking & Registry
Weights & Biases integration. MLflow Model Registry. Staging → Production lifecycle.
🔄
Module 3: Continuous Training Pipelines
DVC Pipelines as DAGs. Selective re-execution. Automated retraining triggers.
🚀
Module 4: Deployment & Model Serving
FastAPI endpoints. Docker for ML. Batch inference. Edge & browser deployment.
👁
Module 5: Monitoring & Observability
Silent failures. Data drift vs. concept drift. Evidently AI and Prometheus/Grafana.
Module 6: Pitfalls & Best Practices
Training-serving skew. Over-engineering V1. Ignoring baselines. Feature stores.
💡
Traditional DevOps vs. MLOps: In DevOps, you version code. In MLOps, you version code + data + model weights + hyperparameters + environment. A single change in any of these dimensions produces a different model.
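A toy illustration of that multi-dimensional versioning: treat a model's identity as a hash over all five inputs that produced it. The helper name and example values below are hypothetical, not part of any MLOps tool.

```python
# Sketch (hypothetical helper): a model's identity is a hash over ALL the
# inputs that produced it, not just the code revision.
import hashlib
import json

def model_fingerprint(code_rev: str, data_hash: str, weights_hash: str,
                      hyperparams: dict, environment: dict) -> str:
    # Canonical JSON (sorted keys) so dict ordering cannot change the digest
    payload = json.dumps({
        "code": code_rev,
        "data": data_hash,
        "weights": weights_hash,
        "hyperparams": hyperparams,
        "environment": environment,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

base = model_fingerprint("abc123", "d41d8cd9", "9e107d9d",
                         {"max_depth": 8}, {"sklearn": "1.5.0"})
tweaked = model_fingerprint("abc123", "d41d8cd9", "9e107d9d",
                            {"max_depth": 9}, {"sklearn": "1.5.0"})
print(base != tweaked)   # a single hyperparameter change yields a different model
```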

Module 1 — Versioning Data & Models

The Problem: Why Git Fails for ML Artifacts

Git was designed for text files — source code, configs, markdown. It stores every version of every file in the .git directory. When you try to commit a 50GB training dataset or a 2GB .pt PyTorch checkpoint, three things go wrong immediately:

| Problem | What Happens | Impact |
| --- | --- | --- |
| Repository bloat | Every version of the dataset is stored in .git/objects | A 50GB dataset with 10 versions → 500GB repo |
| Clone times | git clone downloads the entire history | New team members wait hours to clone |
| GitHub/GitLab limits | File size limits (100MB GitHub, 5GB GitLab LFS) | Push rejected; workflow broken |
| No diffing | Git can't meaningfully diff binary files | git log becomes useless for data changes |
Real scenario: A data scientist runs git add training_data.parquet on a 12GB file. Git computes a SHA-1 hash of the entire file, copies it into .git/objects, and now the repo is permanently 12GB larger — even if you delete the file in the next commit. You'd need git filter-branch or BFG Repo Cleaner to fix it.
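That copy-on-add behavior is easy to verify. The sketch below (assumes a git binary on PATH; the 5MB random file is a small stand-in for a real dataset) shows .git/objects growing on git add alone, before any commit:

```python
# Sketch: demonstrate that `git add` copies file content into .git/objects
# immediately, before any commit. Random bytes are used because zlib cannot
# compress them, so the growth is visible.
import os
import subprocess
import tempfile

def dir_size(path: str) -> int:
    return sum(os.path.getsize(os.path.join(root, f))
               for root, _, files in os.walk(path) for f in files)

repo = tempfile.mkdtemp()
subprocess.run(["git", "init", "-q", repo], check=True)

with open(os.path.join(repo, "dataset.bin"), "wb") as fh:
    fh.write(os.urandom(5 * 1024 * 1024))          # 5MB "dataset"

before = dir_size(os.path.join(repo, ".git"))
subprocess.run(["git", "-C", repo, "add", "dataset.bin"], check=True)
after = dir_size(os.path.join(repo, ".git"))

print(f".git grew by {(after - before) / 1e6:.1f} MB on `git add` alone")
```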

Git LFS — A Partial Solution (and Why It's Not Enough)

Git Large File Storage (LFS) replaces large files with lightweight pointers inside the Git repo, while storing the actual file contents on a remote server. This helps, but it has critical limitations for ML: LFS storage is typically tied to your Git hosting provider (GitHub enforces bandwidth and storage quotas with per-GB fees), a fresh checkout still downloads the current version of every tracked file, and LFS only versions files: it has no concept of pipelines, parameters, or metrics, and no native support for pointing at object storage (S3, GCS) that you already control.

DVC — Data Version Control

DVC (Data Version Control) solves this by working alongside Git. The architecture is elegant:

DVC Architecture (git + dvc)

Git stores: Small pointer files (.dvc files) containing the MD5 hash of the actual data.
DVC stores: The actual heavy files (datasets, model weights) in a configured remote (S3, GCS, Azure Blob, local NAS).
Result: Your repo stays lightweight. Your data is versioned, shareable, and reproducible.
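In spirit, a pointer file is tiny: a content hash plus metadata. A minimal sketch of the idea (the concept only; DVC's real on-disk format and directory hashing differ in details):

```python
# Sketch of what a DVC-style pointer captures: hash + size + path.
# The pointer stays a few bytes no matter how big the data file is.
import hashlib
import os

def md5_of_file(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):  # stream in 1MB chunks
            h.update(chunk)
    return h.hexdigest()

def make_pointer(path: str) -> dict:
    return {"md5": md5_of_file(path), "size": os.path.getsize(path), "path": path}

with open("/tmp/data.bin", "wb") as fh:
    fh.write(b"x" * 1_000_000)                    # 1MB stand-in dataset

pointer = make_pointer("/tmp/data.bin")
print(pointer["size"], len(str(pointer)))         # 1MB of data, ~100-char pointer
```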

git init → dvc init → dvc add data/ → dvc remote add → dvc push → git commit

Step-by-Step: Initialize DVC in a Project

# Step 1: Create a new project and initialize Git + DVC
mkdir ml-fraud-detection && cd ml-fraud-detection
git init
pip install dvc[s3]          # Install DVC with S3 support (also: dvc[gs], dvc[azure])
dvc init                     # Creates .dvc/ directory and .dvcignore

# DVC init creates these files automatically:
#   .dvc/.gitignore    — prevents DVC cache from entering Git
#   .dvc/config        — DVC configuration
#   .dvcignore         — like .gitignore but for DVC

git add .dvc .dvcignore
git commit -m "Initialize DVC"

Step 2: Add a Large Dataset to DVC

# Assume we have a 12GB raw dataset
ls -lh data/
# data/transactions_2024.parquet   12GB
# data/labels.csv                  450MB

# Track the entire data/ directory with DVC
dvc add data/

# This creates two things:
# 1. data.dvc         — a small pointer file (YAML) with the MD5 hash
# 2. data/.gitignore  — prevents Git from tracking the actual data files

# Let's inspect the pointer file:
cat data.dvc
# outs:
# - md5: a1b2c3d4e5f6...  (hash of the data directory)
#   size: 12884901888
#   nfiles: 2
#   path: data

# Now commit the POINTER file to Git (not the data itself)
git add data.dvc data/.gitignore
git commit -m "Track training data v1 with DVC"

Step 3: Configure an S3 Remote and Push

# Configure S3 as the DVC remote storage
dvc remote add -d myremote s3://my-mlops-bucket/dvc-store
# -d flag makes this the default remote

# Optional: set S3 region and endpoint (for MinIO or other S3-compatible stores)
dvc remote modify myremote region us-east-1

# Push the actual data files to S3
dvc push
# Uploading data/transactions_2024.parquet to s3://my-mlops-bucket/dvc-store/...
# 2 files pushed

# Commit the remote configuration
git add .dvc/config
git commit -m "Configure S3 remote for DVC storage"
git push origin main

Step 4: Reproduce on Another Machine

# A teammate clones the repo — it's fast because no heavy files in Git
git clone https://github.com/team/ml-fraud-detection.git
cd ml-fraud-detection

# Pull the actual data from S3 using the pointer files
dvc pull
# Downloading data/transactions_2024.parquet from s3://...
# 2 files fetched, 2 files updated

# The data directory is now identical to the original — same hash, same bytes

Versioning Model Weights

# After training, track the model checkpoint
dvc add models/fraud_detector_v1.pt   # 2.1GB PyTorch model

git add models/fraud_detector_v1.pt.dvc models/.gitignore
git commit -m "Add trained model v1 — F1=0.87, AUC=0.94"
git tag -a "model-v1" -m "Baseline fraud detector"

dvc push

# Later, retrain with new data → track v2
dvc add models/fraud_detector_v2.pt
git add models/fraud_detector_v2.pt.dvc
git commit -m "Model v2 — F1=0.91, AUC=0.96 (added transaction velocity features)"
git tag -a "model-v2" -m "Improved with velocity features"
dvc push

# Roll back to v1 if v2 degrades in production
git checkout model-v1
dvc checkout              # Restores the v1 model weights from cache/remote
💡
DVC + Git Tags = Time Travel for ML. Every Git tag maps to a specific dataset version and model version. You can git checkout any tag and dvc checkout to get the exact data and model from that point in time. This is the foundation of ML reproducibility.

DVC Commands Cheat Sheet

| Command | Purpose | Git Equivalent |
| --- | --- | --- |
| dvc init | Initialize DVC in a Git repo | git init |
| dvc add <file> | Start tracking a file/directory | git add |
| dvc push | Upload tracked files to remote | git push |
| dvc pull | Download tracked files from remote | git pull |
| dvc checkout | Sync data files to match current Git commit | git checkout |
| dvc diff | Show changes in tracked data | git diff |
| dvc remote add | Configure storage backend | git remote add |
| dvc gc | Garbage collect unused cache | git gc |

Module 2 — Experiment Tracking & Model Registry

The Messy Notebook Problem

Every data science team has experienced this: a shared drive or repo full of files named model_v1.ipynb, model_v2_fixed.ipynb, model_v3_FINAL.ipynb, model_v3_FINAL_actually_final.ipynb. Inside each notebook, hyperparameters are hardcoded, results are printed to stdout, and there's no way to compare runs systematically.

Anti-Pattern: The Notebook Graveyard

Without experiment tracking, you lose: (1) which hyperparameters produced which metrics, (2) which dataset version was used, (3) which code version was run, (4) system metrics (GPU utilization, memory), and (5) the ability to compare 50 runs side-by-side.

Experiment tracking tools solve this by automatically logging every detail of every training run into a centralized database with a UI for comparison, filtering, and analysis.
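A toy version of what these tools automate: append every run's config and metrics to a JSONL log, then query for the best run. All names and values here are made up for illustration; real trackers add UIs, system metrics, and artifact storage on top.

```python
# Toy experiment log (illustrative only): one JSON line per training run.
import json

LOG = "/tmp/runs.jsonl"

def log_run(config: dict, metrics: dict) -> None:
    # Append-only: past runs are never overwritten
    with open(LOG, "a") as fh:
        fh.write(json.dumps({"config": config, "metrics": metrics}) + "\n")

def best_run(metric: str) -> dict:
    with open(LOG) as fh:
        runs = [json.loads(line) for line in fh]
    return max(runs, key=lambda r: r["metrics"][metric])

open(LOG, "w").close()  # start with an empty log
log_run({"n_estimators": 100, "max_depth": 8}, {"f1": 0.87})
log_run({"n_estimators": 300, "max_depth": 8}, {"f1": 0.91})
print(best_run("f1")["config"])   # {'n_estimators': 300, 'max_depth': 8}
```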

Weights & Biases (W&B)

W&B is a hosted experiment tracking platform that provides real-time dashboards, hyperparameter sweeps, model versioning, and team collaboration. It's the industry standard for experiment tracking in research and production ML teams.

W&B Integration with a Scikit-Learn Training Loop

# pip install wandb scikit-learn pandas
import wandb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, classification_report
import joblib

# ── Step 1: Initialize a W&B run ──────────────────────────────────────
# Every call to wandb.init() creates a new "run" in your W&B project.
# All logs, metrics, and artifacts are grouped under this run.
run = wandb.init(
    project="fraud-detection",       # Groups runs under one project dashboard
    name="rf-baseline-v1",            # Human-readable name for this specific run
    config={                            # Hyperparameters — logged and searchable
        "model_type": "RandomForest",
        "n_estimators": 200,
        "max_depth": 15,
        "min_samples_split": 5,
        "class_weight": "balanced",  # Important for imbalanced fraud data
        "dataset_version": "v2.1",
        "feature_count": 47,
        "test_size": 0.2,
    },
    tags=["baseline", "sklearn", "fraud"],
)

# ── Step 2: Load data and split ────────────────────────────────────────
df = pd.read_parquet("data/transactions_2024.parquet")
X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=run.config["test_size"], stratify=y, random_state=42
)

# Log dataset statistics — helps debug data issues later
wandb.log({
    "dataset/total_samples": len(df),
    "dataset/fraud_ratio": y.mean(),
    "dataset/train_size": len(X_train),
    "dataset/test_size": len(X_test),
})

# ── Step 3: Train the model ────────────────────────────────────────────
model = RandomForestClassifier(
    n_estimators=run.config["n_estimators"],
    max_depth=run.config["max_depth"],
    min_samples_split=run.config["min_samples_split"],
    class_weight=run.config["class_weight"],
    random_state=42,
    n_jobs=-1,                      # Use all CPU cores
)
model.fit(X_train, y_train)

# ── Step 4: Evaluate and log metrics ───────────────────────────────────
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

# wandb.log() sends metrics to the dashboard in real-time
wandb.log({
    "metrics/f1_score": f1,
    "metrics/auc_roc": auc,
    "metrics/accuracy": model.score(X_test, y_test),
})

# Log feature importances as a W&B bar chart
feature_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False).head(20)

wandb.log({
    "feature_importances": wandb.Table(dataframe=feature_importance)
})

# ── Step 5: Save model as W&B Artifact ─────────────────────────────────
# Artifacts provide versioned, immutable model storage
joblib.dump(model, "models/rf_fraud_v1.joblib")
artifact = wandb.Artifact(
    name="fraud-detector",
    type="model",
    description="Random Forest baseline for fraud detection",
    metadata={"f1": f1, "auc": auc},
)
artifact.add_file("models/rf_fraud_v1.joblib")
run.log_artifact(artifact)

# ── Step 6: Close the run ──────────────────────────────────────────────
run.finish()
print(f"Run complete — F1: {f1:.4f}, AUC: {auc:.4f}")

W&B with a PyTorch Training Loop

# pip install wandb torch
import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Initialize with PyTorch-specific config
run = wandb.init(
    project="fraud-detection",
    name="mlp-v1",
    config={
        "model_type": "MLP",
        "hidden_dim": 256,
        "learning_rate": 1e-3,
        "epochs": 50,
        "batch_size": 512,
        "optimizer": "AdamW",
        "weight_decay": 1e-4,
    },
)

# ... (model definition and data loading omitted for brevity) ...

# Training loop with W&B logging
for epoch in range(run.config["epochs"]):
    model.train()
    epoch_loss = 0.0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        preds = model(batch_X)
        loss = criterion(preds, batch_y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(train_loader)

    # Log per-epoch metrics — these appear as interactive charts in W&B
    wandb.log({
        "epoch": epoch,
        "train/loss": avg_loss,
        "train/learning_rate": optimizer.param_groups[0]["lr"],
    })

    # Validation every 5 epochs
    if epoch % 5 == 0:
        val_loss, val_f1 = evaluate(model, val_loader)  # Custom eval function
        wandb.log({
            "val/loss": val_loss,
            "val/f1_score": val_f1,
        })

# wandb.watch() — auto-logs gradients and parameter histograms
# Call this BEFORE the training loop for full gradient tracking
# wandb.watch(model, log="all", log_freq=100)

run.finish()

MLflow — Model Registry & Lifecycle

While W&B excels at experiment tracking and visualization, MLflow is the open-source standard for the model lifecycle: logging models, versioning them, and transitioning them through stages (Staging → Production → Archived). Many teams use both: W&B for experiment dashboards and MLflow for model registry and deployment.

| Capability | Weights & Biases | MLflow |
| --- | --- | --- |
| Experiment tracking UI | Best-in-class dashboards and comparison | Good, but more basic |
| Model Registry | Artifact-based, less structured | First-class with staging/production |
| Hosting | SaaS (hosted); free tier available | Self-hosted (or Databricks managed) |
| Cost | Free for individuals; paid for teams | Free and open-source |
| Deployment | Not a deployment tool | Built-in mlflow models serve |
| Best for | Research teams, rapid prototyping | Production pipelines, model governance |

MLflow: Log, Register, and Promote a Model

# pip install mlflow scikit-learn
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
import pandas as pd

# ── Step 1: Set the tracking URI ───────────────────────────────────────
# MLflow stores runs in a "tracking server" — local or remote
mlflow.set_tracking_uri("http://mlflow-server.internal:5000")  # Or "sqlite:///mlflow.db" for local
mlflow.set_experiment("fraud-detection")

# ── Step 2: Start a run and log everything ─────────────────────────────
with mlflow.start_run(run_name="gbm-tuned-v3") as run:

    # Log hyperparameters
    params = {
        "n_estimators": 300,
        "max_depth": 8,
        "learning_rate": 0.05,
        "subsample": 0.8,
        "min_samples_leaf": 20,
    }
    mlflow.log_params(params)

    # Load data and train
    df = pd.read_parquet("data/transactions_2024.parquet")
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=["is_fraud"]), df["is_fraud"],
        test_size=0.2, stratify=df["is_fraud"], random_state=42
    )

    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate and log metrics
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]

    metrics = {
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_prob),
        "accuracy": model.score(X_test, y_test),
    }
    mlflow.log_metrics(metrics)

    # ── Step 3: Log the trained model ──────────────────────────────────
    # mlflow.sklearn.log_model serializes the model with its conda/pip
    # environment, so it can be loaded anywhere without dependency issues
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",                     # Subdirectory in the run artifacts
        registered_model_name="fraud-detector",     # Auto-registers in Model Registry
        input_example=X_test.iloc[:5],              # Saves a sample for schema inference
    )

    # Log additional artifacts (feature importance plot, confusion matrix, etc.)
    mlflow.log_artifact("reports/confusion_matrix.png")

    print(f"Run ID: {run.info.run_id}")
    print(f"F1: {metrics['f1_score']:.4f}, AUC: {metrics['auc_roc']:.4f}")

Transitioning Model Stages in the Registry

# pip install mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server.internal:5000")

# ── List all versions of the "fraud-detector" model ────────────────────
for mv in client.search_model_versions("name='fraud-detector'"):
    print(f"Version {mv.version}: stage={mv.current_stage}, run_id={mv.run_id}")

# ── Promote version 3 to Staging for A/B testing ──────────────────────
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Staging",
    archive_existing_versions=False,    # Keep current Staging version for comparison
)

# ── After validation, promote to Production ────────────────────────────
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production",
    archive_existing_versions=True,    # Archive previous Production version
)

# ── Load the Production model anywhere in your codebase ────────────────
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")
predictions = model.predict(new_data)   # Works with any MLflow-logged model flavor
💡
Best practice — use both tools: Log to W&B during experimentation for rich dashboards and hyperparameter sweeps. Once you've identified the best model, log it to MLflow's Model Registry for the Staging → Production promotion workflow. This gives you the best of both worlds.

Module 3 — Continuous Training & Pipelines

From Monolithic Script to DAG

Most ML projects start as a single Python script or Jupyter notebook: load data, preprocess, train, evaluate — all in one file. This seems simple, but it has a fundamental problem: when you change one thing, you have to re-run everything.

Imagine your training pipeline takes 4 hours. You fix a bug in the evaluation metric. Without a pipeline, you rerun all 4 hours. With a pipeline, only the evaluation step reruns (2 minutes).

Pipeline Architecture — DAG (Directed Acyclic Graph)

An ML pipeline is a DAG — a directed acyclic graph of steps where each step declares its dependencies (inputs) and outputs. The pipeline runner (DVC, Airflow, Kubeflow) computes which steps need rerunning based on what changed.

data_prep.py → feature_eng.py → train.py → evaluate.py → metrics.json

Each step is a standalone script with clearly defined inputs and outputs. If you change feature_eng.py, only feature_eng, train, and evaluate rerun. data_prep is skipped because its inputs haven't changed.
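That selective-rerun rule is plain graph reachability: given the stage DAG and the set of changed stages, everything downstream must rerun. A toy sketch (stage names mirror the diagram above; this is not DVC's implementation):

```python
# Toy pipeline DAG: stage -> stages that consume its outputs.
downstream = {
    "data_prep": ["feature_eng"],
    "feature_eng": ["train"],
    "train": ["evaluate"],
    "evaluate": [],
}

def stages_to_rerun(changed: set[str]) -> set[str]:
    # Walk from every changed stage; everything reachable must rerun
    to_run, frontier = set(changed), list(changed)
    while frontier:
        for child in downstream[frontier.pop()]:
            if child not in to_run:
                to_run.add(child)
                frontier.append(child)
    return to_run

print(sorted(stages_to_rerun({"feature_eng"})))
# ['evaluate', 'feature_eng', 'train']  (data_prep is skipped)
```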

DVC Pipelines — dvc.yaml

DVC Pipelines define your ML workflow in a dvc.yaml file. Each stage declares its command, dependencies (scripts + data), parameters, and outputs. DVC computes MD5 hashes of everything and only reruns stages where something changed.

Complete dvc.yaml Pipeline

# dvc.yaml — defines the full ML training pipeline
stages:

  # Stage 1: Data Preparation
  data_prep:
    cmd: python src/data_prep.py
    deps:                            # If any dependency changes, this stage reruns
      - src/data_prep.py             # The script itself
      - data/raw/                    # Raw data directory
    params:                          # Read from params.yaml — changes trigger rerun
      - data_prep.test_size
      - data_prep.random_seed
    outs:                            # Stage outputs — cached and tracked by DVC
      - data/processed/train.parquet
      - data/processed/test.parquet

  # Stage 2: Feature Engineering
  feature_engineering:
    cmd: python src/feature_eng.py
    deps:
      - src/feature_eng.py
      - data/processed/train.parquet
      - data/processed/test.parquet
    params:
      - features.velocity_window
      - features.amount_bins
    outs:
      - data/features/train_features.parquet
      - data/features/test_features.parquet

  # Stage 3: Model Training
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/train_features.parquet
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.joblib          # Trained model — cached for reproduction

  # Stage 4: Evaluation
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/features/test_features.parquet
    metrics:                         # Special: tracked by DVC for comparison
      - reports/metrics.json:
          cache: false                # Don't cache — always regenerate
    plots:                           # Generates plots viewable with dvc plots
      - reports/confusion_matrix.csv:
          x: predicted
          y: actual

The params.yaml Configuration File

# params.yaml — single source of truth for all hyperparameters
data_prep:
  test_size: 0.2
  random_seed: 42

features:
  velocity_window: 3600            # Seconds — transaction velocity feature
  amount_bins: 10                  # Number of bins for amount discretization

train:
  n_estimators: 300
  max_depth: 8
  learning_rate: 0.05

Running and Reproducing the Pipeline

# Run the full pipeline — DVC determines what needs (re)running
dvc repro

# Output (first run — everything runs):
# Running stage 'data_prep':
#   > python src/data_prep.py
# Running stage 'feature_engineering':
#   > python src/feature_eng.py
# Running stage 'train':
#   > python src/train.py
# Running stage 'evaluate':
#   > python src/evaluate.py

# ── Now change a hyperparameter: ───────────────────────────────────────
# Edit params.yaml → train.n_estimators: 500
dvc repro

# Output (only affected stages rerun):
# Stage 'data_prep' didn't change, skipping
# Stage 'feature_engineering' didn't change, skipping
# Running stage 'train':                    ← only train + evaluate rerun
#   > python src/train.py
# Running stage 'evaluate':
#   > python src/evaluate.py

# ── Compare metrics across experiments: ────────────────────────────────
dvc metrics diff
# Path              Metric    Old      New
# reports/metrics   f1_score  0.87     0.91
# reports/metrics   auc_roc   0.94     0.96

# ── View plots: ────────────────────────────────────────────────────────
dvc plots show reports/confusion_matrix.csv
How DVC knows what to skip: DVC computes MD5 hashes of every dependency (script files, data files, parameters). It stores these hashes in dvc.lock. On dvc repro, it recomputes hashes and only reruns stages where at least one dependency hash has changed. This is the same principle as make — but for ML.
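The same idea in miniature: hash each dependency, compare against the hashes recorded after the last run, and skip the stage when nothing moved. A toy sketch, not DVC's actual lockfile format:

```python
# Toy dvc.lock-style skip check: rerun a stage only if a dependency hash changed.
import hashlib

def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def needs_rerun(deps: dict, lock: dict) -> bool:
    current = {name: md5(content) for name, content in deps.items()}
    return current != lock.get("deps")

deps = {"src/train.py": b"print('train')", "params.yaml": b"max_depth: 8"}
# "Previous run": record the dependency hashes, like dvc.lock does
lock = {"deps": {name: md5(content) for name, content in deps.items()}}

print(needs_rerun(deps, lock))        # False: nothing changed, stage is skipped
deps["params.yaml"] = b"max_depth: 9"
print(needs_rerun(deps, lock))        # True: a parameter hash moved
```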

Triggering Pipelines Automatically (CI/CD)

# .github/workflows/retrain.yml — GitHub Actions pipeline trigger
name: Retrain Model
on:
  schedule:
    - cron: "0 2 * * 1"            # Every Monday at 2am — weekly retraining
  push:
    paths:
      - "data/**"                   # Retrain when data changes
      - "src/**"                    # Retrain when code changes
      - "params.yaml"              # Retrain when hyperparams change

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: dvc pull                # Fetch data from S3
      - run: dvc repro               # Run the pipeline
      - run: dvc push                # Push new outputs back to S3
      - run: |
          git config user.name "ci-bot"          # CI runners need a git identity to commit
          git config user.email "ci-bot@users.noreply.github.com"
          git add dvc.lock reports/metrics.json
          git commit -m "[CI] Retrain model — $(date +%Y-%m-%d)" || echo "No changes to commit"
          git push

Module 4 — Deployment Environments & Model Serving

A trained model sitting in a .joblib file is worthless to the business. It needs to be served — accessible to other systems to make predictions. There are four primary deployment patterns, each suited to different latency, throughput, and cost requirements.

Real-Time REST API
Latency: <100ms. Use for: fraud detection at checkout, recommendation engines (<500 req/s). Stack: FastAPI + Uvicorn behind a load balancer.
📦
Batch Predictions
Latency: hours. Use for: scoring 10M users overnight, weekly churn predictions. Stack: PySpark / Databricks / Airflow + S3.
🔌
Streaming
Latency: seconds. Use for: real-time anomaly detection on IoT data. Stack: Kafka + Faust/Spark Streaming + model loaded in-process.
📱
Edge / Browser
Latency: <10ms. Use for: on-device inference (mobile, IoT). Stack: TensorFlow Lite, ONNX Runtime, Transformers.js.

Real-Time Serving with FastAPI

FastAPI is the standard Python framework for serving ML models via REST APIs. It's fast (built on Starlette + Uvicorn), auto-generates OpenAPI docs, and supports Pydantic for request/response validation — preventing malformed input from reaching your model.

Production FastAPI Model Server

# app/main.py — FastAPI model serving endpoint
# pip install fastapi uvicorn mlflow pydantic
import os
import logging
from contextlib import asynccontextmanager

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

# ── Request / Response schemas ─────────────────────────────────────────
# Pydantic validates every incoming request automatically.
# If a field is missing or the wrong type, FastAPI returns a 422 error.
class TransactionInput(BaseModel):
    amount: float = Field(gt=0, description="Transaction amount in USD")
    merchant_category: str = Field(description="MCC code (e.g., 'grocery', 'travel')")
    hour_of_day: int = Field(ge=0, le=23)
    day_of_week: int = Field(ge=0, le=6)
    transaction_velocity: float = Field(ge=0, description="Transactions in last hour")
    distance_from_home: float = Field(ge=0, description="Miles from user's home address")

class PredictionOutput(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_version: str

# ── Model loading at startup ───────────────────────────────────────────
# Load the model ONCE at startup, not per-request.
# The lifespan context manager ensures cleanup on shutdown.
ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model from MLflow Model Registry — "Production" stage
    model_uri = os.getenv("MODEL_URI", "models:/fraud-detector/Production")
    logger.info(f"Loading model from: {model_uri}")
    ml_models["fraud_detector"] = mlflow.pyfunc.load_model(model_uri)
    ml_models["model_version"] = model_uri
    logger.info("Model loaded successfully")
    yield
    # Cleanup on shutdown
    ml_models.clear()

app = FastAPI(
    title="Fraud Detection API",
    version="1.0.0",
    lifespan=lifespan,
)

# ── Health check endpoint ──────────────────────────────────────────────
@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": "fraud_detector" in ml_models}

# ── Prediction endpoint ────────────────────────────────────────────────
@app.post("/predict", response_model=PredictionOutput)
async def predict(transaction: TransactionInput):
    try:
        # Convert the validated Pydantic model to a DataFrame
        # MLflow models expect pandas DataFrames as input
        input_df = pd.DataFrame([transaction.model_dump()])

        # Get prediction. mlflow.pyfunc delegates to the underlying flavor's
        # predict(); a plain sklearn classifier would return class labels here,
        # so this assumes the registered model was logged to output a fraud
        # probability (e.g. a pyfunc wrapper around predict_proba).
        model = ml_models["fraud_detector"]
        probability = float(model.predict(input_df)[0])

        return PredictionOutput(
            is_fraud=probability > 0.5,
            fraud_probability=round(probability, 4),
            model_version=ml_models["model_version"],
        )
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")

# ── Run with: uvicorn app.main:app --host 0.0.0.0 --port 8000 ─────────

Testing the API

# Test with curl
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "amount": 4999.99,
    "merchant_category": "electronics",
    "hour_of_day": 3,
    "day_of_week": 2,
    "transaction_velocity": 12.0,
    "distance_from_home": 850.5
  }'

# Response:
# {
#   "is_fraud": true,
#   "fraud_probability": 0.9213,
#   "model_version": "models:/fraud-detector/Production"
# }

Containerization — Production Dockerfile

Packaging your FastAPI model server into a Docker container ensures it runs identically everywhere — your laptop, staging, production, Kubernetes. The key challenge with ML containers is image size: PyTorch + CUDA can exceed 8GB. Use multi-stage builds and CPU-only variants to keep images lean.

Production Dockerfile for ML

# Dockerfile — Production ML model server
# Multi-stage build to minimize final image size

# ── Stage 1: Builder (install dependencies) ────────────────────────────
FROM python:3.11-slim AS builder

# Prevents Python from writing .pyc files and enables unbuffered output
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /build

# Install dependencies in a virtual environment for clean copying
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy and install requirements FIRST (Docker layer caching)
# This layer is cached unless requirements.txt changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ── Stage 2: Runtime (minimal image) ───────────────────────────────────
FROM python:3.11-slim AS runtime

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Copy the virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Create a non-root user for security
RUN groupadd -r mluser && useradd -r -g mluser mluser

WORKDIR /app

# Copy application code
COPY app/ ./app/
COPY models/ ./models/

# Switch to non-root user
USER mluser

# Expose the port
EXPOSE 8000

# Health check — Docker/K8s uses this to determine container health
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run with Uvicorn — 4 worker processes for production
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]

requirements.txt (CPU-optimized)

# requirements.txt — use CPU-only PyTorch to save ~5GB in image size
fastapi==0.115.0
uvicorn[standard]==0.30.0
mlflow==2.16.0
pandas==2.2.0
scikit-learn==1.5.0
pydantic==2.9.0

# For PyTorch models, use the CPU-only variant:
# --extra-index-url https://download.pytorch.org/whl/cpu
# torch==2.4.0+cpu

Build and Run

# Build the image (tag with model version for traceability)
docker build -t fraud-detector:v3-2026.03 .

# Run locally for testing
docker run -p 8000:8000 \
  -e MODEL_URI="models:/fraud-detector/Production" \
  -e MLFLOW_TRACKING_URI="http://mlflow-server:5000" \
  fraud-detector:v3-2026.03

# Check image size — target under 1GB for sklearn models
docker images fraud-detector
# REPOSITORY        TAG             SIZE
# fraud-detector    v3-2026.03      487MB   ← Good: sklearn + FastAPI
Image size matters: A 6GB Docker image means 6GB pulled on every Kubernetes pod startup. With 10 replicas auto-scaling, that's 60GB of network transfer. Use python:3.11-slim (not python:3.11), install CPU-only variants of PyTorch/TensorFlow, and use multi-stage builds. Target: <500MB for sklearn, <2GB for PyTorch CPU.

Batch Predictions

Not every prediction needs to happen in real-time. Batch inference — scoring millions of records on a schedule — is often cheaper, simpler, and more reliable than maintaining a real-time API.

| Aspect | Real-Time API | Batch Predictions |
| --- | --- | --- |
| Latency | <100ms per request | Minutes to hours for entire dataset |
| Infrastructure | Always-on servers, load balancers, auto-scaling | Spin up, score, spin down (ephemeral) |
| Cost | 24/7 compute costs | Pay only when the job runs |
| Error handling | Must handle failures in real-time | Retry the entire job if it fails |
| Use cases | Fraud at checkout, chatbots | Daily churn scoring, weekly recommendations, marketing segmentation |

PySpark Batch Scoring Example

# batch_score.py — Score 10M users with PySpark
# Runs as a scheduled Databricks job or Airflow task
import mlflow.pyfunc
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
import pandas as pd

spark = SparkSession.builder.appName("batch-fraud-scoring").getOrCreate()

# Load the Production model from MLflow.
# Note: referencing `model` inside the UDF below serializes a copy of it to
# every executor; mlflow.pyfunc.spark_udf(spark, model_uri) wraps this same
# pattern and is often the simpler option.
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

# Load the user transaction data (e.g., from a Delta Lake / S3 path)
transactions_df = spark.read.parquet("s3://data-lake/transactions/2026-03-30/")

# Define a Pandas UDF for distributed scoring across the Spark cluster
@pandas_udf("double")
def score_batch(amount: pd.Series, velocity: pd.Series, distance: pd.Series) -> pd.Series:
    # This function runs on each Spark executor in parallel
    input_df = pd.DataFrame({
        "amount": amount,
        "transaction_velocity": velocity,
        "distance_from_home": distance,
    })
    return pd.Series(model.predict(input_df))

# Score all 10M rows — distributed across the Spark cluster
scored_df = transactions_df.withColumn(
    "fraud_probability",
    score_batch("amount", "transaction_velocity", "distance_from_home")
)

# Write results back to the data lake for downstream consumers
scored_df.write.mode("overwrite").parquet("s3://data-lake/scores/fraud/2026-03-30/")

Edge & Browser Deployment

For ultra-low-latency or offline scenarios, models can be deployed directly to edge devices or web browsers:

TargetFormatToolUse Case
Mobile (Android/iOS)TFLite, Core MLTensorFlow Lite, Core ML ToolsOn-device image classification, voice commands
IoT / EmbeddedONNX, TFLite MicroONNX Runtime, TF MicroSensor anomaly detection, predictive maintenance
Web BrowserONNX, TF.jsTransformers.js, ONNX Runtime WebClient-side text classification, image segmentation
# Convert a PyTorch model to ONNX for cross-platform deployment
import torch

# Assume `model` is a trained PyTorch model and `dummy_input` matches input shape
dummy_input = torch.randn(1, 47)   # Batch of 1, 47 features

torch.onnx.export(
    model,
    dummy_input,
    "models/fraud_detector.onnx",
    input_names=["features"],
    output_names=["fraud_probability"],
    dynamic_axes={"features": {0: "batch_size"}},  # Allow variable batch size
    opset_version=17,
)

# The .onnx file can now run on:
# - ONNX Runtime (Python, C++, C#, Java)
# - ONNX Runtime Web (browser via WebAssembly)
# - ONNX Runtime Mobile (Android, iOS)

Module 5 — Model Monitoring & Observability

Silent Failures: The Invisible Problem

When a web server fails, it crashes. You get a 500 error, a stack trace, and an alert fires at 3am. When an ML model fails, nothing crashes. The model still returns predictions — they're just wrong. A fraud detection model might start approving fraudulent transactions. A recommendation engine might start surfacing irrelevant products. The API returns 200 OK every time.

ML Models Don't Crash — They Degrade Silently critical

Traditional monitoring (latency, error rates, CPU) catches infrastructure failures. But an ML model can have perfect latency and zero errors while its predictions are completely wrong. You need statistical monitoring — tracking the distribution of inputs and outputs against the training baseline.

This is why ML monitoring is fundamentally different from software monitoring: alongside the usual system metrics, you need to track input feature distributions, the distribution of the model's predictions, and (once delayed ground-truth labels arrive) actual accuracy.

Data Drift vs. Concept Drift

These are the two primary modes of ML model degradation. Understanding the difference is essential for building the right monitoring and retraining strategy.

📊
Data Drift (Covariate Shift)
Definition: The distribution of input features changes, but the relationship between features and target stays the same.

Example: Your fraud model was trained when the average transaction was $50. After a holiday season, the average jumps to $200. The model receives inputs it rarely saw during training.

Detection: Compare input feature distributions (mean, std, quantiles) between training data and live production data.

Fix: Retrain on recent data that includes the new distribution.
🔀
Concept Drift
Definition: The relationship between features and the target variable changes. The "rules" the model learned are no longer true.

Example: During COVID-19, millions of legitimate users suddenly started making large online purchases from home. The patterns that previously indicated fraud (large online orders, unusual locations) became normal behavior.

Detection: Monitor prediction accuracy against delayed ground truth labels. Statistical tests on P(Y|X).

Fix: Retrain with relabeled data that reflects the new reality.
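Before reaching for a full monitoring library, the core of data drift detection can be sketched in a few lines. The following is a minimal illustration, not Evidently's implementation: a hand-rolled two-sample Kolmogorov-Smirnov statistic on one numeric feature, with invented sample values and an arbitrary 0.1 alert threshold.

```python
# Minimal illustration (not the Evidently implementation): detect data drift
# on a single numeric feature with a two-sample KS statistic.
# Sample values and the 0.1 threshold are arbitrary choices for this sketch.

def ks_statistic(reference: list[float], current: list[float]) -> float:
    """Maximum distance between the two empirical CDFs."""
    all_values = sorted(set(reference) | set(current))

    def ecdf(sample, x):
        # Fraction of the sample that is <= x
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(reference, x) - ecdf(current, x)) for x in all_values)

# Training-time amounts vs. a drifted production sample
train_amounts = [42.0, 55.0, 48.0, 60.0, 51.0, 45.0, 58.0, 49.0]
prod_amounts = [180.0, 210.0, 195.0, 250.0, 175.0, 220.0, 205.0, 190.0]

drift_score = ks_statistic(train_amounts, prod_amounts)
print(f"KS statistic: {drift_score:.2f}")  # 1.00: the samples do not overlap
if drift_score > 0.1:
    print("Feature 'amount' has drifted: investigate before trusting predictions")
```

In practice you would run a test per feature against a stored training sample, which is exactly what Evidently's `DataDriftTable` automates below.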

Monitoring with Evidently AI

Evidently AI is an open-source library for calculating data drift, prediction drift, and model performance metrics. It generates HTML reports and can export metrics to Prometheus/Grafana for real-time dashboards.

# pip install evidently pandas
import pandas as pd
from evidently.report import Report
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)

# ── Load reference (training) data and current (production) data ───────
# The reference dataset is the data the model was trained on.
# The current dataset is a sample of recent production inference data.
reference_data = pd.read_parquet("data/training_data_sample.parquet")
current_data = pd.read_parquet("data/production_data_last_7_days.parquet")

# ── Generate a Data Drift Report ───────────────────────────────────────
# Evidently compares the statistical distribution of every feature
# between reference and current using appropriate statistical tests:
# - Kolmogorov-Smirnov test for numerical features
# - Chi-squared test for categorical features
# - Jensen-Shannon divergence for large datasets

drift_report = Report(metrics=[
    DatasetDriftMetric(),          # Overall: "Has the dataset drifted?"
    DataDriftTable(),              # Per-feature drift results with p-values
    ColumnDriftMetric(             # Deep dive on a critical feature
        column_name="amount"
    ),
])

drift_report.run(
    reference_data=reference_data,
    current_data=current_data,
)

# Save as an interactive HTML report
drift_report.save_html("reports/data_drift_report.html")

# ── Extract metrics programmatically for alerting ──────────────────────
result = drift_report.as_dict()
dataset_drift = result["metrics"][0]["result"]

print(f"Dataset drift detected: {dataset_drift['dataset_drift']}")
print(f"Share of drifted features: {dataset_drift['share_of_drifted_columns']:.1%}")
print(f"Number of drifted columns: {dataset_drift['number_of_drifted_columns']}")

# ── Trigger alert if drift exceeds threshold ───────────────────────────
if dataset_drift["share_of_drifted_columns"] > 0.3:  # More than 30% of features drifted
    print("⚠ ALERT: Significant data drift detected — consider retraining")
    # In production: send to PagerDuty, Slack, or trigger retrain pipeline

Prometheus + Grafana for Real-Time Monitoring

For real-time monitoring of a deployed model API, export custom metrics to Prometheus and visualize them in Grafana dashboards.

# app/metrics.py — Custom Prometheus metrics for ML model monitoring
# pip install prometheus-client
from prometheus_client import (
    Counter, Histogram, Gauge, Summary,
    generate_latest, CONTENT_TYPE_LATEST,
)
from starlette.responses import Response

# ── Define ML-specific metrics ─────────────────────────────────────────

# Count predictions by outcome — alerts if fraud ratio spikes
prediction_counter = Counter(
    "model_predictions_total",
    "Total predictions made",
    ["outcome"],               # Labels: "fraud" or "legitimate"
)

# Track prediction confidence distribution
prediction_confidence = Histogram(
    "model_prediction_confidence",
    "Distribution of model confidence scores",
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
)

# Track input feature distributions (for drift detection)
input_amount = Summary(
    "model_input_amount",
    "Distribution of transaction amounts seen by the model",
)

# Track inference latency
inference_latency = Histogram(
    "model_inference_duration_seconds",
    "Model inference latency",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# ── Usage in the FastAPI predict endpoint: ─────────────────────────────
# prediction_counter.labels(outcome="fraud").inc()
# prediction_confidence.observe(0.92)
# input_amount.observe(transaction.amount)
# with inference_latency.time():
#     result = model.predict(input_df)

# ── Expose /metrics endpoint for Prometheus scraping ───────────────────
async def metrics_endpoint(request):
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
💡
Key Grafana alerts to configure: (1) Fraud prediction ratio deviates >2σ from baseline (data drift). (2) Average prediction confidence drops below 0.6 (model uncertainty). (3) P99 inference latency exceeds 200ms (performance degradation). (4) Input feature amount mean deviates >20% from training mean.
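Alert (1) above is a z-score check in disguise. As a plain-Python illustration of the same logic (the daily ratios and the seven-day baseline window are invented for this example):

```python
# Illustrative z-score check behind alert (1): compare today's fraud
# prediction ratio against a baseline of recent daily ratios.
# All numbers below are made up; the 2-sigma threshold mirrors the alert text.
from statistics import mean, stdev

baseline_daily_fraud_ratios = [0.021, 0.019, 0.023, 0.020, 0.022, 0.018, 0.021]
todays_fraud_ratio = 0.041  # Sudden spike: possible drift or a fraud wave

mu = mean(baseline_daily_fraud_ratios)
sigma = stdev(baseline_daily_fraud_ratios)
z = (todays_fraud_ratio - mu) / sigma

if abs(z) > 2:
    print(f"ALERT: fraud ratio {todays_fraud_ratio:.3f} is {z:.1f} sigma from baseline")
```

Grafana evaluates the same expression continuously against the Prometheus `model_predictions_total` counters instead of a hand-fed list.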

Module 6 — Common Pitfalls & Anti-Patterns

Pitfall 1: Training-Serving Skew

The #1 Cause of Silent Model Failures critical

Training-Serving Skew occurs when the code that computes features during training is different from the code that computes features in the production serving path. The model receives features computed differently than what it was trained on, producing incorrect predictions — silently.

Example: During training, your feature transaction_velocity is computed as "number of transactions in the last 60 minutes" using a Pandas window function. In production, the API engineer reimplements it as "number of transactions in the last 3600 seconds" using a Redis counter. Rounding differences, timezone handling, or edge cases cause subtly different values — enough to degrade the model but not enough to throw an error.

Root CauseTraining CodeServing CodeImpact
Different librariesPandas .rolling()SQL WINDOWFloating point differences
Different languagesPython preprocessingJava/Go preprocessingDifferent null handling
Stale featuresFeature computed on full historyFeature computed on last 24hDistribution mismatch
Missing transformsApplies StandardScalerForgets to scaleCompletely wrong predictions
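The fix for every row of this table is the same discipline: compute each feature in exactly one place. A minimal sketch of that idea, with a hypothetical `compute_transaction_velocity` that both the training pipeline and the serving API would import from a shared module:

```python
# Sketch of the discipline a feature store enforces: ONE shared function for
# each feature, imported by both the training pipeline and the serving API.
# The function name and window default are hypothetical.
from datetime import datetime, timedelta

def compute_transaction_velocity(timestamps: list[datetime],
                                 now: datetime,
                                 window: timedelta = timedelta(hours=1)) -> int:
    """Number of transactions in the trailing window: the ONLY definition."""
    return sum(1 for t in timestamps if now - window <= t <= now)

# Because training and serving call the same function, "last 60 minutes"
# can never silently diverge from a reimplemented "last 3600 seconds".
now = datetime(2026, 3, 30, 12, 0)
history = [now - timedelta(minutes=m) for m in (5, 20, 45, 90, 200)]
print(compute_transaction_velocity(history, now))  # 3: only the <=60min ones
```

A feature store takes this one step further by also centralizing where the computed values are stored and served.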

Solution: Feature Stores

A Feature Store (Feast, Tecton, Databricks Feature Store) ensures the same feature computation code is used for both training and serving. Features are computed once and stored, then served consistently to both the training pipeline and the production API.

# Using Feast — the most popular open-source Feature Store
# pip install feast

# feature_definitions.py — define features ONCE
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

# Entity — the "who" or "what" the features describe
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="Unique user identifier",
)

# Feature View — a group of related features computed from the same source
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    schema=[
        Field(name="transaction_velocity_1h", dtype=Float64),
        Field(name="avg_amount_7d", dtype=Float64),
        Field(name="unique_merchants_30d", dtype=Int64),
    ],
    source=FileSource(
        path="data/user_features.parquet",
        timestamp_field="event_timestamp",
    ),
)

# ── Training: fetch historical features ────────────────────────────────
# from feast import FeatureStore
# store = FeatureStore(repo_path=".")
# training_df = store.get_historical_features(
#     entity_df=entity_df,  # DataFrame with user_id + timestamp
#     features=["user_transaction_features:transaction_velocity_1h",
#               "user_transaction_features:avg_amount_7d"],
# ).to_df()

# ── Serving: fetch online features (same computation!) ─────────────────
# features = store.get_online_features(
#     features=["user_transaction_features:transaction_velocity_1h",
#               "user_transaction_features:avg_amount_7d"],
#     entity_rows=[{"user_id": "user_12345"}],
# ).to_dict()

Pitfall 2: Over-Engineering V1

You Don't Need Kubernetes for Your First Model common

Teams spend 3 months building a Kubeflow + Kubernetes + Istio + Argo stack before deploying a single model. Meanwhile, the business still has zero ML in production. The first model should ship in weeks, not months.

The reality: most ML use cases in year one can be handled by a cron-scheduled batch script or a single FastAPI container, exactly the V0 and V1 stages of the progression below.

The Right Progression

V0: Cron + Script → V1: FastAPI + Docker → V2: K8s + Auto-scaling → V3: Full MLOps Platform

Only move to the next level when you have evidence (traffic volume, team size, model count) that the current level is insufficient. Premature optimization is the root of all evil — in infrastructure too.
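For a sense of how little V0 requires, here is a sketch of the entire "deployment": one script that cron runs nightly. The stand-in rule model keeps the example self-contained; in practice you would unpickle a trained sklearn model and read from your data lake.

```python
# V0 sketch: the whole deployment is a script cron runs nightly, e.g.
#   0 2 * * * /usr/bin/python3 /opt/ml/score_daily.py   (illustrative entry)
# The rule-based model is a stand-in so this sketch is self-contained.
import csv
import io

def load_model():
    # Stand-in for: pickle.load(open("model.pkl", "rb"))
    return lambda amount: 1 if amount > 5000 else 0

def score_csv(model, csv_text: str) -> list[int]:
    """Score every row of a CSV with an 'amount' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [model(float(row["amount"])) for row in reader]

model = load_model()
scores = score_csv(model, "amount\n120.0\n7500.0\n300.0\n")
print(scores)  # [0, 1, 0]
```

When nightly scores stop being enough, and only then, wrap the same model in the FastAPI + Docker setup from Module 4.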

Pitfall 3: Ignoring the Baseline

Always Deploy a Baseline First important

A team spends 6 weeks building a Transformer-based model with 92% accuracy. They ship it and declare success. But nobody measured what a simple Logistic Regression achieves: 89% accuracy. The Transformer's 3% improvement may not justify 10x the inference cost and 50x the engineering complexity.

The baseline methodology:

LevelBaseline TypeExamplePurpose
0Heuristic / Rule-based"Flag if amount > $5000 and time is 2-5am"What can you achieve with zero ML?
1Simple ML modelLogistic Regression with top 10 featuresMinimum viable ML benchmark
2Moderate modelGradient Boosted Trees (XGBoost/LightGBM)Strong baseline with low complexity
3Complex modelDeep learning, TransformersOnly if Level 2 is insufficient

Deploy Level 0 first. Measure its business impact. Then deploy Level 1 and measure the incremental improvement. This prevents the scenario where a team has a great model but can't prove it's better than a simple if statement.
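The Level 0 row above fits in one function. This sketch encodes the table's example rule; the threshold interpretation (2am inclusive, 5am exclusive) is a choice made for illustration:

```python
# Level-0 baseline from the table above: a pure heuristic, zero ML.
# Deploy and measure this first; every model must beat it to justify itself.

def heuristic_fraud_flag(amount: float, hour_of_day: int) -> bool:
    """Flag if amount > $5000 and the transaction happens between 2am and 5am."""
    return amount > 5000 and 2 <= hour_of_day < 5

print(heuristic_fraud_flag(7500.0, 3))   # True:  large late-night transaction
print(heuristic_fraud_flag(7500.0, 14))  # False: large but mid-afternoon
print(heuristic_fraud_flag(120.0, 3))    # False: late-night but small
```

Its precision and recall on live traffic become the floor every Level 1+ model must clear.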

Best Practices Summary

Version Everything
Code (Git) + Data (DVC) + Models (MLflow) + Config (params.yaml) + Environment (Docker). If any piece is missing, you can't reproduce the model.
Automate the Pipeline
If retraining requires a human to run cells in a notebook, it will break. Use DVC Pipelines, Airflow, or GitHub Actions to make retraining a single command.
Monitor Inputs, Not Just Outputs
Ground truth labels are often delayed (fraud confirmed after 30 days). Monitor input feature distributions to detect drift before it impacts business metrics.
Ship Simple, Then Iterate
A Logistic Regression in production beats a Transformer in a notebook. Ship a baseline, measure impact, then improve. The business doesn't care about your SOTA benchmark.
Use Feature Stores for Consistency
One source of truth for feature computation eliminates training-serving skew — the #1 silent killer of model accuracy in production.
Tag Models with Git Commits
Every deployed model should be traceable back to a specific Git commit, dataset version, and training run. Use MLflow + Git tags + DVC for full lineage.

Reference Links