MLOps & Model Deployment Handbook
Production-grade patterns for versioning data, tracking experiments, deploying models, and monitoring inference — because MLOps is not just DevOps for ML. It handles a fundamentally new axis of variability: data and model weights.
Module 1 — Versioning Data & Models
The Problem: Why Git Fails for ML Artifacts
Git was designed for text files — source code, configs, markdown. It stores every version of every file in the .git directory. When you try to commit a 50GB training dataset or a 2GB .pt PyTorch checkpoint, three things go wrong immediately:
| Problem | What Happens | Impact |
|---|---|---|
| Repository bloat | Every version of the dataset is stored in .git/objects | A 50GB dataset with 10 versions → 500GB repo |
| Clone times | git clone downloads the entire history | New team members wait hours to clone |
| GitHub/GitLab limits | File size limits (100MB GitHub, 5GB GitLab LFS) | Push rejected; workflow broken |
| No diffing | Git can't meaningfully diff binary files | git log becomes useless for data changes |
Try `git add training_data.parquet` on a 12GB file: Git computes a SHA-1 hash of the entire file, copies it into .git/objects, and the repo is permanently 12GB larger — even if you delete the file in the next commit. You'd need git filter-branch or BFG Repo Cleaner to fix it.
Git LFS — A Partial Solution (and Why It's Not Enough)
Git Large File Storage (LFS) replaces large files with lightweight pointers inside the Git repo, while storing the actual file contents on a remote server. This helps — but it has critical limitations for ML:
- No data pipeline integration: LFS doesn't know about your training pipeline or which dataset version produced which model.
- Costly hosting: GitHub LFS bandwidth and storage have strict quotas and overage charges.
- No experiment linking: LFS can't connect a specific dataset version to a specific experiment result.
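For context on how LFS works under the hood: after `git lfs track "*.pt"`, Git commits a small text stub in place of the real file, in the LFS pointer-file format (the hash and size below are illustrative, not from a real file):

```text
version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 2147483648
```

DVC's `.dvc` pointer files (shown later in this module) follow the same idea — a tiny, diffable stub in Git pointing at content-addressed storage elsewhere.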
DVC — Data Version Control
DVC (Data Version Control) solves this by working alongside Git. The architecture is elegant:
Git stores: Small pointer files (.dvc files) containing the MD5 hash of the actual data.
DVC stores: The actual heavy files (datasets, model weights) in a configured remote (S3, GCS, Azure Blob, local NAS).
Result: Your repo stays lightweight. Your data is versioned, shareable, and reproducible.
Step-by-Step: Initialize DVC in a Project
```bash
# Step 1: Create a new project and initialize Git + DVC
mkdir ml-fraud-detection && cd ml-fraud-detection
git init
pip install "dvc[s3]"   # Install DVC with S3 support (also: dvc[gs], dvc[azure])
dvc init                # Creates .dvc/ directory and .dvcignore

# DVC init creates these files automatically:
#   .dvc/.gitignore — prevents DVC cache from entering Git
#   .dvc/config     — DVC configuration
#   .dvcignore      — like .gitignore but for DVC

git add .dvc .dvcignore
git commit -m "Initialize DVC"
```
Step 2: Add a Large Dataset to DVC
```bash
# Assume we have a 12GB raw dataset
ls -lh data/
# data/transactions_2024.parquet  12GB
# data/labels.csv                 450MB

# Track the entire data/ directory with DVC
dvc add data/

# This creates two things:
#   1. data.dvc — a small pointer file (YAML) with the MD5 hash
#   2. data/.gitignore — prevents Git from tracking the actual data files

# Let's inspect the pointer file:
cat data.dvc
# outs:
# - md5: a1b2c3d4e5f6...   (hash of the data directory)
#   size: 12884901888
#   nfiles: 2
#   path: data

# Now commit the POINTER file to Git (not the data itself)
git add data.dvc data/.gitignore
git commit -m "Track training data v1 with DVC"
```
Step 3: Configure an S3 Remote and Push
```bash
# Configure S3 as the DVC remote storage
dvc remote add -d myremote s3://my-mlops-bucket/dvc-store
# -d flag makes this the default remote

# Optional: set S3 region and endpoint (for MinIO or other S3-compatible stores)
dvc remote modify myremote region us-east-1

# Push the actual data files to S3
dvc push
# Uploading data/transactions_2024.parquet to s3://my-mlops-bucket/dvc-store/...
# 2 files pushed

# Commit the remote configuration
git add .dvc/config
git commit -m "Configure S3 remote for DVC storage"
git push origin main
```
Step 4: Reproduce on Another Machine
```bash
# A teammate clones the repo — it's fast because no heavy files in Git
git clone https://github.com/team/ml-fraud-detection.git
cd ml-fraud-detection

# Pull the actual data from S3 using the pointer files
dvc pull
# Downloading data/transactions_2024.parquet from s3://...
# 2 files fetched, 2 files updated

# The data directory is now identical to the original — same hash, same bytes
```
Versioning Model Weights
```bash
# After training, track the model checkpoint
dvc add models/fraud_detector_v1.pt   # 2.1GB PyTorch model
git add models/fraud_detector_v1.pt.dvc models/.gitignore
git commit -m "Add trained model v1 — F1=0.87, AUC=0.94"
git tag -a "model-v1" -m "Baseline fraud detector"
dvc push

# Later, retrain with new data → track v2
dvc add models/fraud_detector_v2.pt
git add models/fraud_detector_v2.pt.dvc
git commit -m "Model v2 — F1=0.91, AUC=0.96 (added transaction velocity features)"
git tag -a "model-v2" -m "Improved with velocity features"
dvc push

# Roll back to v1 if v2 degrades in production
git checkout model-v1
dvc checkout   # Restores the v1 model weights from cache/remote
```
Because every model release is a Git tag plus a DVC pointer, you can `git checkout` any tag and `dvc checkout` to get the exact data and model from that point in time. This is the foundation of ML reproducibility.
DVC Commands Cheat Sheet
| Command | Purpose | Git Equivalent |
|---|---|---|
| dvc init | Initialize DVC in a Git repo | git init |
| dvc add <file> | Start tracking a file/directory | git add |
| dvc push | Upload tracked files to remote | git push |
| dvc pull | Download tracked files from remote | git pull |
| dvc checkout | Sync data files to match current Git commit | git checkout |
| dvc diff | Show changes in tracked data | git diff |
| dvc remote add | Configure storage backend | git remote add |
| dvc gc | Garbage collect unused cache | git gc |
Module 2 — Experiment Tracking & Model Registry
The Messy Notebook Problem
Every data science team has experienced this: a shared drive or repo full of files named model_v1.ipynb, model_v2_fixed.ipynb, model_v3_FINAL.ipynb, model_v3_FINAL_actually_final.ipynb. Inside each notebook, hyperparameters are hardcoded, results are printed to stdout, and there's no way to compare runs systematically.
Without experiment tracking, you lose: (1) which hyperparameters produced which metrics, (2) which dataset version was used, (3) which code version was run, (4) system metrics (GPU utilization, memory), and (5) the ability to compare 50 runs side-by-side.
Experiment tracking tools solve this by automatically logging every detail of every training run into a centralized database with a UI for comparison, filtering, and analysis.
Weights & Biases (W&B)
W&B is a hosted experiment tracking platform that provides real-time dashboards, hyperparameter sweeps, model versioning, and team collaboration. It is one of the most widely used tracking tools in both research and production ML teams.
W&B Integration with a Scikit-Learn Training Loop
```python
# pip install wandb scikit-learn pandas
import wandb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
import joblib

# ── Step 1: Initialize a W&B run ──────────────────────────────────────
# Every call to wandb.init() creates a new "run" in your W&B project.
# All logs, metrics, and artifacts are grouped under this run.
run = wandb.init(
    project="fraud-detection",   # Groups runs under one project dashboard
    name="rf-baseline-v1",       # Human-readable name for this specific run
    config={                     # Hyperparameters — logged and searchable
        "model_type": "RandomForest",
        "n_estimators": 200,
        "max_depth": 15,
        "min_samples_split": 5,
        "class_weight": "balanced",   # Important for imbalanced fraud data
        "dataset_version": "v2.1",
        "feature_count": 47,
        "test_size": 0.2,
    },
    tags=["baseline", "sklearn", "fraud"],
)

# ── Step 2: Load data and split ───────────────────────────────────────
df = pd.read_parquet("data/transactions_2024.parquet")
X = df.drop(columns=["is_fraud"])
y = df["is_fraud"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=run.config["test_size"], stratify=y, random_state=42
)

# Log dataset statistics — helps debug data issues later
wandb.log({
    "dataset/total_samples": len(df),
    "dataset/fraud_ratio": y.mean(),
    "dataset/train_size": len(X_train),
    "dataset/test_size": len(X_test),
})

# ── Step 3: Train the model ───────────────────────────────────────────
model = RandomForestClassifier(
    n_estimators=run.config["n_estimators"],
    max_depth=run.config["max_depth"],
    min_samples_split=run.config["min_samples_split"],
    class_weight=run.config["class_weight"],
    random_state=42,
    n_jobs=-1,   # Use all CPU cores
)
model.fit(X_train, y_train)

# ── Step 4: Evaluate and log metrics ──────────────────────────────────
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

# wandb.log() sends metrics to the dashboard in real-time
wandb.log({
    "metrics/f1_score": f1,
    "metrics/auc_roc": auc,
    "metrics/accuracy": model.score(X_test, y_test),
})

# Log feature importances as a W&B table
feature_importance = pd.DataFrame({
    "feature": X.columns,
    "importance": model.feature_importances_,
}).sort_values("importance", ascending=False).head(20)
wandb.log({"feature_importances": wandb.Table(dataframe=feature_importance)})

# ── Step 5: Save model as W&B Artifact ────────────────────────────────
# Artifacts provide versioned, immutable model storage
joblib.dump(model, "models/rf_fraud_v1.joblib")
artifact = wandb.Artifact(
    name="fraud-detector",
    type="model",
    description="Random Forest baseline for fraud detection",
    metadata={"f1": f1, "auc": auc},
)
artifact.add_file("models/rf_fraud_v1.joblib")
run.log_artifact(artifact)

# ── Step 6: Close the run ─────────────────────────────────────────────
run.finish()
print(f"Run complete — F1: {f1:.4f}, AUC: {auc:.4f}")
```
W&B with a PyTorch Training Loop
```python
# pip install wandb torch
import wandb
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Initialize with PyTorch-specific config
run = wandb.init(
    project="fraud-detection",
    name="mlp-v1",
    config={
        "model_type": "MLP",
        "hidden_dim": 256,
        "learning_rate": 1e-3,
        "epochs": 50,
        "batch_size": 512,
        "optimizer": "AdamW",
        "weight_decay": 1e-4,
    },
)

# ... (model definition and data loading omitted for brevity) ...

# wandb.watch() — auto-logs gradients and parameter histograms.
# Call this BEFORE the training loop for full gradient tracking:
# wandb.watch(model, log="all", log_freq=100)

# Training loop with W&B logging
for epoch in range(run.config["epochs"]):
    model.train()
    epoch_loss = 0.0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        preds = model(batch_X)
        loss = criterion(preds, batch_y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / len(train_loader)

    # Log per-epoch metrics — these appear as interactive charts in W&B
    wandb.log({
        "epoch": epoch,
        "train/loss": avg_loss,
        "train/learning_rate": optimizer.param_groups[0]["lr"],
    })

    # Validation every 5 epochs
    if epoch % 5 == 0:
        val_loss, val_f1 = evaluate(model, val_loader)   # Custom eval function
        wandb.log({
            "val/loss": val_loss,
            "val/f1_score": val_f1,
        })

run.finish()
```
MLflow — Model Registry & Lifecycle
While W&B excels at experiment tracking and visualization, MLflow is the open-source standard for the model lifecycle: logging models, versioning them, and transitioning them through stages (Staging → Production → Archived). Many teams use both: W&B for experiment dashboards and MLflow for model registry and deployment.
| Capability | Weights & Biases | MLflow |
|---|---|---|
| Experiment tracking UI | Best-in-class dashboards and comparison | Good, but more basic |
| Model Registry | Artifact-based, less structured | First-class with staging/production |
| Hosting | SaaS (hosted) — free tier available | Self-hosted (or Databricks managed) |
| Cost | Free for individuals; paid for teams | Free and open-source |
| Deployment | Not a deployment tool | Built-in mlflow models serve |
| Best for | Research teams, rapid prototyping | Production pipelines, model governance |
MLflow: Log, Register, and Promote a Model
```python
# pip install mlflow scikit-learn
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
import pandas as pd

# ── Step 1: Set the tracking URI ──────────────────────────────────────
# MLflow stores runs in a "tracking server" — local or remote
mlflow.set_tracking_uri("http://mlflow-server.internal:5000")
# Or "sqlite:///mlflow.db" for local
mlflow.set_experiment("fraud-detection")

# ── Step 2: Start a run and log everything ────────────────────────────
with mlflow.start_run(run_name="gbm-tuned-v3") as run:
    # Log hyperparameters
    params = {
        "n_estimators": 300,
        "max_depth": 8,
        "learning_rate": 0.05,
        "subsample": 0.8,
        "min_samples_leaf": 20,
    }
    mlflow.log_params(params)

    # Load data and train
    df = pd.read_parquet("data/transactions_2024.parquet")
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=["is_fraud"]),
        df["is_fraud"],
        test_size=0.2,
        stratify=df["is_fraud"],
        random_state=42,
    )
    model = GradientBoostingClassifier(**params, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate and log metrics
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    metrics = {
        "f1_score": f1_score(y_test, y_pred),
        "auc_roc": roc_auc_score(y_test, y_prob),
        "accuracy": model.score(X_test, y_test),
    }
    mlflow.log_metrics(metrics)

    # ── Step 3: Log the trained model ─────────────────────────────────
    # mlflow.sklearn.log_model serializes the model with its conda/pip
    # environment, so it can be loaded anywhere without dependency issues
    mlflow.sklearn.log_model(
        sk_model=model,
        artifact_path="model",                    # Subdirectory in the run artifacts
        registered_model_name="fraud-detector",   # Auto-registers in Model Registry
        input_example=X_test.iloc[:5],            # Saves a sample for schema inference
    )

    # Log additional artifacts (feature importance plot, confusion matrix, etc.)
    mlflow.log_artifact("reports/confusion_matrix.png")

print(f"Run ID: {run.info.run_id}")
print(f"F1: {metrics['f1_score']:.4f}, AUC: {metrics['auc_roc']:.4f}")
```
Transitioning Model Stages in the Registry
```python
# pip install mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow-server.internal:5000")

# ── List all versions of the "fraud-detector" model ───────────────────
for mv in client.search_model_versions("name='fraud-detector'"):
    print(f"Version {mv.version}: stage={mv.current_stage}, run_id={mv.run_id}")

# ── Promote version 3 to Staging for A/B testing ──────────────────────
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Staging",
    archive_existing_versions=False,   # Keep current Staging version for comparison
)

# ── After validation, promote to Production ───────────────────────────
client.transition_model_version_stage(
    name="fraud-detector",
    version=3,
    stage="Production",
    archive_existing_versions=True,    # Archive previous Production version
)

# ── Load the Production model anywhere in your codebase ───────────────
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")
predictions = model.predict(new_data)   # Works with any MLflow-logged model flavor
```
Module 3 — Continuous Training & Pipelines
From Monolithic Script to DAG
Most ML projects start as a single Python script or Jupyter notebook: load data, preprocess, train, evaluate — all in one file. This seems simple, but it has a fundamental problem: when you change one thing, you have to re-run everything.
Imagine your training pipeline takes 4 hours. You fix a bug in the evaluation metric. Without a pipeline, you rerun all 4 hours. With a pipeline, only the evaluation step reruns (2 minutes).
An ML pipeline is a DAG — a directed acyclic graph of steps where each step declares its dependencies (inputs) and outputs. The pipeline runner (DVC, Airflow, Kubeflow) computes which steps need rerunning based on what changed.
Each step is a standalone script with clearly defined inputs and outputs. If you change feature_eng.py, only feature_eng, train, and evaluate rerun. data_prep is skipped because its inputs haven't changed.
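The skip-logic behind every pipeline runner is the same: record a hash of each stage's dependencies after a successful run, and rerun only when a hash changes. Here is a minimal, hypothetical sketch of that idea — the `Stage` class and the `dep.txt` file are invented for illustration, not part of any runner's API:

```python
import hashlib
from pathlib import Path

def file_hash(path: str) -> str:
    """MD5 of a file's contents — the same kind of hash DVC records in dvc.lock."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()

class Stage:
    def __init__(self, name, deps, run):
        self.name, self.deps, self.run = name, deps, run
        self.last_hashes = None  # Hashes recorded after the last successful run

    def repro(self) -> bool:
        """Run the stage only if some dependency hash changed; return True if it ran."""
        current = {d: file_hash(d) for d in self.deps}
        if current == self.last_hashes:
            print(f"Stage '{self.name}' didn't change, skipping")
            return False
        print(f"Running stage '{self.name}'")
        self.run()
        self.last_hashes = current
        return True

# Demo with a throwaway dependency file
Path("dep.txt").write_text("v1")
stage = Stage("train", deps=["dep.txt"], run=lambda: None)
stage.repro()                      # Runs: no recorded hashes yet
stage.repro()                      # Skips: dependency unchanged
Path("dep.txt").write_text("v2")   # Simulate editing a script or dataset
stage.repro()                      # Runs again: dependency hash changed
```

A real runner does this per stage over the whole DAG, so a change propagates only to downstream stages whose inputs are affected.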
DVC Pipelines — dvc.yaml
DVC Pipelines define your ML workflow in a dvc.yaml file. Each stage declares its command, dependencies (scripts + data), parameters, and outputs. DVC computes MD5 hashes of everything and only reruns stages where something changed.
Complete dvc.yaml Pipeline
```yaml
# dvc.yaml — defines the full ML training pipeline
stages:
  # Stage 1: Data Preparation
  data_prep:
    cmd: python src/data_prep.py
    deps:                     # If any dependency changes, this stage reruns
      - src/data_prep.py      # The script itself
      - data/raw/             # Raw data directory
    params:                   # Read from params.yaml — changes trigger rerun
      - data_prep.test_size
      - data_prep.random_seed
    outs:                     # Stage outputs — cached and tracked by DVC
      - data/processed/train.parquet
      - data/processed/test.parquet

  # Stage 2: Feature Engineering
  feature_engineering:
    cmd: python src/feature_eng.py
    deps:
      - src/feature_eng.py
      - data/processed/train.parquet
      - data/processed/test.parquet
    params:
      - features.velocity_window
      - features.amount_bins
    outs:
      - data/features/train_features.parquet
      - data/features/test_features.parquet

  # Stage 3: Model Training
  train:
    cmd: python src/train.py
    deps:
      - src/train.py
      - data/features/train_features.parquet
    params:
      - train.n_estimators
      - train.max_depth
      - train.learning_rate
    outs:
      - models/model.joblib   # Trained model — cached for reproduction

  # Stage 4: Evaluation
  evaluate:
    cmd: python src/evaluate.py
    deps:
      - src/evaluate.py
      - models/model.joblib
      - data/features/test_features.parquet
    metrics:                  # Special: tracked by DVC for comparison
      - reports/metrics.json:
          cache: false        # Don't cache — always regenerate
    plots:                    # Generates plots viewable with dvc plots
      - reports/confusion_matrix.csv:
          x: predicted
          y: actual
```
The params.yaml Configuration File
```yaml
# params.yaml — single source of truth for all hyperparameters
data_prep:
  test_size: 0.2
  random_seed: 42

features:
  velocity_window: 3600   # Seconds — transaction velocity feature
  amount_bins: 10         # Number of bins for amount discretization

train:
  n_estimators: 300
  max_depth: 8
  learning_rate: 0.05
```
Running and Reproducing the Pipeline
```bash
# Run the full pipeline — DVC determines what needs (re)running
dvc repro
# Output (first run — everything runs):
#   Running stage 'data_prep':
#   > python src/data_prep.py
#   Running stage 'feature_engineering':
#   > python src/feature_eng.py
#   Running stage 'train':
#   > python src/train.py
#   Running stage 'evaluate':
#   > python src/evaluate.py

# ── Now change a hyperparameter: ──────────────────────────────────────
# Edit params.yaml → train.n_estimators: 500
dvc repro
# Output (only affected stages rerun):
#   Stage 'data_prep' didn't change, skipping
#   Stage 'feature_engineering' didn't change, skipping
#   Running stage 'train':        ← only train + evaluate rerun
#   > python src/train.py
#   Running stage 'evaluate':
#   > python src/evaluate.py

# ── Compare metrics across experiments: ───────────────────────────────
dvc metrics diff
# Path             Metric    Old   New
# reports/metrics  f1_score  0.87  0.91
# reports/metrics  auc_roc   0.94  0.96

# ── View plots: ───────────────────────────────────────────────────────
dvc plots show reports/confusion_matrix.csv
```
After each run, DVC records the hash of every dependency and output in dvc.lock. On dvc repro, it recomputes hashes and only reruns stages where at least one dependency hash has changed. This is the same principle as make — but for ML.
Triggering Pipelines Automatically (CI/CD)
```yaml
# .github/workflows/retrain.yml — GitHub Actions pipeline trigger
name: Retrain Model

on:
  schedule:
    - cron: "0 2 * * 1"   # Every Monday at 2am — weekly retraining
  push:
    paths:
      - "data/**"         # Retrain when data changes
      - "src/**"          # Retrain when code changes
      - "params.yaml"     # Retrain when hyperparams change

jobs:
  retrain:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: dvc pull    # Fetch data from S3
      - run: dvc repro   # Run the pipeline
      - run: dvc push    # Push new outputs back to S3
      - run: |
          git add dvc.lock reports/metrics.json
          git commit -m "[CI] Retrain model — $(date +%Y-%m-%d)"
          git push
```
Module 4 — Deployment Environments & Model Serving
A trained model sitting in a .joblib file is worthless to the business. It needs to be served — accessible to other systems to make predictions. There are four primary deployment patterns, each suited to different latency, throughput, and cost requirements.
Real-Time Serving with FastAPI
FastAPI is the standard Python framework for serving ML models via REST APIs. It's fast (built on Starlette + Uvicorn), auto-generates OpenAPI docs, and supports Pydantic for request/response validation — preventing malformed input from reaching your model.
Production FastAPI Model Server
```python
# app/main.py — FastAPI model serving endpoint
# pip install fastapi uvicorn mlflow pydantic
import os
import logging
from contextlib import asynccontextmanager

import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

logger = logging.getLogger(__name__)

# ── Request / Response schemas ────────────────────────────────────────
# Pydantic validates every incoming request automatically.
# If a field is missing or the wrong type, FastAPI returns a 422 error.
class TransactionInput(BaseModel):
    amount: float = Field(gt=0, description="Transaction amount in USD")
    merchant_category: str = Field(description="MCC code (e.g., 'grocery', 'travel')")
    hour_of_day: int = Field(ge=0, le=23)
    day_of_week: int = Field(ge=0, le=6)
    transaction_velocity: float = Field(ge=0, description="Transactions in last hour")
    distance_from_home: float = Field(ge=0, description="Miles from user's home address")

class PredictionOutput(BaseModel):
    is_fraud: bool
    fraud_probability: float
    model_version: str

# ── Model loading at startup ──────────────────────────────────────────
# Load the model ONCE at startup, not per-request.
# The lifespan context manager ensures cleanup on shutdown.
ml_models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load model from MLflow Model Registry — "Production" stage
    model_uri = os.getenv("MODEL_URI", "models:/fraud-detector/Production")
    logger.info(f"Loading model from: {model_uri}")
    ml_models["fraud_detector"] = mlflow.pyfunc.load_model(model_uri)
    ml_models["model_version"] = model_uri
    logger.info("Model loaded successfully")
    yield
    # Cleanup on shutdown
    ml_models.clear()

app = FastAPI(
    title="Fraud Detection API",
    version="1.0.0",
    lifespan=lifespan,
)

# ── Health check endpoint ─────────────────────────────────────────────
@app.get("/health")
async def health():
    return {"status": "healthy", "model_loaded": "fraud_detector" in ml_models}

# ── Prediction endpoint ───────────────────────────────────────────────
@app.post("/predict", response_model=PredictionOutput)
async def predict(transaction: TransactionInput):
    try:
        # Convert the validated Pydantic model to a DataFrame;
        # MLflow models expect pandas DataFrames as input
        input_df = pd.DataFrame([transaction.model_dump()])

        # Get prediction — model.predict returns a numpy array
        model = ml_models["fraud_detector"]
        probability = float(model.predict(input_df)[0])

        return PredictionOutput(
            is_fraud=probability > 0.5,
            fraud_probability=round(probability, 4),
            model_version=ml_models["model_version"],
        )
    except Exception as e:
        logger.error(f"Prediction error: {e}")
        raise HTTPException(status_code=500, detail="Prediction failed")

# ── Run with: uvicorn app.main:app --host 0.0.0.0 --port 8000 ─────────
```
Testing the API
```bash
# Test with curl
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "amount": 4999.99,
    "merchant_category": "electronics",
    "hour_of_day": 3,
    "day_of_week": 2,
    "transaction_velocity": 12.0,
    "distance_from_home": 850.5
  }'

# Response:
# {
#   "is_fraud": true,
#   "fraud_probability": 0.9213,
#   "model_version": "models:/fraud-detector/Production"
# }
```
Containerization — Production Dockerfile
Packaging your FastAPI model server into a Docker container ensures it runs identically everywhere — your laptop, staging, production, Kubernetes. The key challenge with ML containers is image size: PyTorch + CUDA can exceed 8GB. Use multi-stage builds and CPU-only variants to keep images lean.
Production Dockerfile for ML
```dockerfile
# Dockerfile — Production ML model server
# Multi-stage build to minimize final image size

# ── Stage 1: Builder (install dependencies) ───────────────────────────
FROM python:3.11-slim AS builder

# Prevents Python from writing .pyc files and enables unbuffered output
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /build

# Install dependencies in a virtual environment for clean copying
RUN python -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Copy and install requirements FIRST (Docker layer caching)
# This layer is cached unless requirements.txt changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ── Stage 2: Runtime (minimal image) ──────────────────────────────────
FROM python:3.11-slim AS runtime

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

# Copy the virtual environment from builder
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"

# Create a non-root user for security
RUN groupadd -r mluser && useradd -r -g mluser mluser

WORKDIR /app

# Copy application code
COPY app/ ./app/
COPY models/ ./models/

# Switch to non-root user
USER mluser

# Expose the port
EXPOSE 8000

# Health check — Docker/K8s uses this to determine container health
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run with Uvicorn — 4 worker processes for production
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```
requirements.txt (CPU-optimized)
```text
# requirements.txt — use CPU-only PyTorch to save ~5GB in image size
fastapi==0.115.0
uvicorn[standard]==0.30.0
mlflow==2.16.0
pandas==2.2.0
scikit-learn==1.5.0
pydantic==2.9.0

# For PyTorch models, use the CPU-only variant:
# --extra-index-url https://download.pytorch.org/whl/cpu
# torch==2.4.0+cpu
```
Build and Run
```bash
# Build the image (tag with model version for traceability)
docker build -t fraud-detector:v3-2026.03 .

# Run locally for testing
docker run -p 8000:8000 \
  -e MODEL_URI="models:/fraud-detector/Production" \
  -e MLFLOW_TRACKING_URI="http://mlflow-server:5000" \
  fraud-detector:v3-2026.03

# Check image size — target under 1GB for sklearn models
docker images fraud-detector
# REPOSITORY       TAG          SIZE
# fraud-detector   v3-2026.03   487MB   ← Good: sklearn + FastAPI
```
Keep ML images lean: start from python:3.11-slim (not python:3.11), install CPU-only variants of PyTorch/TensorFlow, and use multi-stage builds. Target: <500MB for sklearn, <2GB for PyTorch CPU.
Batch Predictions
Not every prediction needs to happen in real-time. Batch inference — scoring millions of records on a schedule — is often cheaper, simpler, and more reliable than maintaining a real-time API.
| Aspect | Real-Time API | Batch Predictions |
|---|---|---|
| Latency | <100ms per request | Minutes to hours for entire dataset |
| Infrastructure | Always-on servers, load balancers, auto-scaling | Spin up, score, spin down (ephemeral) |
| Cost | 24/7 compute costs | Pay only when the job runs |
| Error handling | Must handle failures in real-time | Retry the entire job if it fails |
| Use cases | Fraud at checkout, chatbots | Daily churn scoring, weekly recommendations, marketing segmentation |
PySpark Batch Scoring Example
```python
# batch_score.py — Score 10M users with PySpark
# Runs as a scheduled Databricks job or Airflow task
import mlflow.pyfunc
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("batch-fraud-scoring").getOrCreate()

# Load the Production model from MLflow
model = mlflow.pyfunc.load_model("models:/fraud-detector/Production")

# Load the user transaction data (e.g., from a Delta Lake / S3 path)
transactions_df = spark.read.parquet("s3://data-lake/transactions/2026-03-30/")

# Define a Pandas UDF for distributed scoring across the Spark cluster
@pandas_udf("double")
def score_batch(amount: pd.Series, velocity: pd.Series, distance: pd.Series) -> pd.Series:
    # This function runs on each Spark executor in parallel
    input_df = pd.DataFrame({
        "amount": amount,
        "transaction_velocity": velocity,
        "distance_from_home": distance,
    })
    return pd.Series(model.predict(input_df))

# Score all 10M rows — distributed across the Spark cluster
scored_df = transactions_df.withColumn(
    "fraud_probability",
    score_batch("amount", "transaction_velocity", "distance_from_home"),
)

# Write results back to the data lake for downstream consumers
scored_df.write.mode("overwrite").parquet("s3://data-lake/scores/fraud/2026-03-30/")
```
Edge & Browser Deployment
For ultra-low-latency or offline scenarios, models can be deployed directly to edge devices or web browsers:
| Target | Format | Tool | Use Case |
|---|---|---|---|
| Mobile (Android/iOS) | TFLite, Core ML | TensorFlow Lite, Core ML Tools | On-device image classification, voice commands |
| IoT / Embedded | ONNX, TFLite Micro | ONNX Runtime, TF Micro | Sensor anomaly detection, predictive maintenance |
| Web Browser | ONNX, TF.js | Transformers.js, ONNX Runtime Web | Client-side text classification, image segmentation |
```python
# Convert a PyTorch model to ONNX for cross-platform deployment
import torch

# Assume `model` is a trained PyTorch model and `dummy_input` matches input shape
dummy_input = torch.randn(1, 47)   # Batch of 1, 47 features
torch.onnx.export(
    model,
    dummy_input,
    "models/fraud_detector.onnx",
    input_names=["features"],
    output_names=["fraud_probability"],
    dynamic_axes={"features": {0: "batch_size"}},   # Allow variable batch size
    opset_version=17,
)

# The .onnx file can now run on:
#   - ONNX Runtime (Python, C++, C#, Java)
#   - ONNX Runtime Web (browser via WebAssembly)
#   - ONNX Runtime Mobile (Android, iOS)
```
Module 5 — Model Monitoring & Observability
Silent Failures: The Invisible Problem
When a web server fails, it crashes. You get a 500 error, a stack trace, and an alert fires at 3am. When an ML model fails, nothing crashes. The model still returns predictions — they're just wrong. A fraud detection model might start approving fraudulent transactions. A recommendation engine might start surfacing irrelevant products. The API returns 200 OK every time.
Traditional monitoring (latency, error rates, CPU) catches infrastructure failures. But an ML model can have perfect latency and zero errors while its predictions are completely wrong. You need statistical monitoring — tracking the distribution of inputs and outputs against the training baseline.
This is why ML monitoring is fundamentally different from software monitoring. You need to track:
- Input distributions: Are the features the model receives in production similar to training data?
- Output distributions: Is the model's confidence/prediction distribution shifting?
- Ground truth feedback: When labels become available (delayed), has actual model accuracy degraded?
Data Drift vs. Concept Drift
These are the two primary modes of ML model degradation. Understanding the difference is essential for building the right monitoring and retraining strategy.
Data drift — the distribution of inputs P(X) shifts away from what the model saw in training.
Example: Your fraud model was trained when the average transaction was $50. After a holiday season, the average jumps to $200. The model receives inputs it rarely saw during training.
Detection: Compare input feature distributions (mean, std, quantiles) between training data and live production data.
Fix: Retrain on recent data that includes the new distribution.
Concept Drift: the relationship between inputs and target P(Y|X) changes, so the same inputs now mean something different.
Example: During COVID-19, millions of legitimate users suddenly started making large online purchases from home. The patterns that previously indicated fraud (large online orders, unusual locations) became normal behavior.
Detection: Monitor prediction accuracy against delayed ground truth labels. Statistical tests on P(Y|X).
Fix: Retrain with relabeled data that reflects the new reality.
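The delayed-ground-truth check above can be sketched as a sliding-window accuracy monitor. This is a minimal illustration; the class name, window size, and threshold are hypothetical:

```python
from collections import deque

class AccuracyMonitor:
    """Rolling accuracy over delayed ground-truth labels (hypothetical design)."""

    def __init__(self, window: int = 500, alert_below: float = 0.90):
        self.window = deque(maxlen=window)  # stores True/False per prediction
        self.alert_below = alert_below

    def record(self, predicted: int, actual: int) -> bool:
        """Record one labeled outcome; return True if an alert should fire."""
        self.window.append(predicted == actual)
        full = len(self.window) == self.window.maxlen
        return full and sum(self.window) / len(self.window) < self.alert_below
```

Feed it each (prediction, label) pair as labels arrive, typically days later; a sustained drop in rolling accuracy with stable input distributions is the classic signature of concept drift.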
Monitoring with Evidently AI
Evidently AI is an open-source library for calculating data drift, prediction drift, and model performance metrics. It generates HTML reports and can export metrics to Prometheus/Grafana for real-time dashboards.
```python
# pip install evidently pandas
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, TargetDriftPreset
from evidently.metrics import (
    DatasetDriftMetric,
    DataDriftTable,
    ColumnDriftMetric,
)

# ── Load reference (training) data and current (production) data ───────
# The reference dataset is the data the model was trained on.
# The current dataset is a sample of recent production inference data.
reference_data = pd.read_parquet("data/training_data_sample.parquet")
current_data = pd.read_parquet("data/production_data_last_7_days.parquet")

# ── Generate a Data Drift Report ───────────────────────────────────────
# Evidently compares the statistical distribution of every feature
# between reference and current using appropriate statistical tests:
#   - Kolmogorov-Smirnov test for numerical features
#   - Chi-squared test for categorical features
#   - Jensen-Shannon divergence for large datasets
drift_report = Report(metrics=[
    DatasetDriftMetric(),   # Overall: "Has the dataset drifted?"
    DataDriftTable(),       # Per-feature drift results with p-values
    ColumnDriftMetric(      # Deep dive on a critical feature
        column_name="amount"
    ),
])
drift_report.run(
    reference_data=reference_data,
    current_data=current_data,
)

# Save as an interactive HTML report
drift_report.save_html("reports/data_drift_report.html")

# ── Extract metrics programmatically for alerting ──────────────────────
result = drift_report.as_dict()
dataset_drift = result["metrics"][0]["result"]
print(f"Dataset drift detected: {dataset_drift['dataset_drift']}")
print(f"Share of drifted features: {dataset_drift['share_of_drifted_columns']:.1%}")
print(f"Number of drifted columns: {dataset_drift['number_of_drifted_columns']}")

# ── Trigger alert if drift exceeds threshold ───────────────────────────
if dataset_drift["share_of_drifted_columns"] > 0.3:
    # More than 30% of features drifted
    print("⚠ ALERT: Significant data drift detected — consider retraining")
    # In production: send to PagerDuty, Slack, or trigger retrain pipeline
```
Prometheus + Grafana for Real-Time Monitoring
For real-time monitoring of a deployed model API, export custom metrics to Prometheus and visualize them in Grafana dashboards.
```python
# app/metrics.py — Custom Prometheus metrics for ML model monitoring
# pip install prometheus-client
from prometheus_client import (
    Counter,
    Histogram,
    Gauge,
    Summary,
    generate_latest,
    CONTENT_TYPE_LATEST,
)
from starlette.responses import Response

# ── Define ML-specific metrics ─────────────────────────────────────────

# Count predictions by outcome — alerts if fraud ratio spikes
prediction_counter = Counter(
    "model_predictions_total",
    "Total predictions made",
    ["outcome"],  # Labels: "fraud" or "legitimate"
)

# Track prediction confidence distribution
prediction_confidence = Histogram(
    "model_prediction_confidence",
    "Distribution of model confidence scores",
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99],
)

# Track input feature distributions (for drift detection)
input_amount = Summary(
    "model_input_amount",
    "Distribution of transaction amounts seen by the model",
)

# Track inference latency
inference_latency = Histogram(
    "model_inference_duration_seconds",
    "Model inference latency",
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)

# ── Usage in the FastAPI predict endpoint: ─────────────────────────────
#   prediction_counter.labels(outcome="fraud").inc()
#   prediction_confidence.observe(0.92)
#   input_amount.observe(transaction.amount)
#   with inference_latency.time():
#       result = model.predict(input_df)

# ── Expose /metrics endpoint for Prometheus scraping ───────────────────
async def metrics_endpoint(request):
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
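The drift comparison itself can also live in application code rather than a dashboard rule. A minimal sketch, assuming a hypothetical `TRAINING_AMOUNT_MEAN` baseline captured at training time and the running sum/count already tracked by the `model_input_amount` Summary:

```python
TRAINING_AMOUNT_MEAN = 52.40  # hypothetical baseline, recorded when the model was trained

def amount_drift_alert(production_sum: float, production_count: int,
                       threshold: float = 0.20) -> bool:
    """True when the production mean of `amount` deviates from the
    training-time mean by more than `threshold` (relative)."""
    if production_count == 0:
        return False  # no traffic yet, nothing to compare
    prod_mean = production_sum / production_count
    return abs(prod_mean - TRAINING_AMOUNT_MEAN) / TRAINING_AMOUNT_MEAN > threshold
```

The same comparison can instead be expressed as a PromQL alerting rule over the Summary's `_sum` and `_count` series; doing it in code keeps the baseline versioned alongside the model.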
Tip: pair the exported metrics with Grafana alert rules, e.g. fire when the production amount mean deviates >20% from the training mean.
Module 6 — Common Pitfalls & Anti-Patterns
Pitfall 1: Training-Serving Skew
Training-Serving Skew occurs when the code that computes features during training is different from the code that computes features in the production serving path. The model receives features computed differently than what it was trained on, producing incorrect predictions — silently.
Example: During training, your feature transaction_velocity is computed as "number of transactions in the last 60 minutes" using a Pandas window function. In production, the API engineer reimplements it as "number of transactions in the last 3600 seconds" using a Redis counter. Rounding differences, timezone handling, or edge cases cause subtly different values — enough to degrade the model but not enough to throw an error.
| Root Cause | Training Code | Serving Code | Impact |
|---|---|---|---|
| Different libraries | Pandas .rolling() | SQL WINDOW | Floating point differences |
| Different languages | Python preprocessing | Java/Go preprocessing | Different null handling |
| Stale features | Feature computed on full history | Feature computed on last 24h | Distribution mismatch |
| Missing transforms | Applies StandardScaler | Forgets to scale | Completely wrong predictions |
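A toy reproduction of the skew described above: two independently written implementations of the same "last hour" feature disagree at the window boundary. The timestamps and boundary convention are invented for illustration:

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 1, 12, 0, 30)
# Three transactions: just now, 59 minutes ago, and exactly 60 minutes ago
events = [now - timedelta(minutes=m) for m in (0, 59, 60)]

# Training path: "last 60 minutes", boundary inclusive (e.g. a Pandas window)
train_velocity = sum(1 for e in events if (now - e) <= timedelta(minutes=60))

# Serving path: "last 3600 seconds", boundary exclusive (re-implemented in the API)
serve_velocity = sum(1 for e in events if (now - e).total_seconds() < 3600)

print(train_velocity, serve_velocity)  # 3 vs 2: a silent off-by-one at the boundary
```

Neither implementation throws an error, and both look correct in isolation; only transactions landing exactly on the boundary differ. That is precisely the class of bug that degrades a model without tripping any monitor.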
Solution: Feature Stores
A Feature Store (Feast, Tecton, Databricks Feature Store) ensures the same feature computation code is used for both training and serving. Features are computed once and stored, then served consistently to both the training pipeline and the production API.
```python
# Using Feast — the most popular open-source Feature Store
# pip install feast

# feature_definitions.py — define features ONCE
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

# Entity — the "who" or "what" the features describe
user = Entity(
    name="user_id",
    join_keys=["user_id"],
    description="Unique user identifier",
)

# Feature View — a group of related features computed from the same source
user_transaction_features = FeatureView(
    name="user_transaction_features",
    entities=[user],
    schema=[
        Field(name="transaction_velocity_1h", dtype=Float64),
        Field(name="avg_amount_7d", dtype=Float64),
        Field(name="unique_merchants_30d", dtype=Int64),
    ],
    source=FileSource(
        path="data/user_features.parquet",
        timestamp_field="event_timestamp",
    ),
)

# ── Training: fetch historical features ────────────────────────────────
# from feast import FeatureStore
# store = FeatureStore(repo_path=".")
# training_df = store.get_historical_features(
#     entity_df=entity_df,  # DataFrame with user_id + timestamp
#     features=["user_transaction_features:transaction_velocity_1h",
#               "user_transaction_features:avg_amount_7d"],
# ).to_df()

# ── Serving: fetch online features (same computation!) ─────────────────
# features = store.get_online_features(
#     features=["user_transaction_features:transaction_velocity_1h",
#               "user_transaction_features:avg_amount_7d"],
#     entity_rows=[{"user_id": "user_12345"}],
# ).to_dict()
```
Pitfall 2: Over-Engineering V1
Teams spend 3 months building a Kubeflow + Kubernetes + Istio + Argo stack before deploying a single model. Meanwhile, the business still has zero ML in production. The first model should ship in weeks, not months.
The reality: Most ML use cases in year one can be solved with:
- Batch predictions: A cron job running a Python script on a single EC2 instance that writes predictions to a database.
- Simple API: A single FastAPI + Docker container behind an Application Load Balancer.
- Managed services: SageMaker endpoints, Vertex AI endpoints, Azure ML — zero infrastructure management.
The Right Progression
Start with a batch cron job, then a single containerized API, then managed endpoints, and only then a full orchestrated platform.
Only move to the next level when you have evidence (traffic volume, team size, model count) that the current level is insufficient. Premature optimization is the root of all evil — in infrastructure too.
Pitfall 3: Ignoring the Baseline
A team spends 6 weeks building a Transformer-based model with 92% accuracy. They ship it and declare success. But nobody measured what a simple Logistic Regression achieves: 89% accuracy. The Transformer's 3% improvement may not justify 10x the inference cost and 50x the engineering complexity.
The baseline methodology:
| Level | Baseline Type | Example | Purpose |
|---|---|---|---|
| 0 | Heuristic / Rule-based | "Flag if amount > $5000 and time is 2-5am" | What can you achieve with zero ML? |
| 1 | Simple ML model | Logistic Regression with top 10 features | Minimum viable ML benchmark |
| 2 | Moderate model | Gradient Boosted Trees (XGBoost/LightGBM) | Strong baseline with low complexity |
| 3 | Complex model | Deep learning, Transformers | Only if Level 2 is insufficient |
Deploy Level 0 first. Measure its business impact. Then deploy Level 1 and measure the incremental improvement. This prevents the scenario where a team has a great model but can't prove it's better than a simple if statement.
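The Level 0 row of the table can literally be a few lines of code. A sketch of that heuristic as a function (the name is illustrative):

```python
def level0_fraud_flag(amount: float, hour: int) -> bool:
    """Level 0 baseline from the table: a pure rule, zero ML.
    Flag if amount > $5000 and the local hour is in the 2-5am window."""
    return amount > 5000 and 2 <= hour < 5
```

If a trained model cannot demonstrably beat this function on business metrics, its inference cost and engineering complexity are not yet justified.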
Best Practices Summary
Reference Links
- DVC dvc.org/doc — Official Documentation
- MLflow mlflow.org — MLflow Docs
- W&B docs.wandb.ai — Weights & Biases Documentation
- FastAPI fastapi.tiangolo.com — FastAPI Documentation
- Evidently docs.evidentlyai.com — Evidently AI Docs
- Feast docs.feast.dev — Feast Feature Store
- ONNX onnxruntime.ai — ONNX Runtime
- Docker docs.docker.com — Multi-Stage Builds
- Paper Hidden Technical Debt in Machine Learning Systems (Google, NeurIPS 2015)
- Book Designing Machine Learning Systems — Chip Huyen