
PyTorch: Beginner to Intermediate Handbook

A developer-focused guide to tensors, autograd, model construction, data pipelines, explicit training loops, and the production patterns that make PyTorch useful beyond tutorials.

PyTorch 2.x · CUDA + Apple MPS · Training Loop Fundamentals · April 2026
Why PyTorch still matters: it gives you direct control over tensors, gradients, modules, and optimization steps. That explicitness is why it remains strong for research, custom architectures, debugging, and production systems where the training loop itself is part of the product logic.

Module 1: The Foundation (Tensors & Autograd)

PyTorch is a deep learning framework built around tensors and automatic differentiation. The short version is that it lets you write model code in normal Python while still computing gradients efficiently. Compared to higher-level APIs, PyTorch is usually the better choice when you need custom architectures, dynamic control flow, or tighter visibility into what the model is doing at each step.

Research Fit
Use PyTorch when the model architecture changes frequently, includes conditional logic, or needs custom layers and losses.
Engineering Fit
Use it when the training pipeline is not generic and you need explicit control over batching, gradients, or device placement.
Why Not Only High-Level APIs
Higher-level tools are faster for standard tasks, but they usually abstract away the very pieces you need when experiments stop being standard.
Dynamic Computation Graphs (Define-by-Run)

A useful analogy is spreadsheet formulas that update as you perform operations. PyTorch builds the graph as your Python code runs, so debugging and experimentation feel much closer to ordinary programming.
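A minimal sketch of define-by-run (the function name and threshold below are illustrative): ordinary Python control flow participates in the graph, because the graph is simply whatever actually executed.

```python
import torch


def forward_with_branching(x: torch.Tensor) -> torch.Tensor:
    # The graph is recorded as this code runs, so a plain `if` works:
    # autograd only ever sees the branch that actually executed.
    if x.sum() > 0:
        return x * 2
    return x - 1


a = torch.tensor([1.0, 2.0], requires_grad=True)
out = forward_with_branching(a).sum()
out.backward()  # gradients flow through the branch that ran

print(a.grad)  # tensor([2., 2.]) because the x * 2 branch executed
```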

Tensors vs. NumPy

Tensors are PyTorch's core building block because they combine numerical arrays with hardware awareness and autograd integration. NumPy arrays are great for CPU-based scientific computing. Tensors do that job and add GPU acceleration plus gradient tracking. If NumPy arrays are raw materials in a workshop, tensors are raw materials already mounted onto the machine you plan to use.

from __future__ import annotations

import torch


def get_device() -> torch.device:
    # Prefer CUDA when available, then Apple Metal (MPS), then CPU.
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )


device: torch.device = get_device()

# Create tensors directly in PyTorch.
features: torch.Tensor = torch.tensor(
    [[1.0, 2.0], [3.0, 4.0]],
    dtype=torch.float32,
)
weights: torch.Tensor = torch.tensor(
    [[0.5, 1.0], [1.5, -0.5]],
    dtype=torch.float32,
)

# Move data to the selected device so computation happens there.
features = features.to(device)
weights = weights.to(device)

# Matrix multiplication is a first-class tensor operation.
output: torch.Tensor = features @ weights

print(f"Running on device: {device}")
print(output)

# Move back to CPU before converting to NumPy for external libraries.
output_cpu: torch.Tensor = output.to("cpu")
output_numpy = output_cpu.numpy()
print(output_numpy)
Practical rule: keep data as tensors as long as possible. Every unnecessary conversion between tensor and NumPy adds friction and can break gradient-aware workflows.
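One detail worth knowing when you do cross the boundary: on CPU, torch.from_numpy and .numpy() share the underlying buffer rather than copying, so mutations are visible on both sides. A quick illustration:

```python
import numpy as np
import torch

array = np.ones(3, dtype=np.float32)
tensor = torch.from_numpy(array)  # shares memory with the NumPy array

array[0] = 5.0  # mutate via NumPy...
print(tensor)   # ...and the tensor sees the change: tensor([5., 1., 1.])
```

This zero-copy behavior is convenient but also a common source of surprise bugs, which is another reason to keep conversions rare and deliberate.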

Autograd

Autograd is PyTorch's automatic differentiation engine. It records tensor operations and computes gradients when you call backward(). That matters because training is fundamentally repeated calculus: you make a prediction, measure the error, and ask how each parameter should change to reduce that error. Autograd is the machinery that answers that question.

from __future__ import annotations

import torch


def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )


device: torch.device = get_device()

# requires_grad=True tells PyTorch to track operations on this tensor.
x: torch.Tensor = torch.tensor(3.0, device=device, requires_grad=True)

# Build any differentiable expression using normal Python math syntax.
y: torch.Tensor = x**2 + 2 * x + 1

# Backpropagate through the graph to compute dy/dx.
y.backward()

print(f"y = {y.item():.2f}")
print(f"dy/dx at x=3 is {x.grad.item():.2f}")

# The analytical derivative is 2x + 2, so at x=3 the gradient is 8.

The important design choice is that PyTorch lets gradients emerge from regular code. You do not manually derive each parameter update in day-to-day modeling. You define the forward computation, and autograd derives the backward pass for supported operations.
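To make that concrete, here is a hand-rolled gradient-descent step using nothing but autograd, minimizing a toy quadratic (the learning rate and iteration count are arbitrary):

```python
import torch

w = torch.tensor(10.0, requires_grad=True)
lr = 0.1

for _ in range(50):
    loss = (w - 3.0) ** 2  # minimum at w = 3
    loss.backward()        # autograd computes d(loss)/dw
    with torch.no_grad():  # the update itself must not enter the graph
        w -= lr * w.grad
    w.grad.zero_()         # clear the gradient before the next step

print(f"w converged to {w.item():.3f}")  # close to 3.0
```

This is exactly what an optimizer does for you at scale: the subtract-scaled-gradient step, wrapped with bookkeeping for many parameters.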

Module 2: Building Models (torch.nn)

The torch.nn package gives structure to model code. Instead of managing every parameter tensor manually, you group layers into reusable modules. This is similar to moving from loose functions to a well-encapsulated class design: it organizes state, behavior, and composition in one place.

The nn.Module Base Class

Every trainable PyTorch model subclasses nn.Module. That base class tracks parameters, exposes train() and eval() modes, and makes it easy to move the entire model between CPU, CUDA, and MPS. If tensors are the atoms, nn.Module is the container that turns those atoms into a machine.

from __future__ import annotations

import torch
from torch import nn


class TinyLinearModel(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        # Register a trainable layer so PyTorch can find its parameters.
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # forward defines the data path through the model.
        return self.linear(x)
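A quick way to see what the base class is doing for you is to inspect the registered parameters of the TinyLinearModel above:

```python
import torch
from torch import nn


class TinyLinearModel(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)


model = TinyLinearModel(input_dim=4, output_dim=2)

# nn.Module discovered the layer's weight and bias automatically.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# linear.weight (2, 4)
# linear.bias (2,)
```

Because parameters are registered this way, model.to(device), model.parameters() for the optimizer, and model.state_dict() for saving all work without any manual bookkeeping.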

Building a Network

An MLP is the standard first example because it shows the two responsibilities every model has: define layers in __init__, then define data flow in forward. PyTorch separates those steps so the architecture is explicit and inspectable.

from __future__ import annotations

import torch
from torch import nn


def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )


class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)


device: torch.device = get_device()
model: MLP = MLP(input_dim=16, hidden_dim=32, num_classes=3).to(device)

sample_batch: torch.Tensor = torch.randn(8, 16, device=device)
logits: torch.Tensor = model(sample_batch)

print(logits.shape)  # torch.Size([8, 3])

Module 3: Datasets & DataLoaders

You generally should not load an entire production dataset into RAM because memory becomes the bottleneck before the model does. Real systems stream, batch, shuffle, and prefetch data. The Dataset/DataLoader split exists so data access and model training stay decoupled.

Custom Datasets

A Dataset is the contract for how to fetch one sample at a time. Think of it as the catalog interface for a library: __len__ tells you how many books exist, and __getitem__ tells you how to retrieve one specific book.

from __future__ import annotations

from typing import Tuple

import torch
from torch.utils.data import Dataset


class ToyClassificationDataset(Dataset[Tuple[torch.Tensor, torch.Tensor]]):
    def __init__(self, num_samples: int = 1000, num_features: int = 16, num_classes: int = 3) -> None:
        super().__init__()
        self.features: torch.Tensor = torch.randn(num_samples, num_features)
        self.labels: torch.Tensor = torch.randint(0, num_classes, (num_samples,), dtype=torch.long)

    def __len__(self) -> int:
        return self.features.size(0)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Return exactly one feature tensor and one label tensor.
        return self.features[index], self.labels[index]
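Once the contract is implemented, standard utilities compose with it. For example, torch.utils.data.random_split can carve out a validation set (the 80/20 split below is arbitrary):

```python
import torch
from torch.utils.data import Dataset, random_split


class ToyClassificationDataset(Dataset):
    def __init__(self, num_samples: int = 1000, num_features: int = 16, num_classes: int = 3) -> None:
        super().__init__()
        self.features = torch.randn(num_samples, num_features)
        self.labels = torch.randint(0, num_classes, (num_samples,), dtype=torch.long)

    def __len__(self) -> int:
        return self.features.size(0)

    def __getitem__(self, index: int):
        return self.features[index], self.labels[index]


dataset = ToyClassificationDataset(num_samples=1000)
train_set, val_set = random_split(dataset, [800, 200])

features, label = dataset[0]  # __getitem__ returns exactly one sample
print(len(train_set), len(val_set), features.shape)
```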

DataLoaders

The DataLoader handles batching, shuffling, and worker processes so the GPU or MPS device is not left idle waiting for data. In practice, this is the conveyor belt between storage and training.

from __future__ import annotations

import os

import torch
from torch.utils.data import DataLoader


dataset = ToyClassificationDataset(num_samples=2048, num_features=16, num_classes=3)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0 if os.name == "nt" else 2,
    pin_memory=torch.cuda.is_available(),
)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)
    break

Windows note: worker behavior differs across platforms. Starting with num_workers=0 is the safest baseline when you are debugging data loading locally.

Module 4: The Standard Training Loop

PyTorch makes you write the training loop explicitly because training is where many real experiments differ. Some teams use custom losses, mixed objectives, curriculum learning, dynamic masking, multi-optimizer schedules, or gradient accumulation. A manual loop is not ceremony; it is control.
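As one example of why owning the loop matters, gradient accumulation is a few extra lines rather than a framework feature. A minimal sketch (the model, the random batches, and the accumulation factor are placeholders):

```python
import torch
from torch import nn

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

# Accumulate gradients over 4 small batches before stepping,
# which behaves like training with one batch 4x the size.
accumulation_steps = 4
optimizer.zero_grad()

for step in range(8):
    features = torch.randn(16, 8)
    labels = torch.randint(0, 2, (16,))
    loss = criterion(model(features), labels) / accumulation_steps
    loss.backward()  # gradients add up across backward() calls
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update once per accumulated group
        optimizer.zero_grad()  # then clear for the next group
```

Dividing the loss by accumulation_steps keeps the effective gradient magnitude comparable to a single large batch.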

The 5-Step Process

The core loop is always the same: predict, measure error, clear stale gradients, backpropagate, then update weights. The analogy is steering a vehicle: you observe the road, measure deviation, reset the steering decision, compute the correction, and then actually turn the wheel.

from __future__ import annotations

import os

import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset


def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )


class ToyClassificationDataset(Dataset[tuple[torch.Tensor, torch.Tensor]]):
    def __init__(self, num_samples: int = 1000, num_features: int = 16, num_classes: int = 3) -> None:
        super().__init__()
        self.features: torch.Tensor = torch.randn(num_samples, num_features)
        self.labels: torch.Tensor = torch.randint(0, num_classes, (num_samples,), dtype=torch.long)

    def __len__(self) -> int:
        return self.features.size(0)

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.features[index], self.labels[index]


class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)


device: torch.device = get_device()
dataset = ToyClassificationDataset(num_samples=2048, num_features=16, num_classes=3)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0 if os.name == "nt" else 2,
    pin_memory=torch.cuda.is_available(),
)

model = MLP(input_dim=16, hidden_dim=32, num_classes=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-3)

num_epochs: int = 3

for epoch in range(num_epochs):
    model.train()
    running_loss: float = 0.0

    for batch_features, batch_labels in loader:
        batch_features = batch_features.to(device)
        batch_labels = batch_labels.to(device)

        # 1. Forward pass: compute predictions from the current model weights.
        logits: torch.Tensor = model(batch_features)

        # 2. Calculate the loss: measure how wrong the predictions are.
        loss: torch.Tensor = criterion(logits, batch_labels)

        # 3. Zero the gradients: clear gradient values from the previous step.
        optimizer.zero_grad()

        # 4. Backward pass: compute gradients of the loss with respect to parameters.
        loss.backward()

        # 5. Update weights: apply the optimizer step using the new gradients.
        optimizer.step()

        running_loss += loss.item()

    average_loss: float = running_loss / len(loader)
    print(f"Epoch {epoch + 1}/{num_epochs} - loss: {average_loss:.4f}")
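The training loop pairs with an evaluation pass that disables gradient tracking. A self-contained sketch (the stand-in model and random data below are illustrative; in practice you would reuse the trained model and a held-out loader):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 3)
features = torch.randn(256, 16)
labels = torch.randint(0, 3, (256,))
loader = DataLoader(TensorDataset(features, labels), batch_size=64)

model.eval()           # switch layers like dropout/batchnorm to eval behavior
correct = 0
with torch.no_grad():  # no graph is built, saving memory and time
    for batch_features, batch_labels in loader:
        predictions = model(batch_features).argmax(dim=1)
        correct += (predictions == batch_labels).sum().item()

accuracy = correct / len(loader.dataset)
print(f"accuracy: {accuracy:.3f}")  # roughly chance for an untrained model
```

model.eval() and torch.no_grad() do different jobs: the first changes layer behavior, the second disables autograd. Evaluation code should use both.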

Module 5: Real-World Use Cases

PyTorch is widely used because it scales from experimentation to production with relatively few mental model changes. The same tensor and module concepts you use for a toy MLP also show up in segmentation, diffusion, video restoration, and transformer fine-tuning.

Computer Vision & Video AI

PyTorch is a strong fit for computer vision because convolutional and transformer-based vision models are ultimately tensor operations plus custom losses. The torchvision ecosystem adds datasets, transforms, pretrained backbones, and image utilities so you can focus on the architecture and objective.

Super-Resolution / Upscaling
You train a model to map a low-resolution tensor to a high-resolution tensor, often with pixel loss, perceptual loss, or adversarial objectives.
Frame Interpolation
You model motion between frames and predict intermediate frames, which requires explicit control over warping, losses, and temporal logic.

A good analogy is film restoration. torchvision provides the tools for reading and transforming the footage, while your PyTorch model defines how to reconstruct missing detail, sharpen features, or infer the in-between motion.

from __future__ import annotations

import torch
from torch import nn
from torchvision import models, transforms


class VisionBackbone(nn.Module):
    def __init__(self, num_classes: int) -> None:
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.model = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)


image_transform = transforms.Compose(
    [
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]
)

Natural Language Processing (NLP)

PyTorch is also the foundational engine under much of modern NLP tooling. Libraries like Hugging Face Transformers expose a higher-level interface, but underneath they still rely on PyTorch tensors, modules, optimizers, attention layers, and autograd. If you fine-tune an LLM, you are still ultimately running a PyTorch training loop.

Practical takeaway: even if you use a higher-level library for NLP, understanding PyTorch helps you debug memory issues, customize heads, freeze layers, and reason about performance.
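Freezing, for instance, is just flipping requires_grad on a parameter group. A sketch using a small stand-in model (the "backbone" and "head" labels below are illustrative):

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(16, 32),  # pretend this is a pretrained backbone
    nn.ReLU(),
    nn.Linear(32, 3),   # pretend this is a new task head
)

# Freeze the backbone so only the head receives gradient updates.
for param in model[0].parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

print(sum(p.numel() for p in trainable))  # 99: only the head's parameters
```

Passing only the trainable parameters to the optimizer also avoids wasting optimizer state (momentum buffers and the like) on frozen weights.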

Module 6: Intermediate Features

Once the basics are stable, two features matter immediately in real projects: saving/loading weights correctly and using runtime compilation when it actually helps. These are not advanced for the sake of sophistication; they are table stakes for reproducibility and speed.

Saving & Loading Models

The correct pattern is to save the state_dict() rather than serializing the whole model object. That avoids tight coupling to the exact Python class path and reduces cross-environment loading problems. It is the difference between saving the blueprint and saving a snapshot of the entire workshop.

from __future__ import annotations

import torch
from torch import nn


class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)


model = MLP(input_dim=16, hidden_dim=32, num_classes=3)

# Save only learned weights and buffers.
torch.save(model.state_dict(), "mlp_state_dict.pt")

# Recreate the model architecture before loading weights.
loaded_model = MLP(input_dim=16, hidden_dim=32, num_classes=3)
state_dict = torch.load("mlp_state_dict.pt", map_location="cpu")
loaded_model.load_state_dict(state_dict)
loaded_model.eval()  # Switch to inference mode for predictable behavior.
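For resumable training (as opposed to inference-only export), a common pattern is to bundle model state, optimizer state, and the epoch counter into one checkpoint dict. The key names below are conventional, not required by PyTorch:

```python
import torch
from torch import nn

model = nn.Linear(16, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

checkpoint = {
    "epoch": 7,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Later: rebuild the objects first, then restore their state into them.
restored = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(restored["model_state_dict"])
optimizer.load_state_dict(restored["optimizer_state_dict"])
start_epoch = restored["epoch"] + 1
print(f"resuming from epoch {start_epoch}")
```

Saving optimizer state matters for optimizers like Adam, whose per-parameter moment estimates would otherwise restart cold after a resume.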

Model Compilation

torch.compile(), introduced in PyTorch 2.0, can optimize execution for training and inference by compiling parts of the model graph. Use it as an engineering lever, not a ritual. Benchmark it on your real workload, because speedups vary by model shape and backend.

compiled_model = torch.compile(model)
Recommendation: first make the uncompiled version correct, deterministic enough for debugging, and benchmarked. Then add torch.compile() and measure end-to-end impact instead of assuming it will help every case.
