PyTorch: Beginner to Intermediate Handbook
A developer-focused guide to tensors, autograd, model construction, data pipelines, explicit training loops, and the production patterns that make PyTorch useful beyond tutorials.
Module 1: The Foundation (Tensors & Autograd)
PyTorch is a deep learning framework built around tensors and automatic differentiation. The short version is that it lets you write model code in normal Python while still computing gradients efficiently. Compared to higher-level APIs, PyTorch is usually the better choice when you need custom architectures, dynamic control flow, or tighter visibility into what the model is doing at each step.
A useful analogy is a spreadsheet whose formulas recompute as you edit cells. PyTorch builds the computation graph as your Python code runs, so debugging and experimentation feel much closer to ordinary programming.
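Because the graph is rebuilt on every run, ordinary Python control flow participates in differentiation. A minimal sketch (the values are chosen purely for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# The graph is recorded step by step as this loop runs, so the number of
# operations captured depends on runtime values.
y = x
for _ in range(3):
    y = y * x  # after three iterations, y == x**4

y.backward()
print(x.grad)  # d(x**4)/dx = 4 * x**3 = 32 at x = 2
```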
Tensors vs. NumPy
Tensors are PyTorch's core building block because they combine numerical arrays with hardware awareness and autograd integration. NumPy arrays are great for CPU-based scientific computing. Tensors do that job and add GPU acceleration plus gradient tracking. If NumPy arrays are raw materials in a workshop, tensors are raw materials already mounted onto the machine you plan to use.
from __future__ import annotations
import torch
def get_device() -> torch.device:
    # Prefer CUDA when available, then Apple Metal (MPS), then CPU.
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )
device: torch.device = get_device()
# Create tensors directly in PyTorch.
features: torch.Tensor = torch.tensor(
    [[1.0, 2.0], [3.0, 4.0]],
    dtype=torch.float32,
)
weights: torch.Tensor = torch.tensor(
    [[0.5, 1.0], [1.5, -0.5]],
    dtype=torch.float32,
)
# Move data to the selected device so computation happens there.
features = features.to(device)
weights = weights.to(device)
# Matrix multiplication is a first-class tensor operation.
output: torch.Tensor = features @ weights
print(f"Running on device: {device}")
print(output)
# Move back to CPU before converting to NumPy for external libraries.
output_cpu: torch.Tensor = output.to("cpu")
output_numpy = output_cpu.numpy()
print(output_numpy)
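The bridge also runs the other way. Note that torch.from_numpy shares memory with the source array on CPU, so an edit on one side is visible on the other, which is a common source of subtle bugs:

```python
import numpy as np
import torch

array = np.ones(3, dtype=np.float32)
tensor = torch.from_numpy(array)  # zero-copy: both views share one buffer

array[0] = 5.0
print(tensor)  # the tensor reflects the NumPy edit: tensor([5., 1., 1.])
```

Use tensor.clone() after conversion when you need an independent copy.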
Autograd
Autograd is PyTorch's automatic differentiation engine. It records tensor operations and computes gradients when you call backward(). That matters because training is fundamentally repeated calculus: you make a prediction, measure the error, and ask how each parameter should change to reduce that error. Autograd is the machinery that answers that question.
from __future__ import annotations
import torch
def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )
device: torch.device = get_device()
# requires_grad=True tells PyTorch to track operations on this tensor.
x: torch.Tensor = torch.tensor(3.0, device=device, requires_grad=True)
# Build any differentiable expression using normal Python math syntax.
y: torch.Tensor = x**2 + 2 * x + 1
# Backpropagate through the graph to compute dy/dx.
y.backward()
print(f"y = {y.item():.2f}")
print(f"dy/dx at x=3 is {x.grad.item():.2f}")
# The analytical derivative is 2x + 2, so at x=3 the gradient is 8.
The important design choice is that PyTorch lets gradients emerge from regular code. You do not manually derive each parameter update in day-to-day modeling. You define the forward computation, and autograd derives the backward pass for supported operations.
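The complement is disabling tracking when gradients are not needed, which saves memory and time during inference. A short sketch of the two standard tools:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# torch.no_grad() suspends recording for everything inside the block.
with torch.no_grad():
    y = x * 2
assert not y.requires_grad

# detach() returns a tracked tensor's value without its graph history.
z = (x * 2).detach()
assert not z.requires_grad
```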
Module 2: Building Models (torch.nn)
The torch.nn package gives structure to model code. Instead of managing every parameter tensor manually, you group layers into reusable modules. This is similar to moving from loose functions to a well-encapsulated class design: it organizes state, behavior, and composition in one place.
The nn.Module Base Class
Every trainable PyTorch model subclasses nn.Module. That base class tracks parameters, exposes train() and eval() modes, and makes it easy to move the entire model between CPU, CUDA, and MPS. If tensors are the atoms, nn.Module is the container that turns those atoms into a machine.
from __future__ import annotations
import torch
from torch import nn
class TinyLinearModel(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        # Register a trainable layer so PyTorch can find its parameters.
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # forward defines the data path through the model.
        return self.linear(x)
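Because the layer is registered as an attribute, the base class can enumerate everything trainable. A quick sketch (the class is repeated so the snippet stands alone):

```python
import torch
from torch import nn

class TinyLinearModel(nn.Module):
    def __init__(self, input_dim: int, output_dim: int) -> None:
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

model = TinyLinearModel(input_dim=4, output_dim=2)

# nn.Module found the registered layer, so it can enumerate parameters.
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
# linear.weight (2, 4)
# linear.bias (2,)
```

This is exactly the machinery that model.parameters() hands to an optimizer later.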
Building a Network
An MLP is the standard first example because it shows the two responsibilities every model has: define layers in __init__, then define data flow in forward. PyTorch separates those steps so the architecture is explicit and inspectable.
from __future__ import annotations
import torch
from torch import nn
def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )

class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)
device: torch.device = get_device()
model: MLP = MLP(input_dim=16, hidden_dim=32, num_classes=3).to(device)
sample_batch: torch.Tensor = torch.randn(8, 16, device=device)
logits: torch.Tensor = model(sample_batch)
print(logits.shape) # torch.Size([8, 3])
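As a sanity check on an architecture, counting trainable parameters is a one-liner. For the layer sizes above (16 to 32 to 32 to 3), the expected total is 16*32+32 + 32*32+32 + 32*3+3 = 1699; a sketch using an equivalent nn.Sequential:

```python
import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 3),
)

# Each Linear contributes in_features * out_features weights plus out_features biases.
num_params = sum(p.numel() for p in mlp.parameters() if p.requires_grad)
print(num_params)  # 1699
```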
Module 3: Datasets & DataLoaders
You generally should not load an entire production dataset into RAM because memory becomes the bottleneck before the model does. Real systems stream, batch, shuffle, and prefetch data. The Dataset/DataLoader split exists so data access and model training stay decoupled.
Custom Datasets
A Dataset is the contract for how to fetch one sample at a time. Think of it as the catalog interface for a library: __len__ tells you how many books exist, and __getitem__ tells you how to retrieve one specific book.
from __future__ import annotations
from typing import Tuple
import torch
from torch.utils.data import Dataset
class ToyClassificationDataset(Dataset[Tuple[torch.Tensor, torch.Tensor]]):
    def __init__(self, num_samples: int = 1000, num_features: int = 16, num_classes: int = 3) -> None:
        super().__init__()
        self.features: torch.Tensor = torch.randn(num_samples, num_features)
        self.labels: torch.Tensor = torch.randint(0, num_classes, (num_samples,), dtype=torch.long)

    def __len__(self) -> int:
        return self.features.size(0)

    def __getitem__(self, index: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Return exactly one feature tensor and one label tensor.
        return self.features[index], self.labels[index]
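When the data already lives in tensors, the built-in TensorDataset satisfies the same contract without a custom class:

```python
import torch
from torch.utils.data import TensorDataset

features = torch.randn(100, 16)
labels = torch.randint(0, 3, (100,))

dataset = TensorDataset(features, labels)
print(len(dataset))      # 100
x, y = dataset[0]        # one (feature, label) pair, just like __getitem__
print(x.shape, y.shape)  # torch.Size([16]) torch.Size([])
```

Write a custom Dataset when samples need per-item work such as decoding files or applying transforms.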
DataLoaders
The DataLoader handles batching, shuffling, and worker processes so the GPU or MPS device is not left idle waiting for data. In practice, this is the conveyor belt between storage and training.
from __future__ import annotations
import os
import torch
from torch.utils.data import DataLoader
dataset = ToyClassificationDataset(num_samples=2048, num_features=16, num_classes=3)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0 if os.name == "nt" else 2,
    pin_memory=torch.cuda.is_available(),
)

for batch_features, batch_labels in loader:
    print(batch_features.shape, batch_labels.shape)
    break
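A closely related everyday task is splitting one dataset into train and validation subsets before building loaders; a sketch using random_split (the sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 3, (1000,)))

# An 80/20 split; a fixed generator keeps the split reproducible.
generator = torch.Generator().manual_seed(42)
train_set, val_set = random_split(dataset, [800, 200], generator=generator)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)
print(len(train_set), len(val_set))  # 800 200
```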
num_workers=0 is the safest baseline when you are debugging data loading locally.
Module 4: The Standard Training Loop
PyTorch makes you write the training loop explicitly because training is where many real experiments differ. Some teams use custom losses, mixed objectives, curriculum learning, dynamic masking, multi-optimizer schedules, or gradient accumulation. A manual loop is not ceremony; it is control.
The 5-Step Process
The core loop is always the same: predict, measure error, clear stale gradients, backpropagate, then update weights. The analogy is steering a vehicle: you observe the road, measure deviation, reset the steering decision, compute the correction, and then actually turn the wheel.
from __future__ import annotations
import os
import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
def get_device() -> torch.device:
    return torch.device(
        "cuda"
        if torch.cuda.is_available()
        else "mps"
        if torch.backends.mps.is_available()
        else "cpu"
    )

class ToyClassificationDataset(Dataset[tuple[torch.Tensor, torch.Tensor]]):
    def __init__(self, num_samples: int = 1000, num_features: int = 16, num_classes: int = 3) -> None:
        super().__init__()
        self.features: torch.Tensor = torch.randn(num_samples, num_features)
        self.labels: torch.Tensor = torch.randint(0, num_classes, (num_samples,), dtype=torch.long)

    def __len__(self) -> int:
        return self.features.size(0)

    def __getitem__(self, index: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.features[index], self.labels[index]

class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)
device: torch.device = get_device()
dataset = ToyClassificationDataset(num_samples=2048, num_features=16, num_classes=3)
loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=0 if os.name == "nt" else 2,
    pin_memory=torch.cuda.is_available(),
)
model = MLP(input_dim=16, hidden_dim=32, num_classes=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-3)
num_epochs: int = 3

for epoch in range(num_epochs):
    model.train()
    running_loss: float = 0.0
    for batch_features, batch_labels in loader:
        batch_features = batch_features.to(device)
        batch_labels = batch_labels.to(device)
        # 1. Forward pass: compute predictions from the current model weights.
        logits: torch.Tensor = model(batch_features)
        # 2. Calculate the loss: measure how wrong the predictions are.
        loss: torch.Tensor = criterion(logits, batch_labels)
        # 3. Zero the gradients: clear gradient values from the previous step.
        optimizer.zero_grad()
        # 4. Backward pass: compute gradients of the loss with respect to parameters.
        loss.backward()
        # 5. Update weights: apply the optimizer step using the new gradients.
        optimizer.step()
        running_loss += loss.item()
    average_loss: float = running_loss / len(loader)
    print(f"Epoch {epoch + 1}/{num_epochs} - loss: {average_loss:.4f}")
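As one example of the control the explicit loop buys, gradient accumulation needs only a few changed lines: several small batches contribute gradients before a single optimizer step, simulating a larger effective batch. A self-contained sketch (the model and data are stand-ins):

```python
import torch
from torch import nn
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 3)
criterion = nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=1e-3)
loader = DataLoader(
    TensorDataset(torch.randn(256, 16), torch.randint(0, 3, (256,))),
    batch_size=16,
)

accumulation_steps = 4  # effective batch size becomes 16 * 4 = 64
optimizer.zero_grad()
for step, (features, labels) in enumerate(loader):
    loss = criterion(model(features), labels)
    # Scale so the accumulated gradient matches one large-batch average.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note the zero_grad call moves outside the per-batch body: gradients deliberately persist across the accumulated batches.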
Module 5: Real-World Use Cases
PyTorch is widely used because it scales from experimentation to production with relatively few mental model changes. The same tensor and module concepts you use for a toy MLP also show up in segmentation, diffusion, video restoration, and transformer fine-tuning.
Computer Vision & Video AI
PyTorch is a strong fit for computer vision because convolutional and transformer-based vision models are ultimately tensor operations plus custom losses. The torchvision ecosystem adds datasets, transforms, pretrained backbones, and image utilities so you can focus on the architecture and objective.
A good analogy is film restoration. torchvision provides the tools for reading and transforming the footage, while your PyTorch model defines how to reconstruct missing detail, sharpen features, or infer the in-between motion.
from __future__ import annotations
import torch
from torch import nn
from torchvision import models, transforms
class VisionBackbone(nn.Module):
    def __init__(self, num_classes: int) -> None:
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        self.model = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)

image_transform = transforms.Compose(
    [
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ]
)
Natural Language Processing (NLP)
PyTorch is also the foundational engine under much of modern NLP tooling. Libraries like Hugging Face Transformers expose a higher-level interface, but underneath they still rely on PyTorch tensors, modules, optimizers, attention layers, and autograd. If you fine-tune an LLM, you are still ultimately running a PyTorch training loop.
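To make that concrete, the scaled dot-product attention at the core of those models is only a few tensor operations. A minimal unoptimized sketch, not the fused kernels such libraries actually ship:

```python
import math
import torch

def scaled_dot_product_attention(
    query: torch.Tensor, key: torch.Tensor, value: torch.Tensor
) -> torch.Tensor:
    # Similarity scores between every query and key position.
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ value

# A batch of 2 sequences, 5 tokens each, 8-dimensional embeddings.
q = torch.randn(2, 5, 8)
k = torch.randn(2, 5, 8)
v = torch.randn(2, 5, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([2, 5, 8])
```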
Module 6: Intermediate Features
Once the basics are stable, two features matter immediately in real projects: saving/loading weights correctly and using runtime compilation when it actually helps. These are not advanced for the sake of sophistication; they are table stakes for reproducibility and speed.
Saving & Loading Models
The correct pattern is to save the state_dict() rather than serializing the whole model object. That avoids tight coupling to the exact Python class path and reduces cross-environment loading problems. It is the difference between saving the blueprint and saving a snapshot of the entire workshop.
from __future__ import annotations
import torch
from torch import nn
class MLP(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int) -> None:
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.network(x)
model = MLP(input_dim=16, hidden_dim=32, num_classes=3)
# Save only learned weights and buffers.
torch.save(model.state_dict(), "mlp_state_dict.pt")
# Recreate the model architecture before loading weights.
loaded_model = MLP(input_dim=16, hidden_dim=32, num_classes=3)
state_dict = torch.load("mlp_state_dict.pt", map_location="cpu")
loaded_model.load_state_dict(state_dict)
loaded_model.eval() # Switch to inference mode for predictable behavior.
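For resuming interrupted training, rather than deploying, the usual pattern is to checkpoint the optimizer state and epoch counter alongside the weights; a sketch (the file name is illustrative):

```python
import torch
from torch import nn
from torch.optim import Adam

model = nn.Linear(16, 3)
optimizer = Adam(model.parameters(), lr=1e-3)

checkpoint = {
    "epoch": 5,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# Later: rebuild the objects, then restore both states to resume training.
restored = torch.load("checkpoint.pt", map_location="cpu")
model.load_state_dict(restored["model_state_dict"])
optimizer.load_state_dict(restored["optimizer_state_dict"])
start_epoch = restored["epoch"] + 1
```

Without the optimizer state, adaptive optimizers like Adam lose their running moment estimates and effectively restart cold.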
Model Compilation
torch.compile(), introduced in PyTorch 2.0, can optimize execution for training and inference by compiling parts of the model graph. Use it as an engineering lever, not a ritual. Benchmark it on your real workload, because speedups vary by model shape and backend.
compiled_model = torch.compile(model)
Wrap the model with torch.compile() and measure end-to-end impact instead of assuming it will help every case.
Reference Links
- PyTorch: Official PyTorch documentation
- Autograd: Autograd tutorial
- torch.nn: torch.nn reference
- Data: Dataset and DataLoader docs
- torchvision: torchvision documentation
- Compile: PyTorch 2 compile overview