Pre-LLM Foundations
Deep Learning Handbook
Essential architectures and concepts every ML practitioner must know before diving into large language models — CNNs, RNNs, LSTMs, Transformers, attention, embeddings, transfer learning, and fine-tuning.
The conceptual stack — what you need to understand first
Large language models like GPT-4 or Claude are not magic — they are Transformer architectures trained at scale. Understanding that architecture, and the progression from basic neural networks to CNNs to RNNs to the Transformer, gives you a working mental model for everything in modern AI. Without it, fine-tuning becomes cargo-culting and debugging becomes guesswork.
Layers & Parameters
Deep networks are function compositions. Each layer learns a representation. More layers → more abstract features. Parameters are learned via backpropagation + gradient descent.
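To make "parameters are learned via gradient descent" concrete, here is a minimal pure-Python sketch that fits a single linear unit. The toy data (drawn from y = 2x + 1), the learning rate, and the step count are assumptions chosen for illustration; real networks rely on automatic differentiation rather than hand-written gradients.

```python
# Toy data from y = 2x + 1 (made up for this sketch)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit pred = w*x + b by gradient descent on the mean of 0.5 * (pred - y)^2."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y      # dL/dpred
            grad_w += err * x / n      # chain rule: dL/dw = dL/dpred * dpred/dw
            grad_b += err / n          # chain rule: dL/db = dL/dpred * 1
        w -= lr * grad_w               # step against the gradient
        b -= lr * grad_b
    return w, b

w, b = train_linear(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Backpropagation is the same chain-rule bookkeeping applied layer by layer through a deeper composition.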
Representations
Deep learning is representation learning. Raw pixels → edges → shapes → objects. Raw tokens → subword embeddings → contextual meaning. Each concept builds on the last.
The Scaling Hypothesis
At sufficient scale, the same architecture — stacked Transformer blocks — handles vision, language, code, and multimodal tasks. Understanding the unit helps you reason about the whole.
📌
Reading order: Frameworks → CNN → RNN → LSTM → Transformer → Embeddings → Attention → Transfer Learning → Fine-tuning. Each section links to the next conceptually.
⚙️
Frameworks: PyTorch vs TensorFlow/Keras
When to use each, and how they compare
⚡ PyTorch
Paradigm: Define-by-Run (dynamic graph)
Best for: Research, custom architectures
Debugging: Native Python, easy pdb/print
Ecosystem: HuggingFace, Lightning, torchvision
Industry adoption: Dominant in research (2024)
🔶 TensorFlow / Keras
Paradigm: Eager execution by default in TF 2.x; static graphs via tf.function
Best for: Production pipelines, mobile (TFLite)
Deployment: TF Serving, TF.js, TPU-native
Keras 3: Multi-backend (TF, JAX, PyTorch)
Industry adoption: Strong in production / GCP
PyTorch — Core Concepts
python · pytorch
import torch
import torch.nn as nn
import torch.optim as optim

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── Tensors: the fundamental data structure ──
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)
x = torch.randn(32, 3, 224, 224)             # batch of 32 RGB images
x = x.to(device)                             # move to GPU if one is available

# ── Defining a model ──
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = SimpleNet().to(device)

# ── Loss + optimizer ──
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# ── One training step ──
# Dummy batch for illustration: 32 flattened 28×28 inputs with integer class labels
x_batch = torch.randn(32, 784, device=device)
y_batch = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()                 # clear gradients
output = model(x_batch)               # forward pass
loss = criterion(output, y_batch)     # compute loss
loss.backward()                       # backprop → compute gradients
optimizer.step()                      # update weights
TensorFlow / Keras — Core Concepts
python · keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ── Sequential API (simple pipelines) ──
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])

# ── Functional API (multiple inputs/outputs, shared layers) ──
inputs = keras.Input(shape=(784,))
x = layers.Dense(256, activation='relu')(inputs)
outputs = layers.Dense(10, activation='softmax')(x)
model = keras.Model(inputs, outputs)

# ── Compile + train (high-level) ──
# x_train, y_train, x_val, y_val are assumed to be arrays you have already loaded
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
model.fit(x_train, y_train, epochs=10, batch_size=32,
          validation_data=(x_val, y_val))

# ── Custom training loop ──
loss_fn = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
✅
Rule of thumb: If you're reading papers and writing custom layers, use PyTorch. If you're deploying to production on GCP or building mobile apps, use TensorFlow. For learning, PyTorch's Pythonic style makes debugging far easier for beginners.
🔲
Convolutional Neural Networks (CNNs)
Spatial feature extraction — the backbone of computer vision
CNNs exploit spatial locality: instead of connecting every neuron to every pixel (fully connected), a small kernel slides across the input, sharing weights. This makes the layer translation-equivariant (a shifted input yields a correspondingly shifted feature map, and pooling then adds a degree of invariance) and drastically reduces parameters. They are the canonical architecture for images, and their ideas (locality, hierarchy, weight sharing) influenced all subsequent architectures.
Convolution: Slide a small kernel, compute dot products. Detects local patterns (edges, textures, shapes)
ReLU: max(0, x), the non-linearity. Without it, stacking layers collapses to a linear function
BatchNorm: Normalize activations per-batch. Stabilizes training, acts as regularizer
MaxPool: Take the max in a window — reduces spatial size, builds invariance
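The building blocks above can be sketched in a few lines of pure Python. The 5×5 image and the vertical-edge kernel are made-up toy values, and the helper names (`conv2d`, `relu`, `max_pool2x2`) are illustrative rather than any library's API. Real frameworks run the same arithmetic batched and on the GPU.

```python
def conv2d(image, kernel):
    """Valid cross-correlation (what DL frameworks call convolution): stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2x2(fmap):
    """Max over non-overlapping 2x2 windows (a trailing odd row/column is dropped)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# Toy 5x5 image with a vertical edge: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 1, 1] for _ in range(5)]
# The same 3x3 weights are reused at every position (weight sharing)
kernel = [[-1, 0, 1] for _ in range(3)]

features = relu(conv2d(image, kernel))   # 3x3 map, large where the edge is
pooled = max_pool2x2(features)           # 1x1 summary of the strongest response
print(features, pooled)
```

The kernel responds only where intensity changes from left to right, which is the sense in which a convolution "detects a local pattern".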
Landmark Architectures
LeNet-5 (1998): First successful CNN — digit recognition
AlexNet (2012): Ignited the deep learning revolution — won ImageNet by a wide margin
ResNet (2015): Skip connections ease gradient flow and the degradation problem in very deep networks, enabling 100+ layer networks
EfficientNet (2019): Compound scaling of depth, width, and resolution; a strong, efficient default for many vision tasks
📐
Output size formula: For a conv layer with input size W, kernel K, padding P, stride S: out = ⌊(W - K + 2P) / S⌋ + 1. With K=3, P=1, S=1 you get "same" padding (output = input size).
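The formula is easy to check numerically. The sketch below is a direct transcription; the function name `conv_output_size` and the example layer shapes are chosen here for illustration.

```python
def conv_output_size(w, k, p=0, s=1):
    """out = floor((W - K + 2P) / S) + 1 for one spatial dimension."""
    return (w - k + 2 * p) // s + 1

same = conv_output_size(224, k=3, p=1, s=1)    # "same" padding keeps 224
stem = conv_output_size(224, k=7, p=3, s=2)    # a 7x7 stride-2 conv halves it to 112
valid = conv_output_size(32, k=5)              # no padding shrinks 32 to 28
print(same, stem, valid)
```

Applying the function once per layer is a quick way to track feature-map sizes before sizing the first fully connected layer.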
🔄
Recurrent Neural Networks (RNNs)
Processing sequences — the idea before Transformers
RNNs process sequences by maintaining a hidden state that is updated at each time step. The same weights are applied at every step (weight sharing over time), making them suitable for text, audio, and time series. The core limitation — the vanishing gradient problem over long sequences — motivated LSTM and eventually the Transformer.
x₁ (t=1) → RNN cell: h₁ = f(h₀, x₁) → RNN cell: h₂ = f(h₁, x₂) → RNN cell: h₃ = f(h₂, x₃) → output ŷ
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
h_t = new hidden state · W_hh = recurrent weight · W_xh = input weight · tanh squashes to [-1, 1]
Vanishing gradients: During backpropagation through time (BPTT), the gradient is multiplied by the recurrent Jacobian (W_hh scaled by the tanh derivative) at every step. Over long sequences these products tend to shrink exponentially (or, with large weights, explode), so the network can't learn long-range dependencies. This is why vanilla RNNs rarely work beyond ~20-50 time steps, and why LSTM was invented.
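A scalar RNN is enough to see the effect. In the sketch below the weights (w_hh = 0.9, w_xh = 1.0) and the constant input are arbitrary assumptions; the point is only that ∂h_T/∂h_0 is a product of T per-step factors, each at most |w_hh| in magnitude.

```python
import math

def rnn_forward(xs, w_hh, w_xh, b_h=0.0, h0=0.0):
    """Scalar version of h_t = tanh(w_hh * h_{t-1} + w_xh * x_t + b_h)."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_hh * hs[-1] + w_xh * x + b_h))
    return hs

def grad_last_wrt_first(hs, w_hh):
    """d h_T / d h_0 by the chain rule: product over t of w_hh * (1 - h_t^2)."""
    g = 1.0
    for h in hs[1:]:
        g *= w_hh * (1.0 - h * h)   # derivative of tanh is 1 - tanh^2
    return g

xs = [0.5] * 50
hs = rnn_forward(xs, w_hh=0.9, w_xh=1.0)
grads = {T: grad_last_wrt_first(hs[:T + 1], 0.9) for T in (5, 20, 50)}
for T, g in grads.items():
    print(f"T={T:>2}  dh_T/dh_0 = {g:.3e}")
```

The gradient reaching the first step shrinks by orders of magnitude as T grows, so early inputs barely influence the weight update.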
🔒
Long Short-Term Memory (LSTM)
Gated memory — solving the vanishing gradient problem
LSTM (Hochreiter & Schmidhuber, 1997) adds a cell state, a memory highway that runs straight through the network, controlled by three gates. Because the cell state is updated additively, the network can selectively remember or forget information over far longer spans than a vanilla RNN, which made it the dominant sequence model for years before Transformers.
The Three Gates
Forget gate (f): What to erase from cell state. f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate (i): What new information to store. i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Output gate (o): What part of cell state to expose as hidden state. o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Cell State Update
Candidate: C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
Update: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output: h_t = o_t ⊙ tanh(C_t)
σ = sigmoid (values 0-1 act as "how much to let through")
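The gate equations can be traced by hand with a scalar cell. In this sketch every weight is a single number and the parameter values are picked by hand (not learned) so that the forget gate stays near 1 and the input gate near 0, which is an assumption made purely to show the "memory highway" behaviour.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step. p maps a gate name to (w_h, w_x, b)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b
    f = sigmoid(pre("f"))             # forget gate
    i = sigmoid(pre("i"))             # input gate
    o = sigmoid(pre("o"))             # output gate
    c_tilde = math.tanh(pre("c"))     # candidate cell value
    c_t = f * c_prev + i * c_tilde    # additive cell-state update
    h_t = o * math.tanh(c_t)          # exposed hidden state
    return h_t, c_t

# Hand-picked parameters: forget gate ~1 (keep), input gate ~0 (ignore new input)
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 10.0), "c": (0.0, 1.0, 0.0)}

h, c = 0.0, 0.7                       # start with 0.7 stored in the cell
for t in range(100):
    h, c = lstm_step(x_t=math.sin(t), h_prev=h, c_prev=c, p=params)
print(f"cell state after 100 steps: {c:.3f}")
```

With the gates in this regime the stored value survives 100 noisy steps almost unchanged, which a plain tanh recurrence cannot do.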
python · pytorch lstm
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3 if num_layers > 1 else 0.0,  # inter-layer dropout needs 2+ layers
            bidirectional=True,   # run in both directions; doubles the output dim
        )
        self.dropout = nn.Dropout(0.4)
        # bidirectional: concatenated hidden state has size 2 × hidden_size
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) token ids
        embedded = self.dropout(self.embedding(x))      # (B, T, embed_dim)
        # out: (B, T, 2*H) | h_n, c_n: (layers*2, B, H) each
        out, (h_n, c_n) = self.lstm(embedded)
        # Concatenate the top layer's final hidden states from both directions
        h_last = torch.cat([h_n[-2], h_n[-1]], dim=1)   # (B, 2*H)
        return self.fc(self.dropout(h_last))

# ── GRU: simpler alternative to LSTM ──
# Merges forget + input into a single update gate. Fewer parameters, often comparable performance.
# Can replace self.lstm above; note that it returns (out, h_n) with no cell state.
embed_dim, hidden_size = 128, 256   # example sizes
gru = nn.GRU(input_size=embed_dim, hidden_size=hidden_size,
             num_layers=2, batch_first=True, bidirectional=True)
🔑
LSTM vs GRU: GRU (2014) simplifies LSTM by merging the forget and input gates into a single "update gate" and eliminating the separate cell state. GRU has fewer parameters, usually trains faster, and performs comparably on many tasks. Both have largely been superseded by Transformers for long-range tasks, but they remain useful for short sequences and for streaming settings where inputs must be processed one step at a time.
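The "fewer parameters" claim can be quantified. The sketch below counts parameters for one unidirectional layer using the layout PyTorch uses for nn.LSTM and nn.GRU (one input matrix, one recurrent matrix, and two bias vectors per gate block); the sizes 128 and 256 are example values, not from the text.

```python
def rnn_layer_params(input_size, hidden_size, n_blocks):
    """Parameters in one unidirectional recurrent layer.

    n_blocks = 4 for LSTM (forget, input, output, candidate),
    n_blocks = 3 for GRU (reset, update, candidate).
    """
    per_block = (input_size * hidden_size      # input weights
                 + hidden_size * hidden_size   # recurrent weights
                 + 2 * hidden_size)            # two bias vectors (b_ih and b_hh)
    return n_blocks * per_block

lstm_params = rnn_layer_params(128, 256, n_blocks=4)
gru_params = rnn_layer_params(128, 256, n_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)
```

For equal sizes a GRU layer therefore has three quarters of the parameters of an LSTM layer.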
⚡
Transformers
"Attention Is All You Need" — the architecture powering modern AI
Introduced by Vaswani et al. in 2017, the Transformer replaces recurrence entirely with self-attention: every position in the sequence attends to every other position simultaneously, in parallel. This solved the long-range dependency problem and enabled massive GPU parallelism — the key to scaling.
Input tokens → Embedding (token + position) → [Multi-head self-attention → Feed-forward network] × N layers → Projection to logits
python · pytorch transformer block
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Single Transformer layer (Pre-LN variant)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),               # GELU (not ReLU) used in most modern models
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        # Pre-LN: normalize BEFORE attention (more stable than original Post-LN)
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, attn_mask=mask, need_weights=False)
        x = x + self.dropout(attn_out)                 # residual connection
        # Feed-forward sublayer
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

class TransformerLM(nn.Module):
    """Decoder-only language model (GPT-style)."""

    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positional embeddings
        self.blocks = nn.ModuleList([TransformerBlock(d_model, n_heads) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos                                   # (B, T, d_model)
        # Boolean causal mask on the input's device: True = position may NOT be attended to
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1
        )
        for block in self.blocks:
            x = block(x, mask=causal_mask)              # each block attends causally
        logits = self.head(self.norm(x))                # (B, T, vocab_size)
        return logits
Encoder vs Decoder vs Encoder-Decoder
Encoder only (BERT): Bidirectional. Sees all tokens. Best for classification, NER, QA (extractive)
Decoder only (GPT): Causal. Each token only sees past. Best for generation
Encoder-Decoder (T5, BART): Input fully attended, then decoded auto-regressively. Best for translation, summarization
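The difference between "sees all tokens" and "only sees the past" comes down to the attention mask. The sketch below uses the convention True = may attend; this is a choice made here, and libraries differ (the boolean attn_mask of PyTorch's nn.MultiheadAttention uses True to mean blocked).

```python
def bidirectional_mask(T):
    """Encoder-style: every position may attend to every position."""
    return [[True] * T for _ in range(T)]

def causal_mask(T):
    """Decoder-style: query position i may attend to key positions j <= i only."""
    return [[j <= i for j in range(T)] for i in range(T)]

def show(mask):
    """Render a mask with 1 = may attend, 0 = blocked."""
    return "\n".join(" ".join("1" if m else "0" for m in row) for row in mask)

print(show(bidirectional_mask(4)))
print()
print(show(causal_mask(4)))
```

The lower-triangular pattern is what lets a decoder train on all positions of a sequence in parallel without leaking future tokens.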
Critical Design Choices
Positional encoding: Sin/cos (original), learned, RoPE, ALiBi — tells the model token positions
Layer normalization: Pre-LN (inside residual) is more stable than original Post-LN
Activation: GELU preferred over ReLU in modern models
Residual connections: Essential — allow gradients to flow through deep stacks unchanged
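As one concrete instance of the positional-encoding options above, here is a pure-Python sketch of the original sin/cos scheme, where PE(pos, 2k) = sin(pos / 10000^(2k/d_model)) and PE(pos, 2k+1) uses cos of the same angle. The function name and the small sizes are chosen for illustration.

```python
import math

def sinusoidal_positions(max_len, d_model):
    """Return a max_len x d_model table of fixed sin/cos positional encodings."""
    assert d_model % 2 == 0, "d_model must be even"
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):             # i is the even index 2k
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positions(max_len=4, d_model=6)
for row in pe:
    print([round(v, 3) for v in row])
```

Each pair of dimensions oscillates at a different wavelength, so every position gets a distinct vector that is simply added to the token embedding.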
🗂️
Embeddings
Mapping discrete symbols into dense vector spaces
An embedding is a learned mapping from a discrete symbol (word, token, user ID, pixel patch) to a dense vector. The embedding space captures semantic relationships: similar items land near each other. This is how neural networks work with discrete data — they turn it into a geometry the network can reason over.
Word2Vec / GloVe (Static Embeddings, Pre-Transformer Era)
Trained by predicting surrounding words (Word2Vec) or global co-occurrence (GloVe)
Static: one vector per word, regardless of context
Famous property: king - man + woman ≈ queen
Still useful as lightweight baselines and for understanding the embedding concept
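The analogy property can be illustrated with cosine similarity. The 3-dimensional vectors below are hand-made toy values (roughly "royalty", "male", "female" axes), not real Word2Vec or GloVe embeddings, which have hundreds of dimensions learned from text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, invented for this sketch
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.1, 0.1],
}

# king - man + woman, then look for the nearest remaining word
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
candidates = {word: vec for word, vec in emb.items()
              if word not in {"king", "man", "woman"}}
nearest = max(candidates, key=lambda word: cosine(target, candidates[word]))
print(nearest)
```

Excluding the three query words from the candidates follows common practice when evaluating analogies.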
Contextual Embeddings (Transformer Era)
Each token gets a different vector depending on its context (BERT, GPT)
"bank" in "river bank" ≠ "bank" in "bank account"
The hidden states at each layer ARE the contextual embeddings
Sentence embeddings: pool or use [CLS] token for whole-sequence representation
Embeddings are the interface: Every modality needs an embedding — images (patch embeddings in ViT), audio (waveform patches), code tokens, user IDs in recommendation systems. Understanding how embeddings work is the foundation for understanding multimodal models and RAG (retrieval-augmented generation) in LLMs.
👁️
Attention Mechanism
Scaled dot-product attention — the core of Transformers
Attention answers the question: for each position in the output, which positions in the input should I focus on? The Transformer's self-attention computes this as a weighted sum of values, where weights come from comparing a query vector against all key vectors.
Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V
Q = Query matrix (what we're looking for) ·
K = Key matrix (what each position offers) ·
V = Value matrix (what each position contains) ·
√d_k = scaling factor: keeps dot products from growing with d_k, which would saturate the softmax and shrink its gradients
python · scaled dot-product attention from scratch
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    mask: (seq_len, seq_len); 1/True = may attend, 0/False = masked out
    """
    d_k = Q.shape[-1]
    # (B, H, T, T): affinity scores for every query-key pair
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Causal mask: prevent attending to future tokens
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)           # attention distribution
    return torch.matmul(weights, V), weights      # weighted sum of values

class MultiHeadAttention(nn.Module):
    """Multi-head attention: run attention in parallel over h subspaces."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # Single weight matrix for Q, K, V (efficient)
        self.W_qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and split into heads
        qkv = self.W_qkv(x).chunk(3, dim=-1)                 # 3 × (B, T, C)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k)
                    .transpose(1, 2) for t in qkv]           # (B, H, T, d_k)
        attn_out, weights = scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads back and project
        out = attn_out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_out(out), weights
Why Multi-Head?
Running attention in h parallel heads, each with a lower-dimensional subspace, lets the model attend to different aspects simultaneously — one head might track syntactic relationships, another semantic similarity, another positional proximity. The outputs are concatenated and projected back to the original dimension.
Variants
Cross-attention: Q from decoder, K and V from encoder — used in encoder-decoder models for translation
Grouped Query Attention (GQA): Fewer K/V heads than Q heads — reduces KV cache memory at inference. Used in Llama 3, Mistral
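Flash Attention: IO-aware tiling and recomputation that avoids materializing the full T×T score matrix in GPU main memory; reported 2-4× faster than a standard implementation. Doesn't change the math, only the computation order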
📐
Quadratic complexity: Self-attention costs O(T²·d) time, and a naive implementation also stores a T×T score matrix per head (O(T²) memory), because every token attends to every other token. For T=1024 this is fine. For T=100K (long documents), it's a bottleneck. This motivates sparse attention (Longformer), linear attention, and sliding window attention (Mistral).
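A quick back-of-the-envelope calculation shows why T=1024 is comfortable and T=100K is not. The sketch counts only the T×T score matrix of a naive implementation; the 2 bytes per element (fp16) and the single head and single sequence are assumptions for illustration.

```python
def attention_scores_bytes(seq_len, n_heads=1, batch=1, bytes_per_el=2):
    """Memory for the (batch, heads, T, T) score matrix of naive attention."""
    return batch * n_heads * seq_len * seq_len * bytes_per_el

short = attention_scores_bytes(1024)       # 2 MiB for one head of one sequence
long = attention_scores_bytes(100_000)     # 20 GB for one head of one sequence
print(f"T=1024:   {short / 2**20:.1f} MiB")
print(f"T=100000: {long / 1e9:.1f} GB")
print(f"growth:   {long / short:,.0f}x for a ~98x longer sequence")
```

Multiplying by the number of heads and layers makes the long-context case impractical without the sparse, windowed, or tiled variants mentioned above.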
♻️
Transfer Learning
Reusing representations learned on large datasets
Training deep models from scratch requires massive datasets and compute. Transfer learning sidesteps this: take a model pre-trained on a large general dataset (ImageNet, The Pile, Common Crawl) and adapt it to your specific task. The model has already learned useful low-level features (edges, textures, grammar, factual knowledge) that transfer.
Feature Extraction (Frozen)
Freeze all pre-trained weights. Only train a new classification head on top. Fast, memory-efficient, works with tiny datasets. Use when your dataset is very small and reasonably close to the pre-training domain; frozen features transfer less well when the domains differ a lot.
Full Fine-tuning
Unfreeze all (or most) layers and continue training on your data, typically with a much smaller learning rate. More expensive but reaches higher accuracy when you have sufficient task data (thousands+ examples).
python · transfer learning with torchvision
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5

# ── Load pre-trained ResNet-50 ──
model = models.resnet50(weights='IMAGENET1K_V2')

# ── Strategy 1: Feature extraction, freeze backbone ──
for param in model.parameters():
    param.requires_grad = False

# Replace final FC layer (the head); new layers are trainable, so only this trains
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, NUM_CLASSES),
)

# ── Strategy 2: Fine-tune everything ──
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Differential learning rates: lower lr for backbone, higher for head
optimizer = torch.optim.AdamW([
    {'params': [p for n, p in model.named_parameters() if not n.startswith('fc.')],
     'lr': 1e-5},                       # backbone: small lr
    {'params': model.fc.parameters(),
     'lr': 1e-3},                       # head: normal lr
])

# ── Strategy 3: Progressive unfreezing (ULMFiT pattern) ──
def unfreeze_layers(model, n):
    """Unfreeze the last n top-level children of the model."""
    for child in list(model.children())[-n:]:
        for param in child.parameters():
            param.requires_grad = True

unfreeze_layers(model, 3)   # for ResNet-50: layer4, avgpool, fc
✅
Practical transfer learning recipe: (1) Freeze backbone + train head for 2-3 epochs. (2) Unfreeze + fine-tune with 10× smaller LR for 5-10 epochs. (3) Use cosine LR schedule with warmup. This approach typically outperforms training from scratch when you have fewer than roughly 50K samples.
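Step (3) of the recipe, a cosine schedule with warmup, fits in one function. This is a generic sketch rather than any particular library's scheduler; the name `lr_at`, the 10% warmup ratio, and the example step counts are assumptions.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_ratio=0.1, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 1000
for step in (0, 50, 99, 550, 1000):
    print(step, f"{lr_at(step, total, base_lr=1e-3):.2e}")
```

The warmup phase keeps early updates small while the freshly initialized head produces large, noisy gradients; the cosine tail then lowers the rate smoothly so the pre-trained weights settle.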
🎛️
Fine-tuning
Adapting pre-trained language models to specific tasks
In the language model context, fine-tuning adapts a pre-trained model (BERT, GPT-2, LLaMA) to a downstream task using a comparatively small labeled dataset. The entire weight space was shaped during pre-training; fine-tuning nudges those weights toward the task distribution.
Full Fine-tuning
Update all parameters. Expensive, requires significant GPU memory proportional to model size. Risk of catastrophic forgetting. Use when task data is large and task is very different from pre-training.
LoRA (Low-Rank Adaptation)
Freeze the model. Inject trainable low-rank matrices into attention layers. Typically trains 0.1–1% of parameters. A widely used default for LLM adaptation. Gradients and optimizer state are tiny, though the frozen base weights must still fit in memory.
Prompt Tuning / Prefix
Freeze the entire model. Only train a small set of "soft prompt" tokens prepended to the input. Extreme parameter efficiency but lower ceiling for task-specific adaptation.
python · hugging face transformers
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
)
from datasets import load_dataset
import evaluate
import numpy as np

MODEL_NAME = "bert-base-uncased"

# ── Load tokenizer + model ──
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2
)

# ── Tokenize dataset ──
dataset = load_dataset("imdb")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# ── Define metric ──
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# ── Training arguments ──
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,             # much smaller than training from scratch
    weight_decay=0.01,
    lr_scheduler_type="cosine",     # decay LR over training
    warmup_ratio=0.1,               # warmup for first 10% of steps
    eval_strategy="epoch",          # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,                      # mixed-precision training; requires a CUDA GPU
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,            # renamed processing_class in newer transformers releases
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()
LoRA Fine-tuning (Parameter-Efficient)
python · peft / lora
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                   # rank: higher = more capacity, more params
    lora_alpha=32,                          # scaling factor (alpha/r is the effective scale)
    target_modules=["q_proj", "v_proj"],    # inject into attention Q and V
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)

# Prints trainable vs. total parameter counts; with this config the trainable share
# is well under 1% (roughly 1.7M of about 1.24B parameters)
model.print_trainable_parameters()
🔑
LoRA intuition: Instead of learning a full weight delta ΔW (d×d matrix), LoRA decomposes it as ΔW = A·B where A is d×r and B is r×d, with rank r ≪ d. For d=4096 and r=16 that's 4096² ≈ 16.8M vs 2×4096×16 ≈ 131K parameters, a 128× reduction. Empirically it often comes close to full fine-tuning performance.
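The arithmetic behind the low-rank trick is worth checking once by hand. In this sketch the factor names `up` (d_out × r) and `down` (r × d_in) are neutral labels chosen here, and the tiny 4×4 example uses made-up numbers.

```python
def lora_param_counts(d_in, d_out, r):
    """Parameters in a full d_out x d_in update vs. its rank-r factorization."""
    full = d_in * d_out
    low_rank = r * (d_in + d_out)
    return full, low_rank, full / low_rank

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

full, low_rank, ratio = lora_param_counts(4096, 4096, 16)
print(full, low_rank, ratio)

# Tiny rank-1 example: 8 trainable numbers expand into a 4x4 weight update
up = [[1.0], [2.0], [0.0], [-1.0]]      # 4 x 1
down = [[0.5, 0.0, 1.0, 2.0]]           # 1 x 4
delta_w = matmul(up, down)              # 4 x 4; every row is a multiple of `down`
print(delta_w)
```

Because only the two small factors receive gradients, optimizer state shrinks by the same ratio as the parameter count.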
📉
Training Loop Essentials
The machinery behind learning — loss, optimizers, schedulers
python · complete pytorch training loop
from torch.utils.data import DataLoader
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, n_epochs=10, device='cuda'):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, weight_decay=0.01
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-4,
        steps_per_epoch=len(train_loader),
        epochs=n_epochs,
    )
    criterion = nn.CrossEntropyLoss()
    # Loss scaling for fp16 (newer PyTorch releases spell this torch.amp.GradScaler('cuda'))
    scaler = torch.cuda.amp.GradScaler()
    best_val_acc, patience_counter = 0.0, 0

    for epoch in range(n_epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)   # frees grads instead of zero-filling them
            with torch.autocast(device_type='cuda', dtype=torch.float16):
                logits = model(x)
                loss = criterion(logits, y)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)              # unscale so clipping sees true gradient norms
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()                        # OneCycleLR steps once per batch
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                preds = model(x).argmax(dim=1)
                correct += (preds == y).sum().item()
                total += y.size(0)
        val_acc = correct / total
        print(f"Epoch {epoch+1} | loss {train_loss/len(train_loader):.4f} | val_acc {val_acc:.4f}")

        # ── Early stopping ──
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best.pt')
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= 5:
                break
Component | Common Choice | When to Change
Optimizer | AdamW | SGD+momentum for large vision models; Adafactor for huge LLMs (saves memory)
Learning rate | 3e-4 (new), 2e-5 (fine-tune) | Use an LR finder or a cosine schedule with warmup; always schedule
Loss (classification) | CrossEntropyLoss | Label smoothing (0.1) for overconfident models; focal loss for class imbalance
Loss (regression) | MSELoss | Huber loss (SmoothL1) when outliers are present
Scheduler | OneCycleLR | Cosine annealing with warmup for Transformer fine-tuning
Gradient clipping | max_norm=1.0 | Always apply for RNNs and Transformers; reduces instability
Mixed precision | torch.autocast + GradScaler | Enable by default on GPUs with tensor cores: often up to ~2× speedup and lower activation memory. GradScaler is only needed for fp16, not bf16
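The gradient clipping row refers to clipping by global norm: all gradients are rescaled together so that their combined L2 norm does not exceed max_norm. The pure-Python sketch below shows the idea on a flat list of numbers; the small eps in the denominator follows the pattern PyTorch's clip_grad_norm_ uses, and the example values are made up.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale grads so their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)
print(untouched)
```

Clipping preserves the direction of the update and only limits its length, which is why it tames occasional spikes without biasing ordinary steps.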
🛡️
Regularization
Preventing overfitting — the most common real-world challenge
Dropout
Randomly zero out activations during training with probability p. Forces the network to not rely on any single feature. Can be viewed as approximately training an ensemble of up to 2ⁿ weight-sharing sub-networks. Disabled at inference (model.eval() handles this automatically).
nn.Dropout(p=0.5) — typical; use 0.1–0.2 in Transformers
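What nn.Dropout does can be sketched in a few lines. This is the "inverted dropout" formulation, in which surviving activations are scaled by 1/(1-p) during training so that nothing needs rescaling at inference; the function below is an illustrative stand-in, not PyTorch's implementation.

```python
import random

def dropout(xs, p=0.5, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(xs)                     # inference: identity
    keep = 1.0 - p
    # Keep each unit with probability `keep`, scaling survivors to preserve the mean
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

rng = random.Random(0)                      # seeded for reproducibility
activations = [1.0] * 100_000
dropped = dropout(activations, p=0.5, rng=rng)

print(sum(1 for v in dropped if v == 0.0) / len(dropped))   # fraction zeroed, close to p
print(sum(dropped) / len(dropped))                          # mean, close to 1.0
print(dropout([1.0, 2.0], training=False))                  # unchanged at inference
```

The rescaling keeps the expected activation the same in training and evaluation, which is why forgetting model.eval() mainly shows up as noisy predictions rather than a systematic shift.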
Batch Normalization
Normalize layer inputs to zero mean, unit variance per-batch, then learn scale/shift. Originally motivated as reducing internal covariate shift (later work attributes much of the benefit to a smoother loss landscape). Enables higher learning rates. A form of regularization through noise injection (batch statistics vary). Standard in CNNs; replaced by Layer Norm in Transformers.
Weight Decay (L2)
Add penalty λ||W||² to the loss, shrinking weights toward zero. In AdamW, this is decoupled from the adaptive learning rate — use AdamW, not Adam + L2. Typical: weight_decay=0.01.
Data Augmentation
Artificially expand training data by applying label-preserving transforms. For vision: random crop, flip, color jitter, mixup, cutmix. For text: back-translation, random deletion/swap, paraphrase generation. Most effective regularizer when data is scarce.
✅
Regularization priority: (1) Get more data / augment first. (2) Reduce model capacity (fewer layers / filters). (3) Add dropout. (4) Add weight decay. Never add all at once — you won't know what helped. And always monitor train vs. validation loss to diagnose overfitting vs. underfitting.
🗒️
Cheat Sheet
Quick reference for architecture selection and hyperparameter defaults
Batch size: Larger batches → more stable gradient estimates but a tendency toward sharper minima that can generalize worse; use gradient accumulation if OOM
Dropout (CNN) | 0.3–0.5 | Applied before final FC layer
Dropout (Transformer) | 0.1 | Applied to attention weights and FFN
Weight decay | 0.01 | Applies to weights, not biases/norms
Gradient clip | 1.0 | Essential for RNNs, recommended for Transformers
Warmup steps | 5–10% of total steps | Crucial for large models to avoid early instability
LoRA rank r | 8–16 | Start at 8; increase if performance plateaus
Debugging Loss Curves
Overfitting
Training loss ↓, validation loss ↑ (diverges after some point)
Fix: more data, augmentation, dropout, weight decay, reduce model size, early stopping
Underfitting
Both train and val loss stay high, don't converge
Fix: larger model, more epochs, higher learning rate, less regularization
Exploding Gradients
Loss becomes NaN or jumps wildly
Fix: gradient clipping (max_norm=1.0), lower LR, use mixed precision carefully
Loss Not Moving
Loss flat from epoch 1
Fix: check LR (too small), check data pipeline (all same class?), verify model.train() is called, check gradient flow with param.grad
📚
Essential resources: Andrej Karpathy's "Neural Networks: Zero to Hero" series (builds everything from scratch) · fast.ai Practical Deep Learning · Dive into Deep Learning (d2l.ai) · HuggingFace course (free) · Stanford CS231n (vision) and CS224n (NLP).