Deep Learning Handbook


Essential architectures and concepts every ML practitioner must know before diving into large language models — CNNs, RNNs, LSTMs, Transformers, attention, embeddings, transfer learning, and fine-tuning.

PyTorch TensorFlow / Keras Transformers CNNs RNNs · LSTMs Attention Embeddings Transfer Learning
🧠

Why Deep Learning Before LLMs

The conceptual stack — what you need to understand first

Large language models like GPT-4 or Claude are not magic — they are Transformer architectures trained at scale. Understanding that architecture, and the progression from basic neural networks to CNNs to RNNs to the Transformer, gives you a working mental model for everything in modern AI. Without it, fine-tuning becomes cargo-culting and debugging becomes guesswork.

Layers & Parameters

Deep networks are function compositions. Each layer learns a representation. More layers → more abstract features. Parameters are learned via backpropagation + gradient descent.
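To make the composition-plus-backpropagation idea concrete, here is a minimal pure-Python sketch of a two-layer scalar network trained with hand-derived gradients. The network shape, the single training pair, the initial weights, and the learning rate are all illustrative assumptions rather than values from any real model.

```python
import math

def forward(w1, w2, x):
    """Two composed layers: hidden = tanh(w1 * x), output = w2 * hidden."""
    hidden = math.tanh(w1 * x)
    return hidden, w2 * hidden

def train(x=1.0, y=0.5, w1=0.5, w2=0.5, lr=0.1, steps=200):
    """Fit one (x, y) pair with squared error; returns the loss history."""
    losses = []
    for _ in range(steps):
        hidden, y_hat = forward(w1, w2, x)
        err = y_hat - y
        losses.append(err ** 2)
        # Backpropagation: chain rule applied from the loss back to each weight
        d_yhat = 2.0 * err                          # dL/dy_hat
        d_w2 = d_yhat * hidden                      # dL/dw2
        d_hidden = d_yhat * w2                      # dL/dhidden
        d_w1 = d_hidden * (1.0 - hidden ** 2) * x   # tanh'(z) = 1 - tanh(z)^2
        # Gradient descent update
        w1 -= lr * d_w1
        w2 -= lr * d_w2
    return losses

losses = train()
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.2e}")
```

Frameworks such as PyTorch automate exactly the `d_*` bookkeeping shown here; the update rule itself stays the same.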

Representations

Deep learning is representation learning. Raw pixels → edges → shapes → objects. Raw tokens → subword embeddings → contextual meaning. Each concept builds on the last.

The Scaling Hypothesis

At sufficient scale, the same architecture — stacked Transformer blocks — handles vision, language, code, and multimodal tasks. Understanding the unit helps you reason about the whole.

📌
Reading order: Frameworks → CNN → RNN → LSTM → Transformer → Embeddings → Attention → Transfer Learning → Fine-tuning. Each section links to the next conceptually.
⚙️

Frameworks: PyTorch vs TensorFlow/Keras

When to use each, and how they compare
⚡ PyTorch
Paradigm: Define-by-Run (dynamic graph)
Best for: Research, custom architectures
Debugging: Native Python, easy pdb/print
Ecosystem: HuggingFace, Lightning, torchvision
Industry adoption: Dominant in research (2024)
🔶 TensorFlow / Keras
Paradigm: Eager execution by default (TF 2.x); static graphs via tf.function
Best for: Production pipelines, mobile (TFLite)
Deployment: TF Serving, TF.js, TPU-native
Keras 3: Multi-backend (TF, JAX, PyTorch)
Industry adoption: Strong in production / GCP

PyTorch — Core Concepts

python · pytorch
import torch
import torch.nn as nn
import torch.optim as optim

# Fall back to CPU when no GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ── Tensors: the fundamental data structure ──
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # shape (2, 2)
x = torch.randn(32, 3, 224, 224)             # batch of 32 RGB images
x = x.to(device)                             # move to GPU if present

# ── Defining a model ──
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = SimpleNet().to(device)

# ── Loss + optimizer ──
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# ── One training step ──
# Random stand-in batch: 32 flattened 28×28 images with labels in 0..9
x_batch = torch.randn(32, 784, device=device)
y_batch = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()               # clear gradients
output = model(x_batch)             # forward pass
loss = criterion(output, y_batch)   # compute loss
loss.backward()                     # backprop → compute gradients
optimizer.step()                    # update weights

TensorFlow / Keras — Core Concepts

python · keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ── Sequential API (simple pipelines) ──
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])

# ── Functional API (multiple inputs/outputs, shared layers) ──
inputs = keras.Input(shape=(784,))
x = layers.Dense(256, activation='relu')(inputs)
outputs = layers.Dense(10, activation='softmax')(x)
model = keras.Model(inputs, outputs)

# ── Compile + train (high-level) ──
# x_train, y_train, x_val, y_val are placeholders for your own arrays
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
model.fit(x_train, y_train, epochs=10, batch_size=32,
          validation_data=(x_val, y_val))

# ── Custom training loop ──
# The loop needs its own loss and optimizer objects
loss_fn = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
Rule of thumb: If you're reading papers and writing custom layers, use PyTorch. If you're deploying to production on GCP or building mobile apps, use TensorFlow. For learning, PyTorch's Pythonic style usually makes debugging easier for beginners.
🔲

Convolutional Neural Networks (CNNs)

Spatial feature extraction — the backbone of computer vision

CNNs exploit spatial locality: instead of connecting every neuron to every pixel (fully connected), a small kernel slides across the input, sharing weights. Weight sharing makes the convolution translation-equivariant (a shifted input produces a correspondingly shifted feature map), pooling adds approximate translation invariance, and the parameter count drops drastically. CNNs are the canonical architecture for images, and their ideas (locality, hierarchy, weight sharing) influenced all subsequent architectures.

🖼️ input: Image (H×W×C)
🔲 conv + relu: Feature Maps
⬇️ pool: Spatial Reduction
🔲 deeper conv: Abstract Features
➡️ flatten: Feature Vector
🎯 fc + softmax: Logits
python · pytorch cnn
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # ── Feature extractor ──
        self.features = nn.Sequential(
            # Conv block 1: 3×3 kernel, 32 filters, same padding
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # (B,3,H,W)→(B,32,H,W)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),                           # halve spatial dims
            # Conv block 2: 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),
            # Conv block 3: 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),                 # fixed output size
        )
        # ── Classifier ──
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # (B,128,4,4) → (B,2048)
            nn.Linear(128 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

# ── Output shape inspection ──
model = ConvNet(num_classes=10)
x = torch.randn(4, 3, 64, 64)   # batch=4, RGB, 64×64
print(model(x).shape)           # → torch.Size([4, 10])
Key Operations
  • Convolution: Slide a small kernel, compute dot products. Detects local patterns (edges, textures, shapes)
  • ReLU: max(0, x) — non-linearity. Without it, stacking layers collapses to a linear function
  • BatchNorm: Normalize activations per-batch. Stabilizes training, acts as regularizer
  • MaxPool: Take the max in a window — reduces spatial size, builds invariance
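The three core operations above can be traced by hand on a tiny input. The following pure-Python sketch is only an illustration: the 5×5 image and the 2×2 kernel are invented for the example, and, as in deep learning libraries, the "convolution" is computed without flipping the kernel (strictly, cross-correlation).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(fmap):
    """Element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling (stride equals window size)."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]

# Toy 5x5 image: dark left columns, bright right columns (a vertical edge)
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-picked kernel that responds to dark-to-bright transitions, left to right
kernel = [[-1, 1],
          [-1, 1]]

features = relu(conv2d_valid(image, kernel))   # 4x4 feature map
pooled = max_pool2d(features)                  # 2x2 after pooling
print(features)
print(pooled)
```

The feature map is non-zero only in the column where the edge sits, and pooling keeps that response while halving the resolution.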
Landmark Architectures
  • LeNet-5 (1998): First successful CNN — digit recognition
  • AlexNet (2012): Ignited the deep learning revolution — won ImageNet by a wide margin
  • ResNet (2015): Skip connections ease gradient flow and address the degradation problem in very deep networks, enabling 100+ layer networks
  • EfficientNet (2019): Compound scaling of depth, width, and resolution; a strong accuracy-per-compute baseline for vision tasks
📐
Output size formula: For a conv layer with input size W, kernel K, padding P, stride S: out = ⌊(W - K + 2P) / S⌋ + 1. With K=3, P=1, S=1 you get "same" padding (output = input size).
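The formula is easy to check with a few lines of Python. This helper is a small sketch (the function name is made up for illustration); it applies to one spatial dimension at a time.

```python
def conv_out_size(w, k, p=0, s=1):
    """Output length along one spatial dimension: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

print(conv_out_size(64, k=3, p=1, s=1))    # 3x3 conv with "same" padding → 64
print(conv_out_size(64, k=2, p=0, s=2))    # 2x2 max pool, stride 2 → 32
print(conv_out_size(224, k=7, p=3, s=2))   # 7x7 conv, stride 2 → 112
```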
🔄

Recurrent Neural Networks (RNNs)

Processing sequences — the idea before Transformers

RNNs process sequences by maintaining a hidden state that is updated at each time step. The same weights are applied at every step (weight sharing over time), making them suitable for text, audio, and time series. The core limitation — the vanishing gradient problem over long sequences — motivated LSTM and eventually the Transformer.

📝 t=1: x₁
🔄 RNN cell: h₁ = f(h₀, x₁)
🔄 RNN cell: h₂ = f(h₁, x₂)
🔄 RNN cell: h₃ = f(h₂, x₃)
🎯 output: ŷ
h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)
h_t = new hidden state  ·  W_hh = recurrent weight  ·  W_xh = input weight  ·  tanh squashes to [-1, 1]
python · pytorch rnn
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,      # input: (batch, seq_len, features)
            nonlinearity='tanh',
            dropout=0.2 if num_layers > 1 else 0,
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0=None):
        # x: (batch, seq_len, input_size)
        out, hn = self.rnn(x, h0)
        # out: (batch, seq_len, hidden)
        # hn:  (num_layers, batch, hidden)
        # Use last time step for classification
        out = self.fc(out[:, -1, :])   # (batch, output_size)
        return out, hn

# Usage
model = VanillaRNN(input_size=128, hidden_size=256, output_size=10)
x = torch.randn(32, 50, 128)   # batch=32, seq_len=50, features=128
out, _ = model(x)
print(out.shape)               # → torch.Size([32, 10])
⚠️
Vanishing gradients: During backpropagation through time (BPTT), gradients are multiplied at every step. For long sequences, they shrink exponentially — the network can't learn long-range dependencies. This is why vanilla RNNs rarely work beyond ~20-50 time steps, and why LSTM was invented.
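The exponential shrinkage can be seen numerically with a one-unit RNN. This is a toy sketch under stated assumptions: a scalar hidden state, a recurrent weight of 0.9, and a constant input of 0.5, none of which come from a real model.

```python
import math

def gradient_through_time(w=0.9, x=0.5, steps=50):
    """Track d h_t / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad, history = 0.0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w * h + x)
        # Chain rule: each step contributes a factor w * (1 - h_t^2)
        grad *= w * (1.0 - h ** 2)
        history.append(grad)
    return history

grads = gradient_through_time()
for t in (1, 5, 10, 20, 50):
    print(f"step {t:2d}: d h_t / d h_0 = {grads[t - 1]:.3e}")
```

Every factor is below 1 here, so the product collapses toward zero; with a recurrent weight well above 1 the same product can instead blow up, which is the exploding-gradient counterpart.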
🔒

Long Short-Term Memory (LSTM)

Gated memory — solving the vanishing gradient problem

LSTM (Hochreiter & Schmidhuber, 1997) adds a cell state, a memory highway that runs straight through the network, controlled by three gates. This lets the network selectively remember or forget information over much longer sequences than a vanilla RNN can handle, which made it the dominant sequence model for a decade before Transformers.

The Three Gates
  • Forget gate (f): What to erase from cell state. f = σ(W_f · [h_{t-1}, x_t] + b_f)
  • Input gate (i): What new information to store. i = σ(W_i · [h_{t-1}, x_t] + b_i)
  • Output gate (o): What part of cell state to expose as hidden state. o = σ(W_o · [h_{t-1}, x_t] + b_o)
Cell State Update
  • Candidate: C̃ = tanh(W_c · [h_{t-1}, x_t] + b_c)
  • Update: C_t = f ⊙ C_{t-1} + i ⊙ C̃
  • Output: h_t = o ⊙ tanh(C_t)
  • σ = sigmoid (values 0-1 act as "how much to let through")
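The gate equations above fit in a few lines when everything is a scalar. The sketch below is illustrative only: the `params` layout and the weight values are hypothetical, chosen so that the forget gate stays open and the input gate stays shut, which lets the cell state carry its value across steps almost unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step. `p` maps each gate name to (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x + b

    f = sigmoid(pre("f"))            # forget gate
    i = sigmoid(pre("i"))            # input gate
    o = sigmoid(pre("o"))            # output gate
    c_tilde = math.tanh(pre("c"))    # candidate cell value
    c = f * c_prev + i * c_tilde     # cell update: keep old memory, add new
    h = o * math.tanh(c)             # exposed hidden state
    return h, c

# Hypothetical weights: biases push the forget gate open and the input gate shut
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 0.0), "c": (0.5, 0.5, 0.0)}

h, c = 0.0, 2.0
for x in [1.0, -1.0, 0.5, 0.0, 2.0]:
    h, c = lstm_step(x, h, c, params)
print(f"cell state after 5 steps: {c:.4f}")
```

Because the cell update is additive rather than a repeated squashing, the stored value survives the five steps; this is the mechanism that eases gradient flow over long spans.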
python · pytorch lstm
import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3 if num_layers > 1 else 0.0,   # inter-layer dropout needs 2+ layers
            bidirectional=True,   # run in both directions, doubles hidden dim
        )
        # bidirectional: hidden = 2 × hidden_size
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len) ← token ids
        embedded = self.dropout(self.embedding(x))   # (B, T, embed_dim)
        # out: (B, T, 2*H) | (h_n, c_n): (layers*2, B, H) each
        out, (h_n, c_n) = self.lstm(embedded)
        # Concatenate last hidden states from both directions
        h_last = torch.cat([h_n[-2], h_n[-1]], dim=1)   # (B, 2*H)
        return self.fc(self.dropout(h_last))

# ── GRU: simpler alternative to LSTM ──
# Merges forget + input into a single update gate. Fewer parameters, often comparable performance.
# Can stand in for nn.LSTM above (sizes below are example values);
# note that nn.GRU returns (out, h_n) with no cell state.
gru = nn.GRU(input_size=128, hidden_size=256, num_layers=2,
             batch_first=True, bidirectional=True)
🔑
LSTM vs GRU: GRU (2014) simplifies LSTM by merging the forget and input gates into a single "update gate" and eliminating the separate cell state. GRU has fewer parameters, trains faster, and performs comparably on many tasks. Both have largely been superseded by Transformers for long-range tasks, but they remain useful for short sequences where step-by-step (causal) processing matters.

Transformers

"Attention Is All You Need" — the architecture powering modern AI

Introduced by Vaswani et al. in 2017, the Transformer replaces recurrence entirely with self-attention: every position in the sequence attends to every other position simultaneously, in parallel. This solved the long-range dependency problem and enabled massive GPU parallelism — the key to scaling.

🔤 input: Tokens
🗂️ embedding: Token + Pos
👁️ multi-head attn: Self-Attention
🔧 feed-forward: FFN × N layers
🎯 projection: Logits
python · pytorch transformer block
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Single Transformer encoder layer (Pre-LN variant)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),   # GELU (not ReLU) is common in modern models
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)
        # Pre-LN: normalize BEFORE attention (more stable than original Post-LN)
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm,
                                attn_mask=mask, need_weights=False)
        x = x + self.dropout(attn_out)   # residual connection
        # Feed-forward sublayer
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

class TransformerLM(nn.Module):
    """Decoder-only language model (GPT-style)."""
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positional embeddings
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos   # (B, T, d_model)
        # Boolean causal mask built on the input's device:
        # True marks future positions that must NOT be attended to
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)
        for block in self.blocks:
            x = block(x, mask=causal_mask)   # each block attends causally
        logits = self.head(self.norm(x))     # (B, T, vocab_size)
        return logits
Encoder vs Decoder vs Encoder-Decoder
  • Encoder only (BERT): Bidirectional. Sees all tokens. Best for classification, NER, QA (extractive)
  • Decoder only (GPT): Causal. Each token only sees past. Best for generation
  • Encoder-Decoder (T5, BART): Input fully attended, then decoded auto-regressively. Best for translation, summarization
Critical Design Choices
  • Positional encoding: Sin/cos (original), learned, RoPE, ALiBi — tells the model token positions
  • Layer normalization: Pre-LN (inside residual) is more stable than original Post-LN
  • Activation: GELU preferred over ReLU in modern models
  • Residual connections: Essential — allow gradients to flow through deep stacks unchanged
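As an illustration of the first design choice in the list, the sin/cos positional encoding from the original Transformer paper can be sketched in pure Python. The function name is made up for the example, and the small `d_model` is only for readability.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Encoding vector for one position: sin on even indices, cos on odd."""
    vec = []
    for i in range(0, d_model, 2):
        # Wavelengths grow geometrically with the dimension index
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

pe0 = sinusoidal_encoding(0, 8)
pe1 = sinusoidal_encoding(1, 8)
print([round(v, 3) for v in pe0])
print([round(v, 3) for v in pe1])
```

These vectors are fixed rather than learned, and they are added to the token embeddings before the first block so that attention, which is otherwise order-agnostic, can tell positions apart.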
🗂️

Embeddings

Mapping discrete symbols into dense vector spaces

An embedding is a learned mapping from a discrete symbol (word, token, user ID, pixel patch) to a dense vector. The embedding space captures semantic relationships: similar items land near each other. This is how neural networks work with discrete data — they turn it into a geometry the network can reason over.

Word2Vec / GloVe (Pre-Transformer Era)
  • Trained by predicting surrounding words (Word2Vec) or global co-occurrence (GloVe)
  • Static: one vector per word, regardless of context
  • Famous property: king - man + woman ≈ queen
  • Still useful as lightweight baselines and for understanding the embedding concept
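The analogy arithmetic can be sketched with hand-made vectors. Everything below is a toy: the two axes (royalty, masculinity) and the coordinates are invented for illustration and are not real Word2Vec or GloVe output, where the dimensions are learned and not individually interpretable.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Invented toy embeddings with axes (royalty, masculinity)
vectors = {
    "king":     (1.0,  1.0),
    "queen":    (1.0, -1.0),
    "man":      (0.0,  1.0),
    "woman":    (0.0, -1.0),
    "prince":   (0.9,  1.0),
    "princess": (0.9, -1.0),
}

def analogy(a, b, c):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z
                   for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))   # → queen
```

Excluding the three input words from the candidates mirrors the usual evaluation convention for analogy tasks.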
Contextual Embeddings (Transformer Era)
  • Each token gets a different vector depending on its context (BERT, GPT)
  • "bank" in "river bank" ≠ "bank" in "bank account"
  • The hidden states at each layer ARE the contextual embeddings
  • Sentence embeddings: pool or use [CLS] token for whole-sequence representation
python · embeddings
import torch
import torch.nn as nn

# ── nn.Embedding: learnable lookup table ──
vocab_size = 30_000
embed_dim = 256
embedding = nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[1, 42, 9, 7]])   # (1, seq_len=4)
vecs = embedding(token_ids)                 # (1, 4, 256)

# ── Positional embeddings: same nn.Embedding, different purpose ──
max_seq_len = 512
pos_embedding = nn.Embedding(max_seq_len, embed_dim)
positions = torch.arange(token_ids.shape[1])   # [0, 1, 2, 3]
pos_vecs = pos_embedding(positions)            # (4, 256)
x = vecs + pos_vecs                            # token + position

# ── Sentence embeddings from HuggingFace ──
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Deep learning is powerful", "Neural networks learn features"]
embeddings = model.encode(sentences)   # (2, 384)

# Cosine similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity

sim = cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {sim[0][0]:.3f}")
🔑
Embeddings are the interface: Every modality needs an embedding — images (patch embeddings in ViT), audio (waveform patches), code tokens, user IDs in recommendation systems. Understanding how embeddings work is the foundation for understanding multimodal models and RAG (retrieval-augmented generation) in LLMs.
👁️

Attention Mechanism

Scaled dot-product attention — the core of Transformers

Attention answers the question: for each position in the output, which positions in the input should I focus on? The Transformer's self-attention computes this as a weighted sum of values, where weights come from comparing a query vector against all key vectors.

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V
Q = Query matrix (what we're looking for)  ·  K = Key matrix (what each position offers)  ·  V = Value matrix (what each position contains)  ·  √d_k = scaling factor that keeps dot products from growing with d_k, which would otherwise saturate the softmax and shrink its gradients
python · scaled dot-product attention from scratch
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    mask:    (seq_len, seq_len); positions where mask == 0 are masked out
             (e.g. torch.tril(torch.ones(T, T)) for a causal mask)
    """
    d_k = Q.shape[-1]
    # (B, H, T, T): affinity scores for every query-key pair
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Causal mask: prevent attending to future tokens
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)       # attention distribution
    return torch.matmul(weights, V), weights  # weighted sum of values

class MultiHeadAttention(nn.Module):
    """Multi-head attention: run attention in parallel over h subspaces."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        # Single weight matrix for Q, K, V (efficient)
        self.W_qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape
        # Project and split into heads
        qkv = self.W_qkv(x).chunk(3, dim=-1)   # 3 × (B, T, C)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
                   for t in qkv]               # (B, H, T, d_k)
        attn_out, weights = scaled_dot_product_attention(Q, K, V, mask)
        # Merge heads back and project
        out = attn_out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_out(out), weights
Why Multi-Head?

Running attention in h parallel heads, each with a lower-dimensional subspace, lets the model attend to different aspects simultaneously — one head might track syntactic relationships, another semantic similarity, another positional proximity. The outputs are concatenated and projected back to the original dimension.

Variants
  • Cross-attention: Q from decoder, K and V from encoder — used in encoder-decoder models for translation
  • Grouped Query Attention (GQA): Fewer K/V heads than Q heads — reduces KV cache memory at inference. Used in Llama 3, Mistral
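  • Flash Attention: IO-aware tiling that computes exact attention without materializing the full T×T score matrix in GPU main memory; reported speedups are roughly 2-4×. It doesn't change the math, only the computation order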
📐
Quadratic complexity: Self-attention costs O(T²·d) time and, when the score matrix is materialized, O(T²) memory per head, because every token attends to every other token. For T=1024 this is fine. For T=100K (long documents), it's a bottleneck. This motivates sparse attention (Longformer), linear attention, and sliding window attention (Mistral).
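A quick back-of-the-envelope calculation shows how fast the score matrix grows. This sketch assumes a naive implementation that stores the full (batch, heads, T, T) matrix in fp16; the head count and batch size are arbitrary example values.

```python
def attention_scores_bytes(seq_len, n_heads=8, batch=1, bytes_per_value=2):
    """Memory for the (batch, heads, T, T) score matrix; 2 bytes per value = fp16."""
    return batch * n_heads * seq_len * seq_len * bytes_per_value

for t in (1_024, 8_192, 100_000):
    gib = attention_scores_bytes(t) / 2**30
    print(f"T={t:>7,}: {gib:,.3f} GiB per layer")
```

Doubling the sequence length quadruples this figure, which is why long-context models rely on kernels or attention patterns that never hold the whole matrix at once.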
♻️

Transfer Learning

Reusing representations learned on large datasets

Training deep models from scratch requires massive datasets and compute. Transfer learning sidesteps this: take a model pre-trained on a large general dataset (ImageNet, The Pile, Common Crawl) and adapt it to your specific task. The model has already learned useful low-level features (edges, textures, grammar, factual knowledge) that transfer.

Feature Extraction (Frozen)

Freeze all pre-trained weights. Only train a new classification head on top. Fast, memory-efficient, works with tiny datasets. Use when your data is very small and reasonably similar to the pre-training domain; frozen features transfer less well when the domain is very different.

Full Fine-tuning

Unfreeze all (or most) layers and continue training on your data, typically with a much smaller learning rate. More expensive but reaches higher accuracy when you have sufficient task data (thousands+ examples).

python · transfer learning with torchvision
import torch
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 5

# ── Load pre-trained ResNet-50 ──
model = models.resnet50(weights='IMAGENET1K_V2')

# ── Strategy 1: Feature extraction (freeze backbone) ──
for param in model.parameters():
    param.requires_grad = False

# Replace final FC layer (the head); new layers are trainable by default, so only this trains
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, NUM_CLASSES),
)

# ── Strategy 2: Fine-tune everything ──
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Differential learning rates: lower lr for backbone, higher for head
optimizer = torch.optim.AdamW([
    {'params': [p for n, p in model.named_parameters() if 'fc' not in n],
     'lr': 1e-5},                                   # backbone: small lr
    {'params': model.fc.parameters(), 'lr': 1e-3},  # head: normal lr
])

# ── Strategy 3: Progressive unfreezing (ULMFiT pattern) ──
# Assumes the backbone was frozen first, as in Strategy 1
def unfreeze_layers(model, n):
    """Unfreeze the last n child modules of model.children()."""
    children = list(model.children())
    for child in children[-n:]:
        for param in child.parameters():
            param.requires_grad = True

unfreeze_layers(model, 3)   # start with the last 3 child modules
Practical transfer learning recipe: (1) Freeze backbone + train head for 2-3 epochs. (2) Unfreeze + fine-tune with 10× smaller LR for 5-10 epochs. (3) Use cosine LR schedule with warmup. In practice this recipe typically outperforms training from scratch when you have fewer than roughly 50K samples.
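The cosine schedule with warmup from step (3) can be written as a plain function of the step index. This is a minimal sketch: the peak learning rate, the 10% warmup fraction, and decaying all the way to zero are assumptions chosen for illustration, and library schedulers may differ in details.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 1_000
for s in (0, 50, 100, 550, 1_000):
    print(f"step {s:4d}: lr = {lr_at(s, total):.2e}")
```

The warmup keeps early updates small while optimizer statistics are still noisy, and the cosine tail lowers the rate smoothly instead of in abrupt steps.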
🎛️

Fine-tuning

Adapting pre-trained language models to specific tasks

In the language model context, fine-tuning adapts a pre-trained model (BERT, GPT-2, LLaMA) to a downstream task using a comparatively small labeled dataset. The entire weight space was shaped during pre-training; fine-tuning nudges those weights toward the task distribution.

Full Fine-tuning

Update all parameters. Expensive, requires significant GPU memory proportional to model size. Risk of catastrophic forgetting. Use when task data is large and task is very different from pre-training.

LoRA (Low-Rank Adaptation)

Freeze the model. Inject trainable low-rank matrices into attention layers. Typically trains 0.1–1% of parameters. Widely used for LLM adaptation. Gradient and optimizer-state memory is small, although the frozen base weights still have to fit in memory.

Prompt Tuning / Prefix

Freeze the entire model. Only train a small set of "soft prompt" tokens prepended to the input. Extreme parameter efficiency but lower ceiling for task-specific adaptation.

python · huggingface fine-tuning (sequence classification)
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding,
)
from datasets import load_dataset
import evaluate
import numpy as np

MODEL_NAME = "bert-base-uncased"

# ── Load tokenizer + model ──
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2
)

# ── Tokenize dataset ──
dataset = load_dataset("imdb")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# ── Define metric ──
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# ── Training arguments ──
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,           # much smaller than training from scratch
    weight_decay=0.01,
    lr_scheduler_type="cosine",   # decay LR over training
    warmup_ratio=0.1,             # warmup for first 10% of steps
    eval_strategy="epoch",        # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,                    # half-precision training; needs a CUDA GPU
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],   # IMDB has no validation split; test stands in here
    tokenizer=tokenizer,              # newer transformers releases call this processing_class
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()

LoRA Fine-tuning (Parameter-Efficient)

python · peft / lora
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                  # rank: higher = more capacity, more params
    lora_alpha=32,                         # scaling factor (alpha/r is the effective scale)
    target_modules=["q_proj", "v_proj"],   # inject into attention Q and V
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable params, all params, and trainable%;
# for this configuration the trainable share is well under 1%
🔑
LoRA intuition: Instead of learning a full weight delta ΔW (d×d matrix), LoRA decomposes it as ΔW = A·B where A is d×r and B is r×d, with rank r ≪ d. For d=4096 and r=16 that's 4096² ≈ 16.8M vs 2×4096×16 ≈ 131K parameters, a 128× reduction. Empirically it often matches or comes close to full fine-tuning performance.
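The parameter arithmetic is simple enough to verify directly. The helper names below are made up for this sketch; the count follows from the two factor shapes, d×r and r×d.

```python
def full_delta_params(d_in, d_out):
    """Parameters in a dense weight update of shape (d_in, d_out)."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """A is (d_in x r) and B is (r x d_out), so the count is r * (d_in + d_out)."""
    return r * (d_in + d_out)

d, r = 4096, 16
full = full_delta_params(d, d)
low_rank = lora_params(d, d, r)
print(f"full: {full:,}  lora: {low_rank:,}  ratio: {full / low_rank:.0f}x")
```

For a square matrix the ratio simplifies to d / (2r), so halving the rank doubles the savings.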
📉

Training Loop Essentials

The machinery behind learning — loss, optimizers, schedulers
python · complete pytorch training loop
from torch.utils.data import DataLoader
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, n_epochs=10, device='cuda'):
    device_type = torch.device(device).type
    use_amp = device_type == 'cuda'   # mixed precision only on GPU

    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, weight_decay=0.01
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-4,
        steps_per_epoch=len(train_loader), epochs=n_epochs,
    )
    criterion = nn.CrossEntropyLoss()
    # Automatic mixed precision; recent PyTorch spells this torch.amp.GradScaler("cuda")
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    best_val_acc, patience_counter = 0, 0

    for epoch in range(n_epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)   # slightly faster than zero_grad()
            with torch.autocast(device_type=device_type, dtype=torch.float16,
                                enabled=use_amp):
                logits = model(x)
                loss = criterion(logits, y)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)   # unscale before clipping
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                preds = model(x).argmax(dim=1)
                correct += (preds == y).sum().item()
                total += y.size(0)
        val_acc = correct / total
        print(f"Epoch {epoch+1} | loss {train_loss/len(train_loader):.4f} | val_acc {val_acc:.4f}")

        # ── Early stopping ──
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best.pt')
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= 5:
                break
Component | Common Choice | When to Change
Optimizer | AdamW | SGD+momentum for large vision models; Adafactor for huge LLMs (saves memory)
Learning rate | 3e-4 (new), 2e-5 (fine-tune) | Use LR finder or cosine schedule with warmup; always schedule
Loss (classification) | CrossEntropyLoss | Label smoothing (0.1) for overconfident models; focal loss for class imbalance
Loss (regression) | MSELoss | Huber loss (SmoothL1) when outliers are present
Scheduler | OneCycleLR | Cosine annealing with warmup for Transformer fine-tuning
Gradient clipping | max_norm=1.0 | Always apply for RNNs and Transformers; reduces instability
Mixed precision | torch.autocast + GradScaler | Use whenever the GPU supports it; often a large speedup and memory saving, but watch for overflow/NaN in fp16
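Gradient clipping by global norm, as listed in the table, rescales all gradients by one shared factor so their direction is preserved. The pure-Python sketch below works on a flat list of scalar gradients for illustration; real frameworks apply the same idea across all parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Norm 5.0 exceeds the limit, so both entries shrink by the same factor
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# Norm 0.5 is already within the limit, so nothing changes
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```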
🛡️

Regularization

Preventing overfitting — the most common real-world challenge
Dropout

Randomly zero out activations during training with probability p. Forces the network to not rely on any single feature. Can be viewed as approximately training an ensemble of up to 2ⁿ weight-sharing subnetworks, where n is the number of droppable units. Disable during inference (model.eval() handles this automatically).

nn.Dropout(p=0.5) — typical; use 0.1–0.2 in Transformers
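The train/inference asymmetry is easiest to see in a from-scratch version. This sketch implements "inverted" dropout, the variant PyTorch uses, where surviving activations are scaled up during training so that nothing needs rescaling at inference; the function itself is illustrative, not a library API.

```python
import random

def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each value with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)          # inference: identity
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)               # fixed seed so the sketch is repeatable
activations = [1.0] * 10_000

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False)

print(f"train mean: {sum(train_out) / len(train_out):.3f}")
print(f"eval mean:  {sum(eval_out) / len(eval_out):.3f}")
```

Both means land near 1.0: the scaling keeps the expected activation unchanged between the two modes.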

Batch Normalization

Normalize layer inputs to zero mean, unit variance per-batch, then learn scale/shift. Originally motivated as reducing internal covariate shift, though later work attributes much of the benefit to a smoother optimization landscape. Enables higher learning rates. A form of regularization through noise injection (batch statistics vary). Standard in CNNs; replaced by Layer Norm in Transformers.

Weight Decay (L2)

Add penalty λ||W||² to the loss, shrinking weights toward zero. In AdamW, this is decoupled from the adaptive learning rate — use AdamW, not Adam + L2. Typical: weight_decay=0.01.
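The difference between L2-in-the-gradient and decoupled decay shows up in a single update step. The sketch below is a simplified scalar version of the first Adam step (zero initial moments, default betas); the learning rate and starting weight are arbitrary example values, and the loss gradient is set to zero so only the regularizer acts.

```python
import math

def adam_step(w, grad, lr=0.1, wd=0.01, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam update (t = 1) for one scalar weight, starting from zero moments."""
    if not decoupled:
        grad = grad + wd * w          # classic L2: the penalty joins the gradient
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1)           # bias correction at t = 1
    v_hat = v / (1 - beta2)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w          # AdamW: shrink the weight directly
    return w_new

# Zero loss gradient isolates the effect of the regularizer alone
w_l2 = adam_step(1.0, grad=0.0, decoupled=False)
w_adamw = adam_step(1.0, grad=0.0, decoupled=True)
print(f"Adam + L2: {w_l2:.4f}   AdamW: {w_adamw:.4f}")
```

With L2 folded into the gradient, Adam's normalization turns the tiny penalty into a full-size step (the weight drops by about lr), whereas the decoupled form shrinks the weight by exactly lr × wd × w, which is the behavior the hyperparameter is meant to control.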

Data Augmentation

Artificially expand training data by applying label-preserving transforms. For vision: random crop, flip, color jitter, mixup, cutmix. For text: back-translation, random deletion/swap, paraphrase generation. Most effective regularizer when data is scarce.

Regularization priority: (1) Get more data / augment first. (2) Reduce model capacity (fewer layers / filters). (3) Add dropout. (4) Add weight decay. Never add all at once — you won't know what helped. And always monitor train vs. validation loss to diagnose overfitting vs. underfitting.
🗒️

Cheat Sheet

Quick reference for architecture selection and hyperparameter defaults

Architecture Selection

Task | Architecture | Starting Point
Image classification | CNN → ResNet/EfficientNet | torchvision.models.resnet50(weights='IMAGENET1K_V2')
Object detection | CNN backbone + head | YOLOv8, DETR, Faster R-CNN
Text classification | Transformer encoder | bert-base-uncased via HuggingFace
Text generation | Transformer decoder | gpt2 or Llama-3.2-1B via HuggingFace
Seq-to-seq (translation, summarization) | Encoder-decoder Transformer | t5-base or facebook/bart-large-cnn
Sentence embeddings | BERT-style + pooling | sentence-transformers/all-MiniLM-L6-v2
Time series (short) | LSTM / GRU | nn.LSTM(input_size, hidden, num_layers=2, batch_first=True)
Time series (long-range) | Temporal Transformer | Temporal Fusion Transformer, PatchTST

Hyperparameter Defaults

Hyperparameter | Default | Notes
Learning rate (from scratch) | 1e-3 to 3e-4 | Use LR warmup + cosine decay
Learning rate (fine-tuning) | 1e-5 to 5e-5 | 10-100× smaller than from-scratch
Batch size | 32 (vision), 16 (language) | Larger batches → more stable gradient estimates, but they can converge to sharper minima that generalize worse; use gradient accumulation if OOM
Dropout (CNN) | 0.3–0.5 | Applied before final FC layer
Dropout (Transformer) | 0.1 | Applied to attention weights and FFN
Weight decay | 0.01 | Applies to weights, not biases/norms
Gradient clip | 1.0 | Essential for RNNs, recommended for Transformers
Warmup steps | 5–10% of total steps | Crucial for large models to avoid early instability
LoRA rank r | 8–16 | Start at 8; increase if performance plateaus

Debugging Loss Curves

Overfitting
  • Training loss ↓, validation loss ↑ (diverges after some point)
  • Fix: more data, augmentation, dropout, weight decay, reduce model size, early stopping
Underfitting
  • Both train and val loss stay high, don't converge
  • Fix: larger model, more epochs, higher learning rate, less regularization
Exploding Gradients
  • Loss becomes NaN or jumps wildly
  • Fix: gradient clipping (max_norm=1.0), lower LR, use mixed precision carefully
Loss Not Moving
  • Loss flat from epoch 1
  • Fix: check LR (too small), check data pipeline (all same class?), verify model.train() is called, check gradient flow with param.grad
📚
Essential resources: Andrej Karpathy's "Neural Networks: Zero to Hero" series (builds everything from scratch) · fast.ai Practical Deep Learning · Dive into Deep Learning (d2l.ai) · HuggingFace course (free) · Stanford CS231n (vision) and CS224n (NLP).