Pre-LLM Foundations

Deep Learning Handbook

Essential architectures and concepts every ML practitioner must know before diving into large language models — CNNs, RNNs, LSTMs, Transformers, attention, embeddings, transfer learning, and fine-tuning.

PyTorch · TensorFlow / Keras · Transformers · CNNs · RNNs · LSTMs · Attention · Embeddings · Transfer Learning

🧠

Why Deep Learning Before LLMs

The conceptual stack — what you need to understand first

Large language models like GPT-4 or Claude are not magic — they are Transformer architectures trained at scale. Understanding that architecture, and the progression from basic neural networks to CNNs to RNNs to the Transformer, gives you a working mental model for everything in modern AI. Without it, fine-tuning becomes cargo-culting and debugging becomes guesswork.

Layers & Parameters

Deep networks are function compositions. Each layer learns a representation. More layers → more abstract features. Parameters are learned via backpropagation + gradient descent.
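As a minimal sketch of "function composition plus backpropagation", the pure-Python example below composes two scalar layers, applies the chain rule by hand, and takes one gradient-descent step. The two-weight network, the input, the target, and the learning rate are all made-up illustration values.

```python
# Toy two-layer network: y_hat = w2 * relu(w1 * x). All values are illustrative.

def relu(z):
    return max(0.0, z)

def loss_and_grads(w1, w2, x, y):
    z = w1 * x                 # layer 1: linear
    h = relu(z)                # layer 1: non-linearity
    y_hat = w2 * h             # layer 2: linear
    loss = 0.5 * (y_hat - y) ** 2
    # Backpropagation: chain rule applied from the loss back to each weight
    d_yhat = y_hat - y                       # dL/dy_hat
    d_w2 = d_yhat * h                        # dL/dw2
    d_h = d_yhat * w2                        # dL/dh
    d_z = d_h * (1.0 if z > 0 else 0.0)      # gradient through ReLU
    d_w1 = d_z * x                           # dL/dw1
    return loss, d_w1, d_w2

w1, w2, x, y, lr = 0.5, -1.0, 2.0, 1.0, 0.1
loss_before, g1, g2 = loss_and_grads(w1, w2, x, y)
w1, w2 = w1 - lr * g1, w2 - lr * g2          # one gradient-descent update
loss_after, _, _ = loss_and_grads(w1, w2, x, y)
print(loss_before, loss_after)               # the loss drops after the update
```

Frameworks such as PyTorch automate exactly the `d_*` bookkeeping above; the update rule stays the same.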

Representations

Deep learning is representation learning. Raw pixels → edges → shapes → objects. Raw tokens → subword embeddings → contextual meaning. Each concept builds on the last.

The Scaling Hypothesis

At sufficient scale, the same architecture — stacked Transformer blocks — handles vision, language, code, and multimodal tasks. Understanding the unit helps you reason about the whole.

📌

Reading order: Frameworks → CNN → RNN → LSTM → Transformer → Embeddings → Attention → Transfer Learning → Fine-tuning. Each section links to the next conceptually.

⚙️

Frameworks: PyTorch vs TensorFlow/Keras

When to use each, and how they compare

⚡ PyTorch

Paradigm: Define-by-Run (dynamic graph)
Best for: Research, custom architectures
Debugging: Native Python, easy pdb/print
Ecosystem: HuggingFace, Lightning, torchvision
Industry adoption: Dominant in research (2024)

🔶 TensorFlow / Keras

Paradigm: Eager execution by default (TF 2.x); static graphs via tf.function
Best for: Production pipelines, mobile (TFLite)
Deployment: TF Serving, TF.js, TPU-native
Keras 3: Multi-backend (TF, JAX, PyTorch)
Industry adoption: Strong in production / GCP

PyTorch — Core Concepts

python · pytorch

import torch
import torch.nn as nn
import torch.optim as optim

# ── Tensors: the fundamental data structure ──
device = 'cuda' if torch.cuda.is_available() else 'cpu'   # fall back to CPU when no GPU is present
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # shape (2,2)
x = torch.randn(32, 3, 224, 224)             # batch of 32 RGB images
x = x.to(device)                             # move to GPU if available

# ── Defining a model ──
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = SimpleNet().to(device)

# ── Loss + optimizer ──
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# ── One training step (random stand-in batch: 32 flattened 28×28 images) ──
x_batch = torch.randn(32, 784, device=device)
y_batch = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()               # clear gradients
output = model(x_batch)              # forward pass
loss = criterion(output, y_batch)    # compute loss
loss.backward()                       # backprop → compute gradients
optimizer.step()                      # update weights

TensorFlow / Keras — Core Concepts

python · keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# ── Sequential API (simple pipelines) ──
model = keras.Sequential([
    keras.Input(shape=(784,)),               # explicit Input layer instead of input_shape=
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10,  activation='softmax'),
])

# ── Functional API (multiple inputs/outputs, shared layers) ──
inputs  = keras.Input(shape=(784,))
x       = layers.Dense(256, activation='relu')(inputs)
outputs = layers.Dense(10,  activation='softmax')(x)
model   = keras.Model(inputs, outputs)

# ── Compile + train (high-level) ──
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
# x_train, y_train, x_val, y_val: your own arrays, e.g. (N, 784) floats and (N,) integer labels
model.fit(x_train, y_train, epochs=10, batch_size=32,
         validation_data=(x_val, y_val))

# ── Custom training loop ──
loss_fn   = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.optimizers.Adam(learning_rate=1e-3)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

✅

Rule of thumb: If you're reading papers and writing custom layers, use PyTorch. If you're deploying to production on GCP or building mobile apps, use TensorFlow. For learning purposes, PyTorch's Pythonic style makes debugging far easier for beginners.

🔲

Convolutional Neural Networks (CNNs)

Spatial feature extraction — the backbone of computer vision

CNNs exploit spatial locality: instead of connecting every neuron to every pixel (fully connected), a small kernel slides across the input, sharing weights. Weight sharing makes convolution translation-equivariant (a shifted input produces a correspondingly shifted feature map), pooling adds a degree of translation invariance, and the parameter count drops drastically. CNNs are the canonical architecture for images, and their ideas (locality, hierarchy, weight sharing) influenced all subsequent architectures.
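To put a number on the parameter savings, the sketch below counts weights for a fully connected layer and a convolutional layer applied to the same 224×224 RGB input. The helper names and the choice of 32 units / 32 filters are assumptions made for the comparison.

```python
def fc_params(h, w, c_in, units):
    """Fully connected layer: every input value connects to every unit."""
    return h * w * c_in * units + units          # weights + biases

def conv_params(k, c_in, c_out):
    """Conv layer: one k×k×c_in kernel per output channel, plus one bias each."""
    return k * k * c_in * c_out + c_out

fc = fc_params(224, 224, 3, 32)     # 32 dense units on a 224×224 RGB image
conv = conv_params(3, 3, 32)        # 32 filters of size 3×3
print(fc, conv)                     # 4816928 896
```

The convolutional count does not depend on the image size at all, because the same kernel is reused at every spatial position.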

input: Image (H×W×C) → conv + relu: Feature Maps → pool: Spatial Reduction → deeper conv: Abstract Features → flatten: Feature Vector → fc + softmax: Logits

python · pytorch cnn

import torch
import torch.nn as nn

class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        # ── Feature extractor ──
        self.features = nn.Sequential(
            # Conv block 1: 3×3 kernel, 32 filters, same padding
            nn.Conv2d(3, 32, kernel_size=3, padding=1), # (B,3,H,W)→(B,32,H,W)
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),               # halve spatial dims

            # Conv block 2: 64 filters
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2),

            # Conv block 3: 128 filters
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),   # fixed output size
        )

        # ── Classifier ──
        self.classifier = nn.Sequential(
            nn.Flatten(),                    # (B,128,4,4) → (B,2048)
            nn.Linear(128 * 4 * 4, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)

# ── Output shape inspection ──
model = ConvNet(num_classes=10)
x = torch.randn(4, 3, 64, 64)   # batch=4, RGB, 64×64
print(model(x).shape)              # → torch.Size([4, 10])

Key Operations

Convolution: Slide a small kernel, compute dot products. Detects local patterns (edges, textures, shapes)
ReLU: max(0, x) — non-linearity. Without it, stacking layers collapses to a linear function
BatchNorm: Normalize activations per-batch. Stabilizes training, acts as regularizer
MaxPool: Take the max in a window — reduces spatial size, builds invariance
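The claim that stacked layers collapse to a linear function without a non-linearity can be checked with scalars. In this sketch the two "layers" and their weights are invented for illustration.

```python
def linear(w, b):
    return lambda x: w * x + b

layer1 = linear(2.0, 1.0)
layer2 = linear(-3.0, 0.5)

stacked = lambda x: layer2(layer1(x))                  # two linear layers, no activation
collapsed = linear(-3.0 * 2.0, -3.0 * 1.0 + 0.5)       # one equivalent linear layer

relu = lambda z: max(0.0, z)
with_relu = lambda x: layer2(relu(layer1(x)))          # ReLU in between

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
print([stacked(x) == collapsed(x) for x in xs])        # identical everywhere
print([with_relu(x) for x in xs])                      # no longer a straight line
```

With the ReLU in place the output is flat for inputs where the first layer is negative and sloped elsewhere, which a single linear layer cannot reproduce.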

Landmark Architectures

LeNet-5 (1998): First successful CNN — digit recognition
AlexNet (2012): Ignited the deep learning revolution — won ImageNet by a wide margin
ResNet (2015): Skip connections ease gradient flow through very deep stacks (addressing the degradation problem), enabling 100+ layer networks
EfficientNet (2019): Compound scaling of depth, width, and resolution; a strong accuracy-per-compute baseline for vision tasks

📐

Output size formula: For a conv layer with input size W, kernel K, padding P, stride S: out = ⌊(W - K + 2P) / S⌋ + 1. With K=3, P=1, S=1 you get "same" padding (output = input size).
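The formula is easy to wrap in a helper. The sketch below (the function name is an assumption) traces the spatial size through the first two conv/pool blocks of the ConvNet example above for a 64×64 input; max-pooling follows the same formula with K=2, P=0, S=2.

```python
def conv_out_size(w, k, p=0, s=1):
    """Spatial output size: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1

size = 64                                   # 64×64 input
for name, k, p, s in [("conv 3×3, pad 1", 3, 1, 1),
                      ("max-pool 2×2",    2, 0, 2),
                      ("conv 3×3, pad 1", 3, 1, 1),
                      ("max-pool 2×2",    2, 0, 2)]:
    size = conv_out_size(size, k, p, s)
    print(f"{name}: {size}")                # 64, 32, 32, 16
```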

🔄

Recurrent Neural Networks (RNNs)

Processing sequences — the idea before Transformers

RNNs process sequences by maintaining a hidden state that is updated at each time step. The same weights are applied at every step (weight sharing over time), making them suitable for text, audio, and time series. The core limitation — the vanishing gradient problem over long sequences — motivated LSTM and eventually the Transformer.

t=1: x₁ → RNN cell: h₁ = f(h₀, x₁) → RNN cell: h₂ = f(h₁, x₂) → RNN cell: h₃ = f(h₂, x₃) → output: ŷ

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)

h_t = new hidden state · W_hh = recurrent weight · W_xh = input weight · b_h = bias · tanh squashes to (-1, 1)
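The recurrence can be spelled out without any framework. This is a minimal sketch with hypothetical weights (hidden size 2, input size 1), intended only to show that the same weights are reused at every time step.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One step of h_t = tanh(W_hh · h_{t-1} + W_xh · x_t + b_h)."""
    h_t = []
    for i in range(len(b_h)):
        z = b_h[i]
        z += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        z += sum(W_xh[i][j] * x_t[j] for j in range(len(x_t)))
        h_t.append(math.tanh(z))
    return h_t

# Hypothetical weights: hidden size 2, input size 1
W_hh = [[0.5, 0.0], [0.0, -0.5]]
W_xh = [[1.0], [1.0]]
b_h = [0.0, 0.0]

h = [0.0, 0.0]                               # h_0
for x_t in ([1.0], [0.0], [-1.0]):           # a length-3 sequence
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)    # same weights at every step
    print(h)
```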

python · pytorch rnn

import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, num_layers=1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers  = num_layers

        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # input: (batch, seq_len, features)
            nonlinearity='tanh',
            dropout=0.2 if num_layers > 1 else 0,
        )
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x, h0=None):
        # x: (batch, seq_len, input_size)
        out, hn = self.rnn(x, h0)   # out: (batch, seq_len, hidden)
                                        # hn:  (num_layers, batch, hidden)

        # Use last time step for classification
        out = self.fc(out[:, -1, :])  # (batch, output_size)
        return out, hn

# Usage
model = VanillaRNN(input_size=128, hidden_size=256, output_size=10)
x = torch.randn(32, 50, 128)   # batch=32, seq_len=50, features=128
out, _ = model(x)
print(out.shape)                  # → torch.Size([32, 10])

⚠️

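Vanishing gradients: During backpropagation through time (BPTT), the gradient is multiplied by the recurrent Jacobian at every step. When those factors are smaller than 1 the gradient shrinks exponentially with sequence length (and when they are larger than 1 it explodes), so the network struggles to learn long-range dependencies. In practice vanilla RNNs rarely work well beyond roughly 20-50 time steps, which is why LSTM was invented.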
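The effect is visible even in a one-unit RNN. The sketch below assumes a scalar recurrence h_t = tanh(w·h_{t-1} + x_t) with an illustrative weight w = 0.9 and a constant input, and multiplies the per-step derivative factors exactly as BPTT would.

```python
import math

def grad_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)      # chain rule: one factor per time step
    return grad

# Constant input of 0.5 at every step, purely illustrative
grads = {T: grad_through_time(0.9, [0.5] * T) for T in (5, 20, 50)}
for T, g in grads.items():
    print(T, g)                        # magnitude collapses as T grows
```

Each factor here has magnitude below 0.9, so the product is bounded by 0.9^T; with larger recurrent weights the same product can instead grow without bound.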

🔒

Long Short-Term Memory (LSTM)

Gated memory — solving the vanishing gradient problem

LSTM (Hochreiter & Schmidhuber, 1997) adds a cell state, a memory highway that runs straight through the network, controlled by three gates. Because the cell state is updated additively, the network can selectively remember or forget information over much longer sequences than a vanilla RNN, which made LSTMs the dominant sequence model until Transformers arrived.

The Three Gates

Forget gate (f): What to erase from cell state. f = σ(W_f · [h_{t-1}, x_t] + b_f)
Input gate (i): What new information to store. i = σ(W_i · [h_{t-1}, x_t] + b_i)
Output gate (o): What part of cell state to expose as hidden state. o = σ(W_o · [h_{t-1}, x_t] + b_o)

Cell State Update

Candidate: C̃ = tanh(W_c · [h_{t-1}, x_t] + b_c)
Update: C_t = f ⊙ C_{t-1} + i ⊙ C̃
Output: h_t = o ⊙ tanh(C_t)
σ = sigmoid (values 0-1 act as "how much to let through")
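The gate equations above can be traced by hand with scalars. In this sketch the parameter values are hypothetical and chosen so that the forget gate is almost fully open and the input gate almost closed, which makes the cell state behave as long-lived memory.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x_t, p):
    """One scalar LSTM step; p maps a block name to (w_h, w_x, b)."""
    def pre(name):
        w_h, w_x, b = p[name]
        return w_h * h_prev + w_x * x_t + b
    f = sigmoid(pre("f"))              # forget gate
    i = sigmoid(pre("i"))              # input gate
    o = sigmoid(pre("o"))              # output gate
    c_tilde = math.tanh(pre("c"))      # candidate
    c_t = f * c_prev + i * c_tilde     # cell state update
    h_t = o * math.tanh(c_t)           # exposed hidden state
    return h_t, c_t

# Hypothetical parameters: forget gate saturated open, input gate nearly shut
params = {"f": (0.0, 0.0, 10.0), "i": (0.0, 0.0, -10.0),
          "o": (0.0, 0.0, 0.0),  "c": (0.0, 1.0, 0.0)}

h, c = 0.0, 1.0
for x_t in [0.3, -0.8, 0.5]:
    h, c = lstm_step(h, c, x_t, params)
print(c)   # stays close to the initial 1.0 despite the changing inputs
```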

python · pytorch lstm

import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(
            input_size=embed_dim,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=0.3,
            bidirectional=True,   # run in both directions — doubles hidden dim
        )
        # bidirectional: hidden = 2 × hidden_size
        self.dropout = nn.Dropout(0.4)
        self.fc      = nn.Linear(hidden_size * 2, num_classes)

    def forward(self, x):
        # x: (batch, seq_len)  ←  token ids
        embedded = self.dropout(self.embedding(x))   # (B, T, embed_dim)

        # out: (B, T, 2*H)  |  (h_n, c_n): (layers*2, B, H) each
        out, (h_n, c_n) = self.lstm(embedded)

        # Concatenate last hidden states from both directions
        h_last = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (B, 2*H)
        return self.fc(self.dropout(h_last))

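# ── GRU: simpler alternative to LSTM ──
# Merges forget + input into a single update gate. Fewer parameters, often comparable performance.
# Standalone example with made-up sizes; inside a model, assign it to self.gru in __init__.
gru = nn.GRU(input_size=128, hidden_size=256,
             num_layers=2, batch_first=True, bidirectional=True)
out, h_n = gru(torch.randn(4, 20, 128))   # out: (4, 20, 512) | h_n: (4, 4, 256), no cell state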

🔑

LSTM vs GRU: GRU (2014) simplifies LSTM by merging the forget and input gates into a single "update gate" and eliminating the separate cell state. GRU has fewer parameters, usually trains faster, and performs comparably on many tasks. Both have largely been superseded by Transformers for long-range tasks, but they remain useful for short sequences, streaming inputs, and small-footprint models where step-by-step causal processing matters.
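The "fewer parameters" claim follows from counting weight blocks. The sketch below assumes one weight matrix over [h, x] and one bias vector per block, as in the equations in this section; PyTorch's `nn.LSTM` and `nn.GRU` keep two bias vectors per block, so their reported totals are slightly higher, but the 4-to-3 ratio is the same.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Parameters for n_blocks weight blocks, each W · [h, x] + b."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

lstm = gated_rnn_params(128, 256, n_blocks=4)   # forget, input, output, candidate
gru = gated_rnn_params(128, 256, n_blocks=3)    # update, reset, candidate
print(lstm, gru, gru / lstm)                    # 394240 295680 0.75
```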

⚡

Transformers

"Attention Is All You Need" — the architecture powering modern AI

Introduced by Vaswani et al. in 2017, the Transformer replaces recurrence entirely with self-attention: every position in the sequence attends to every other position in a single parallel step. Any two tokens are therefore one attention hop apart, which greatly eases the long-range dependency problem, and the whole sequence can be processed in parallel on GPUs during training, which is the key to scaling.

input: Tokens → embedding: Token + Pos → multi-head attn: Self-Attention → feed-forward: FFN (× N layers) → projection: Logits

python · pytorch transformer block

import torch
import torch.nn as nn
import math

class TransformerBlock(nn.Module):
    """Single Transformer encoder layer (Pre-LN variant)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn    = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff      = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),              # GELU (not ReLU) used in most modern models
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
        )
        self.norm1   = nn.LayerNorm(d_model)
        self.norm2   = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # x: (batch, seq_len, d_model)

        # Pre-LN: normalize BEFORE attention (more stable than original Post-LN)
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, attn_mask=mask)
        x = x + self.dropout(attn_out)   # residual connection

        # Feed-forward sublayer
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x

class TransformerLM(nn.Module):
    """Decoder-only language model (GPT-style)."""
    def __init__(self, vocab_size, d_model=512, n_layers=6, n_heads=8, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positional embeddings
        self.blocks  = nn.ModuleList([TransformerBlock(d_model, n_heads) for _ in range(n_layers)])
        self.norm    = nn.LayerNorm(d_model)
        self.head    = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos                         # (B, T, d_model)

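        # Boolean causal mask on the same device as the input:
        # True above the diagonal = future position that may NOT be attended to
        causal_mask = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=idx.device), diagonal=1)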
        for block in self.blocks:
            x = block(x, mask=causal_mask)   # each block attends causally

        logits = self.head(self.norm(x))     # (B, T, vocab_size)
        return logits

Encoder vs Decoder vs Encoder-Decoder

Encoder only (BERT): Bidirectional. Sees all tokens. Best for classification, NER, QA (extractive)
Decoder only (GPT): Causal. Each token only sees past. Best for generation
Encoder-Decoder (T5, BART): Input fully attended, then decoded auto-regressively. Best for translation, summarization

Critical Design Choices

Positional encoding: Sin/cos (original), learned, RoPE, ALiBi — tells the model token positions
Layer normalization: Pre-LN (normalizing inside the residual branch, before each sublayer) trains more stably than the original Post-LN
Activation: GELU preferred over ReLU in modern models
Residual connections: Essential — allow gradients to flow through deep stacks unchanged
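Of the positional-encoding options listed above, the original sin/cos scheme is simple enough to write out directly. This is a minimal sketch of the formula from the 2017 paper (even dimensions use sine, odd dimensions cosine, with wavelengths scaled by 10000^(2i/d_model)); the function name is an assumption.

```python
import math

def sinusoidal_position(pos, d_model):
    """Return the d_model-dimensional sin/cos encoding of one position."""
    vec = []
    for i in range(0, d_model, 2):               # i is the even dimension index
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))              # dimension i
        vec.append(math.cos(angle))              # dimension i + 1
    return vec[:d_model]

pe0 = sinusoidal_position(0, 8)
pe5 = sinusoidal_position(5, 8)
print(pe0)   # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print(pe5)
```

Unlike a learned `nn.Embedding` for positions, this encoding has no parameters and is defined for any position index.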

🗂️

Embeddings

Mapping discrete symbols into dense vector spaces

An embedding is a learned mapping from a discrete symbol (word, token, user ID, pixel patch) to a dense vector. The embedding space captures semantic relationships: similar items land near each other. This is how neural networks work with discrete data — they turn it into a geometry the network can reason over.

Word2Vec / GloVe (Pre-Transformer Era)

Trained by predicting surrounding words (Word2Vec) or global co-occurrence (GloVe)
Static: one vector per word, regardless of context
Famous property: king - man + woman ≈ queen
Still useful as lightweight baselines and for understanding the embedding concept
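The analogy property can be illustrated with hand-made vectors. The five 2-D vectors below are invented for this sketch (they are not real Word2Vec or GloVe output); axis 0 loosely stands for "royalty" and axis 1 for "gender".

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hand-made toy embeddings, not trained vectors
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

# king - man + woman
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity (query words excluded)
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)   # queen
```

Real embeddings have hundreds of dimensions and the analogy holds only approximately, which is why the nearest neighbour is searched for rather than tested for equality.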

Contextual Embeddings (Transformer Era)

Each token gets a different vector depending on its context (BERT, GPT)
"bank" in "river bank" ≠ "bank" in "bank account"
The hidden states at each layer ARE the contextual embeddings
Sentence embeddings: pool or use [CLS] token for whole-sequence representation

python · embeddings

import torch
import torch.nn as nn

# ── nn.Embedding: learnable lookup table ──
vocab_size = 30_000
embed_dim  = 256

embedding = nn.Embedding(vocab_size, embed_dim)
token_ids = torch.tensor([[1, 42, 9, 7]])    # (1, seq_len=4)
vecs = embedding(token_ids)                    # (1, 4, 256)

# ── Positional embeddings — same nn.Embedding, different purpose ──
max_seq_len = 512
pos_embedding = nn.Embedding(max_seq_len, embed_dim)
positions = torch.arange(token_ids.shape[1])   # [0, 1, 2, 3]
pos_vecs  = pos_embedding(positions)              # (4, 256)
x = vecs + pos_vecs                                # token + position

# ── Sentence embeddings from HuggingFace ──
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
sentences = ["Deep learning is powerful", "Neural networks learn features"]
embeddings = model.encode(sentences)              # (2, 384)

# Cosine similarity between sentences
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity(embeddings[0:1], embeddings[1:2])
print(f"Similarity: {sim[0][0]:.3f}")

🔑

Embeddings are the interface: Every modality needs an embedding — images (patch embeddings in ViT), audio (waveform patches), code tokens, user IDs in recommendation systems. Understanding how embeddings work is the foundation for understanding multimodal models and RAG (retrieval-augmented generation) in LLMs.

👁️

Attention Mechanism

Scaled dot-product attention — the core of Transformers

Attention answers the question: for each position in the output, which positions in the input should I focus on? The Transformer's self-attention computes this as a weighted sum of values, where weights come from comparing a query vector against all key vectors.

Attention(Q, K, V) = softmax( Q·Kᵀ / √d_k ) · V

Q = Query matrix (what we're looking for) · K = Key matrix (what each position offers) · V = Value matrix (what each position contains) · √d_k = scaling factor that keeps dot products from growing with d_k, which would saturate the softmax and shrink its gradients

python · scaled dot-product attention from scratch

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (batch, heads, seq_len, d_k)
    mask:    (seq_len, seq_len), nonzero/True = may attend, 0/False = masked out
    """
    d_k = Q.shape[-1]

    # (B, H, T, T) — affinity scores for every query-key pair
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)

    # Causal mask: prevent attending to future tokens
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # attention distribution
    return torch.matmul(weights, V), weights   # weighted sum of values


class MultiHeadAttention(nn.Module):
    """Multi-head attention: run attention in parallel over h subspaces."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k    = d_model // n_heads
        self.n_heads= n_heads
        # Single weight matrix for Q, K, V (efficient)
        self.W_qkv  = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_out  = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.shape

        # Project and split into heads
        qkv = self.W_qkv(x).chunk(3, dim=-1)           # 3 × (B, T, C)
        Q, K, V = [t.view(B, T, self.n_heads, self.d_k)
                    .transpose(1, 2) for t in qkv]       # (B, H, T, d_k)

        attn_out, weights = scaled_dot_product_attention(Q, K, V, mask)

        # Merge heads back and project
        out = attn_out.transpose(1, 2).contiguous().view(B, T, C)
        return self.W_out(out), weights

Why Multi-Head?

Running attention in h parallel heads, each with a lower-dimensional subspace, lets the model attend to different aspects simultaneously — one head might track syntactic relationships, another semantic similarity, another positional proximity. The outputs are concatenated and projected back to the original dimension.

Variants

Cross-attention: Q from decoder, K and V from encoder — used in encoder-decoder models for translation
Grouped Query Attention (GQA): Fewer K/V heads than Q heads — reduces KV cache memory at inference. Used in Llama 3, Mistral
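Flash Attention: IO-aware tiling that never materializes the full T×T score matrix in GPU main memory (pieces are recomputed in the backward pass). Typically reported as 2-4× faster, with memory linear in T. Doesn't change the math, only the computation order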
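The KV-cache saving from GQA is straightforward arithmetic. The configuration in this sketch is assumed for illustration (32 layers, head dimension 128, 32 query heads, 8 KV heads under GQA, a 2-byte cache dtype, 8192 cached tokens) and is not taken from any specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    # Two tensors (K and V) per layer, each of shape (batch, n_kv_heads, seq_len, head_dim)
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * bytes_per_value

# Standard multi-head attention: one K/V head per query head
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
# Grouped-query attention: 4 query heads share each K/V head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)

print(mha / 2**30, gqa / 2**30)   # 4.0 1.0  (GiB per sequence)
```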

📐

Quadratic complexity: Self-attention costs O(T²·d) time and O(T²) memory for the attention matrix (per head), because every token attends to every other token. For T=1024 this is fine. For T=100K (long documents), it's a bottleneck. This motivates sparse attention (Longformer), linear attention, and sliding window attention (Mistral).
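A quick calculation shows how fast the T×T score matrix grows. The sketch assumes 2 bytes per score (fp16/bf16) and counts a single head of a single sequence; the helper name is an assumption.

```python
def attn_matrix_bytes(seq_len, n_heads=1, bytes_per_value=2):
    """Memory needed to materialize the (T × T) score matrix for each head."""
    return n_heads * seq_len * seq_len * bytes_per_value

for T in (1024, 100_000):
    mib = attn_matrix_bytes(T) / 2**20
    print(f"T={T}: {mib:,.1f} MiB per head")
```

Going from T=1024 to T=100K multiplies the sequence length by about 98 but the score-matrix memory by about 9,500, from 2 MiB to roughly 18.6 GiB per head.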

♻️

Transfer Learning

Reusing representations learned on large datasets

Training deep models from scratch requires massive datasets and compute. Transfer learning sidesteps this: take a model pre-trained on a large general dataset (ImageNet, The Pile, Common Crawl) and adapt it to your specific task. The model has already learned useful low-level features (edges, textures, grammar, factual knowledge) that transfer.

Feature Extraction (Frozen)

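Freeze all pre-trained weights. Only train a new classification head on top. Fast, memory-efficient, works with tiny datasets. Use when your data is very small and reasonably similar to the pre-training domain; if the domain is very different, frozen features transfer poorly and unfreezing more layers usually helps.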

Full Fine-tuning

Unfreeze all (or most) layers and continue training on your data, typically with a much smaller learning rate. More expensive but reaches higher accuracy when you have sufficient task data (thousands+ examples).

python · transfer learning with torchvision

import torch
import torchvision.models as models
import torch.nn as nn

NUM_CLASSES = 5

# ── Load pre-trained ResNet-50 ──
model = models.resnet50(weights='IMAGENET1K_V2')

# ── Strategy 1: Feature extraction — freeze backbone ──
for param in model.parameters():
    param.requires_grad = False

# Replace final FC layer (the head) — only this trains
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(0.4),
    nn.Linear(256, NUM_CLASSES)
)

# ── Strategy 2: Fine-tune everything ──
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Differential learning rates: lower lr for backbone, higher for head
optimizer = torch.optim.AdamW([
    {'params': [p for n, p in model.named_parameters() if 'fc' not in n],
     'lr': 1e-5},   # backbone: small lr
    {'params': model.fc.parameters(),
     'lr': 1e-3},   # head: normal lr
])

# ── Strategy 3: Progressive unfreezing (ULMFiT pattern) ──
def unfreeze_layers(model, n):
    """Unfreeze the last n top-level children of the model."""
    children = list(model.children())
    for child in children[-n:]:
        for param in child.parameters():
            param.requires_grad = True

for param in model.parameters():   # start from a fully frozen model
    param.requires_grad = False
unfreeze_layers(model, 3)   # ResNet-50: layer4, avgpool, fc; raise n in later epochs

✅

Practical transfer learning recipe: (1) Freeze backbone + train head for 2-3 epochs. (2) Unfreeze + fine-tune with 10× smaller LR for 5-10 epochs. (3) Use cosine LR schedule with warmup. On small datasets (as a rough guide, fewer than ~50K samples) this recipe typically outperforms training from scratch.
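Step (3) of the recipe, a cosine schedule with warmup, can be sketched as a plain function of the step index. The function name, the 10% warmup ratio, and the peak learning rate are illustrative assumptions; framework schedulers implement the same shape with minor variations.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
schedule = [lr_at(s, total, peak_lr=1e-4) for s in range(total)]
print(schedule[0], max(schedule), schedule[-1])   # small start, peak near 1e-4, near zero at the end
```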

🎛️

Fine-tuning

Adapting pre-trained language models to specific tasks

In the language model context, fine-tuning adapts a pre-trained model (BERT, GPT-2, LLaMA) to a downstream task using a comparatively small labeled dataset. The entire weight space was shaped during pre-training; fine-tuning nudges those weights toward the task distribution.

Full Fine-tuning

Update all parameters. Expensive, requires significant GPU memory proportional to model size. Risk of catastrophic forgetting. Use when task data is large and task is very different from pre-training.

LoRA (Low-Rank Adaptation)

Freeze the model. Inject trainable low-rank matrices into attention layers. Trains roughly 0.1–1% of parameters. A widely used default for LLM adaptation. Gradient and optimizer-state memory is tiny, though the frozen base weights must still fit in memory.

Prompt Tuning / Prefix

Freeze the entire model. Only train a small set of "soft prompt" tokens prepended to the input. Extreme parameter efficiency but lower ceiling for task-specific adaptation.

python · huggingface fine-tuning (sequence classification)

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer, DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np

MODEL_NAME = "bert-base-uncased"

# ── Load tokenizer + model ──
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model     = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2
)

# ── Tokenize dataset ──
dataset = load_dataset("imdb")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)

tokenized = dataset.map(preprocess, batched=True)

# ── Define metric ──
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# ── Training arguments ──
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,           # much smaller than training from scratch
    weight_decay=0.01,
    lr_scheduler_type="cosine",   # decay LR over training
    warmup_ratio=0.1,              # warmup for first 10% of steps
    eval_strategy="epoch",         # older transformers releases call this evaluation_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,                     # mixed-precision training; requires a CUDA GPU
)

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,           # newer transformers releases rename this argument to processing_class
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)
trainer.train()

LoRA Fine-tuning (Parameter-Efficient)

python · peft / lora

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                   # rank — higher = more capacity, more params
    lora_alpha=32,          # scaling factor (alpha/r is the effective scale)
    target_modules=["q_proj", "v_proj"],   # inject into attention Q and V
    lora_dropout=0.05,
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints trainable vs. total parameter counts; with this config well under 1% of parameters are trainable

🔑

LoRA intuition: Instead of learning a full weight delta ΔW (d×d matrix), LoRA decomposes it as ΔW = A·B where A is d×r and B is r×d, with rank r ≪ d. For d=4096 and r=16 that's 4096² ≈ 16.8M vs 2×4096×16 ≈ 131K parameters, a 128× reduction. Empirically it often comes close to full fine-tuning performance on many tasks.
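The parameter arithmetic can be checked directly. The helper names below are assumptions; the shapes follow the A (d×r) and B (r×d) decomposition described above.

```python
def full_delta_params(d):
    return d * d                  # dense ΔW of shape (d, d)

def lora_params(d, r):
    return d * r + r * d          # A is (d, r), B is (r, d)

d, r = 4096, 16
full, low_rank = full_delta_params(d), lora_params(d, r)
print(full, low_rank, full // low_rank)   # 16777216 131072 128

# The reduction factor is d / (2r), so it shrinks as the rank grows
for rank in (8, 16, 64):
    print(rank, full_delta_params(d) // lora_params(d, rank))
```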

📉

Training Loop Essentials

The machinery behind learning — loss, optimizers, schedulers

python · complete pytorch training loop

from torch.utils.data import DataLoader
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, n_epochs=10, device='cuda'):
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=3e-4, weight_decay=0.01
    )
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=3e-4,
        steps_per_epoch=len(train_loader),
        epochs=n_epochs,
    )
    criterion = nn.CrossEntropyLoss()
    scaler    = torch.amp.GradScaler('cuda')   # automatic mixed precision (PyTorch ≥ 2.3; earlier: torch.cuda.amp.GradScaler())

    best_val_acc, patience_counter = 0, 0

    for epoch in range(n_epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)  # set_to_none=True is the default in PyTorch 2.x

            with torch.autocast(device_type='cuda', dtype=torch.float16):
                logits = model(x)
                loss   = criterion(logits, y)

            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            scheduler.step()
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                preds = model(x).argmax(dim=1)
                correct += (preds == y).sum().item()
                total   += y.size(0)

        val_acc = correct / total
        print(f"Epoch {epoch+1} | loss {train_loss/len(train_loader):.4f} | val_acc {val_acc:.4f}")

        # ── Early stopping ──
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), 'best.pt')
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= 5:   # stop after 5 epochs without improvement
                break

Component	Common Choice	When to Change
Optimizer	`AdamW`	SGD+momentum for large vision models; Adafactor for huge LLMs (saves memory)
Learning rate	`3e-4` (new), `2e-5` (fine-tune)	Use LR finder or cosine schedule with warmup; always schedule
Loss (classification)	`CrossEntropyLoss`	Label smoothing (0.1) for overconfident models; focal loss for class imbalance
Loss (regression)	`MSELoss`	Huber loss (SmoothL1) when outliers are present
Scheduler	`OneCycleLR`	Cosine annealing with warmup for Transformer fine-tuning
Gradient clipping	`max_norm=1.0`	Always apply for RNNs and Transformers; reduces instability
Mixed precision	`torch.autocast + GradScaler`	Worth enabling by default on GPUs with tensor cores: often up to ~2× faster with lower activation memory; with bfloat16 the GradScaler is unnecessary

🛡️

Regularization

Preventing overfitting — the most common real-world challenge

Dropout

Randomly zero out activations during training with probability p. Forces the network to not rely on any single feature. Can be interpreted as approximately training an ensemble of up to 2ⁿ weight-sharing sub-networks (n = number of units that can be dropped). Disable during inference (model.eval() handles this automatically).

nn.Dropout(p=0.5) — typical; use 0.1–0.2 in Transformers

Batch Normalization

Normalize layer inputs to zero mean, unit variance per-batch, then learn scale/shift. Originally motivated as reducing internal covariate shift, though later work attributes much of the benefit to a smoother optimization landscape. Enables higher learning rates. A form of regularization through noise injection (batch statistics vary). Standard in CNNs; replaced by Layer Norm in Transformers.

Weight Decay (L2)

Add penalty λ||W||² to the loss, shrinking weights toward zero. In AdamW, this is decoupled from the adaptive learning rate — use AdamW, not Adam + L2. Typical: weight_decay=0.01.
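Why "AdamW, not Adam + L2" matters shows up already on the first update of a single scalar weight. This is a simplified sketch (one weight, first step only, made-up values), not the full optimizer: with L2 the penalty is folded into the gradient and then normalized away by Adam's adaptive scaling, whereas the decoupled decay shrinks the weight directly.

```python
import math

def adam_first_step(w, grad, lr=1e-3, wd=0.0, decoupled=False,
                    eps=1e-8, beta1=0.9, beta2=0.999):
    """First Adam update for one scalar weight, with bias correction."""
    g = grad if decoupled else grad + wd * w       # L2: penalty enters the gradient
    m_hat = ((1 - beta1) * g) / (1 - beta1)        # bias-corrected first moment
    v_hat = ((1 - beta2) * g * g) / (1 - beta2)    # bias-corrected second moment
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * wd * w                       # AdamW: decay applied to the weight itself
    return w_new

w, grad = 10.0, 0.1
no_decay = adam_first_step(w, grad)
adam_l2 = adam_first_step(w, grad, wd=0.1)
adamw = adam_first_step(w, grad, wd=0.1, decoupled=True)
print(no_decay, adam_l2, adamw)
```

On this first step Adam + L2 moves the weight by essentially the same amount as no decay at all, while AdamW applies an additional shrink of lr × wd × w. Over many steps the moving averages blur the picture, but the qualitative difference remains.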

Data Augmentation

Artificially expand training data by applying label-preserving transforms. For vision: random crop, flip, color jitter, mixup, cutmix. For text: back-translation, random deletion/swap, paraphrase generation. Most effective regularizer when data is scarce.

✅

Regularization priority: (1) Get more data / augment first. (2) Reduce model capacity (fewer layers / filters). (3) Add dropout. (4) Add weight decay. Never add all at once — you won't know what helped. And always monitor train vs. validation loss to diagnose overfitting vs. underfitting.

🗒️

Cheat Sheet

Quick reference for architecture selection and hyperparameter defaults

Architecture Selection

Task	Architecture	Starting Point
Image classification	CNN → ResNet/EfficientNet	`torchvision.models.resnet50(weights='IMAGENET1K_V2')`
Object detection	CNN backbone + head	YOLOv8, DETR, Faster R-CNN
Text classification	Transformer encoder	`bert-base-uncased` via HuggingFace
Text generation	Transformer decoder	`gpt2` or `Llama-3.2-1B` via HuggingFace
Seq-to-seq (translation, summarization)	Encoder-decoder Transformer	`t5-base` or `facebook/bart-large-cnn`
Sentence embeddings	BERT-style + pooling	`sentence-transformers/all-MiniLM-L6-v2`
Time series (short)	LSTM / GRU	`nn.LSTM(input_size, hidden, num_layers=2, batch_first=True)`
Time series (long-range)	Temporal Transformer	Temporal Fusion Transformer, PatchTST

Hyperparameter Defaults

Hyperparameter	Default	Notes
Learning rate (from scratch)	`1e-3` to `3e-4`	Use LR warmup + cosine decay
Learning rate (fine-tuning)	`1e-5` to `5e-5`	10-100× smaller than from-scratch
Batch size	`32` (vision), `16` (language)	Larger batches → less noisy gradients but can generalize worse (sharper minima) unless the LR is retuned; use gradient accumulation if OOM
Dropout (CNN)	`0.3–0.5`	Applied before final FC layer
Dropout (Transformer)	`0.1`	Applied to attention weights and FFN
Weight decay	`0.01`	Applies to weights, not biases/norms
Gradient clip	`1.0`	Essential for RNNs, recommended for Transformers
Warmup steps	5–10% of total steps	Crucial for large models to avoid early instability
LoRA rank r	`8–16`	Start at 8; increase if performance plateaus
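The gradient accumulation mentioned in the batch size row works because averaging the gradients of equally sized micro-batches reproduces the full-batch gradient. The sketch below checks this on a toy one-weight regression problem (model, data, and sizes are all made up); in PyTorch the same idea amounts to calling `(loss / accum_steps).backward()` on each micro-batch and `optimizer.step()` once every `accum_steps` micro-batches.

```python
def mse_grad(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(i), 3.0 * i + 1.0) for i in range(8)]   # toy dataset of 8 samples
w = 0.5

full_batch_grad = mse_grad(w, data)                    # what one batch of 8 would give

accum_steps, micro = 4, 2                              # 4 micro-batches of 2 samples
accumulated = 0.0
for k in range(accum_steps):
    micro_batch = data[k * micro:(k + 1) * micro]
    accumulated += mse_grad(w, micro_batch) / accum_steps   # scale each micro-batch

print(full_batch_grad, accumulated)                    # -94.5 -94.5
```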

Debugging Loss Curves

Overfitting

Training loss ↓, validation loss ↑ (diverges after some point)
Fix: more data, augmentation, dropout, weight decay, reduce model size, early stopping

Underfitting

Both train and val loss stay high, don't converge
Fix: larger model, more epochs, higher learning rate, less regularization

Exploding Gradients

Loss becomes NaN or jumps wildly
Fix: gradient clipping (max_norm=1.0), lower LR, use mixed precision carefully

Loss Not Moving

Loss flat from epoch 1
Fix: check LR (too small), check data pipeline (all same class?), verify model.train() is called, check gradient flow with param.grad

📚

Essential resources: Andrej Karpathy's "Neural Networks: Zero to Hero" series (builds everything from scratch) · fast.ai Practical Deep Learning · Dive into Deep Learning (d2l.ai) · HuggingFace course (free) · Stanford CS231n (vision) and CS224n (NLP).