
Neural Networks
Handbook

A complete reference for understanding and building neural networks — from the mathematical foundations of the perceptron to training dynamics, backpropagation, and production-ready code patterns.

🧠

Introduction

What neural networks are and why they work

A neural network is a computational graph of interconnected nodes arranged in layers. Each layer transforms its inputs into a richer representation, and by stacking many such transformations, a network can learn highly complex functions from raw data — without hand-engineering features.

Universal Approximators

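A network with a single hidden layer, a non-polynomial activation, and enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy (Universal Approximation Theorem). The theorem guarantees that such a network exists; it does not say how many neurons are needed or that training will find the right weights.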
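As a rough illustration of the approximation claim, the sketch below fits sin(x) with one hidden tanh layer at two different widths. To keep it short, the hidden weights are random and fixed and only the output weights are solved by least squares; this is a simplification chosen for the example, not how networks are normally trained. The function name `fit_random_feature_net` and all sizes are hypothetical.

```python
import numpy as np

def fit_random_feature_net(x, y, width, rng):
    """One hidden tanh layer with fixed random weights; only the output layer is fitted."""
    W = rng.normal(size=(width, 1))                 # hidden weights (random, not trained)
    b = rng.normal(size=width)                      # hidden biases
    H = np.tanh(x[:, None] @ W.T + b)               # hidden activations, shape (n, width)
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)    # least-squares output weights
    return np.max(np.abs(H @ coef - y))             # worst-case error on the fitting grid

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
y = np.sin(x)

err_narrow = fit_random_feature_net(x, y, width=2, rng=rng)
err_wide = fit_random_feature_net(x, y, width=100, rng=rng)
print(f"max error with 2 hidden neurons:   {err_narrow:.4f}")
print(f"max error with 100 hidden neurons: {err_wide:.6f}")
```

The wider layer should fit the curve far more closely, which is the qualitative behavior the theorem describes.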

Learned Representations

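Unlike classical ML pipelines built on hand-engineered features, networks learn their own feature extractors. In vision models, early layers tend to detect edges and textures, while deeper layers respond to more abstract concepts. This reduces the need for manual feature engineering, although domain knowledge still helps with data preparation and architecture choices.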

Differentiable Programs

Every operation is differentiable (or differentiable almost everywhere, as with ReLU), so we can compute the gradient of a loss function with respect to every weight and update the weights automatically via backprop.

Anatomy of a Neural Network

📥 Input layer
🔘 Hidden layer 1
🔘 Hidden layer 2
🔘 Hidden layer n
📤 Output layer

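Depth vs. Width: Adding layers (depth) lets a network compose features hierarchically, which often helps on complex tasks. Adding neurons per layer (width) increases each layer's representational capacity. Modern architectures usually favor depth over extreme width, although very deep networks need techniques such as skip connections and normalization to train reliably.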

Perceptrons

The atomic unit of every neural network

The perceptron, introduced by Rosenblatt in 1958, is the building block of all modern neural networks. It computes a weighted sum of its inputs, adds a bias, then passes the result through an activation function.

Single Neuron Output
y = σ(W · x + b)
= σ(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
Inputs (x)

A vector of features from the previous layer (or raw data for the first layer). Each input xᵢ is a scalar value — a pixel intensity, a word embedding dimension, a measurement.

Weights (W) & Bias (b)

Learnable parameters. Weights scale each input's contribution; the bias shifts the activation threshold. These are what gradient descent optimizes during training.

Pre-activation (z)

The raw dot product before the activation: z = W · x + b. Also called the "net input" or "logit". Unbounded in range — can be any real number.

Activation (σ)

A non-linear function applied to z. Non-linearity is essential — without it, stacking layers collapses to a single linear transformation. See the Activations section.

From Perceptron to Layer

In a fully-connected (dense) layer, every input is connected to every output neuron. This is computed efficiently as matrix multiplication:

Dense Layer (vectorized over N neurons)
Z = X · Wᵀ + b
A = σ(Z)
where X ∈ ℝ^(batch × features), W ∈ ℝ^(neurons × features), b ∈ ℝ^neurons
Python — NumPy
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ── Single perceptron (manual) ──
def perceptron(x, w, b):
    z = np.dot(w, x) + b          # weighted sum + bias
    a = sigmoid(z)                # activation
    return a

# ── Dense layer (vectorized over batch) ──
def dense_layer(X, W, b, activation):
    Z = X @ W.T + b               # (batch, features) @ (neurons, features)ᵀ
    return activation(Z)          # (batch, neurons)

# Example: batch of 4 samples, 3 features → 5 neurons
X = np.random.randn(4, 3)         # 4 samples, 3 features
W = np.random.randn(5, 3)         # 5 neurons, each with 3 weights
b = np.zeros(5)
output = dense_layer(X, W, b, sigmoid)
print(output.shape)               # (4, 5)
🔑

Why bias matters: The bias b shifts the activation function horizontally. Without it, the hyperplane defined by a neuron must always pass through the origin, severely limiting what patterns it can learn. Bias gives each neuron an independent threshold.
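A small sketch of the point about the origin, using a hypothetical one-dimensional rule (output 1 only when x > 2). The helper `neuron_side` and the grid of candidate weights are assumptions made for the example.

```python
def neuron_side(x, w, b=0.0):
    """Which side of the decision boundary w*x + b = 0 the input x falls on."""
    return 1 if w * x + b > 0 else 0

# Target rule on a 1-D input: output 1 only when x > 2
samples = [(1.0, 0), (3.0, 1)]

# With a bias, w = 1 and b = -2 place the threshold at x = 2
with_bias_ok = all(neuron_side(x, w=1.0, b=-2.0) == y for x, y in samples)

# Without a bias the threshold is stuck at x = 0, so x = 1 and x = 3
# always land on the same side, whatever w is
candidate_ws = [k / 10 for k in range(-50, 51)]
without_bias_ok = any(
    all(neuron_side(x, w) == y for x, y in samples) for w in candidate_ws
)

print(with_bias_ok)      # True
print(without_bias_ok)   # False
```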

Activation Functions

Non-linearities that give networks their expressive power

Without non-linear activations, any deep network collapses to a single matrix multiplication — and can only model linear relationships. Activation functions introduce the non-linearity that allows networks to learn arbitrarily complex mappings.
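The collapse can be checked numerically. The sketch below stacks two dense layers without an activation and shows that a single merged layer reproduces the output exactly; the shapes and random values are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                  # batch of 4 samples, 3 features
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

# Two stacked layers with NO activation in between
two_layers = (X @ W1.T + b1) @ W2.T + b2

# The same mapping written as a single affine layer
W_merged = W2 @ W1                           # shape (2, 3)
b_merged = W2 @ b1 + b2                      # shape (2,)
one_layer = X @ W_merged.T + b_merged

linear_collapses = np.allclose(two_layers, one_layer)
print(linear_collapses)                      # True

# With a ReLU in between, the merged layer no longer reproduces the output
with_relu = np.maximum(0, X @ W1.T + b1) @ W2.T + b2
relu_collapses = np.allclose(with_relu, one_layer)
print(relu_collapses)                        # False
```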

Function | Typical role | Range | Formula
ReLU | Recommended | [0, ∞) | f(x) = max(0, x)
GELU | Recommended | ≈ [-0.17, ∞) | f(x) = x · Φ(x)
Leaky ReLU | Deep nets | (-∞, ∞) | f(x) = max(αx, x), α ≈ 0.01
Sigmoid | Output layer | (0, 1) | f(x) = 1 / (1 + e⁻ˣ)
Softmax | Output layer | (0, 1), sums to 1 | f(x)ᵢ = eˣⁱ / Σⱼ eˣʲ
Tanh | Legacy | (-1, 1) | f(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Choosing an Activation Function

Function | Use case | Problem to watch for
ReLU | Default choice for hidden layers in feedforward networks and CNNs | Dying ReLU: neurons stuck at 0 if their input is always negative
GELU | Transformers (GPT, BERT), modern architectures | Slower to compute than ReLU
Leaky ReLU | Deep networks where dying ReLU is a concern | α is a hyperparameter; alternatively use PReLU to learn it
Sigmoid | Binary and multi-label classification outputs; gates in LSTMs and GRUs | Vanishing gradients in deep hidden layers
Softmax | Multi-class classification output (mutually exclusive classes) | Numerical instability: use log-softmax + NLLLoss
Tanh | RNNs, LSTM candidate and cell-state activations, some shallow networks | Vanishing gradients; mostly replaced by ReLU variants in feedforward hidden layers

The vanishing gradient problem: Sigmoid and tanh saturate near their asymptotes — their gradients approach zero. In deep networks, multiplying near-zero gradients through many layers causes earlier layers to receive virtually no learning signal. This is why ReLU-family activations dominate hidden layers in modern architectures.
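A back-of-the-envelope sketch of the effect: multiply the local activation derivative across 20 layers. It deliberately ignores the weight matrices and assumes the same hypothetical pre-activation value at every layer, so it only shows the contribution of the activation function itself.

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 20
z = 2.0   # hypothetical pre-activation, assumed identical at every layer

sigmoid_factor = sigmoid_grad(z) ** depth   # product of 20 local derivatives
relu_factor = relu_grad(z) ** depth

print(f"sigmoid'(0) = {sigmoid_grad(0.0):.2f}  (the largest value it can take)")
print(f"sigmoid chain over {depth} layers: {sigmoid_factor:.3e}")
print(f"ReLU chain over {depth} layers:    {relu_factor:.1f}")
```

Even at its maximum of 0.25, the sigmoid derivative shrinks the signal at every layer, while an active ReLU passes it through unchanged.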

Python — all activations
import numpy as np
from scipy.special import erf

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype('float32')   # 1 where z > 0, else 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                 # σ(z)(1 - σ(z))

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1 - np.tanh(z)**2

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

def gelu(z):
    return 0.5 * z * (1 + erf(z / np.sqrt(2)))      # exact form: z · Φ(z)
➡️

Forward Propagation

Passing data through the network to make a prediction

Forward propagation passes an input X through every layer of the network to produce a prediction ŷ. Each layer applies a linear transformation followed by an activation. The intermediate values (every Z^l and A^l) are stored along the way, because backpropagation reuses them to compute gradients.

Layer-wise forward pass (L layers)
A⁰ = X    (input)
Z^l = A^(l-1) · W^lᵀ + b^l    (pre-activation)
A^l = σ^l(Z^l)    (post-activation)
ŷ = A^L    (output of final layer)
Python — 3-layer network forward pass
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

# ── Initialize parameters ──
def init_params(layer_sizes):
    """layer_sizes: [input_dim, h1, h2, output_dim]"""
    params = {}
    for i in range(1, len(layer_sizes)):
        fan_in = layer_sizes[i-1]
        fan_out = layer_sizes[i]
        # He initialization, designed for ReLU layers
        params[f'W{i}'] = np.random.randn(fan_out, fan_in) * np.sqrt(2 / fan_in)
        params[f'b{i}'] = np.zeros(fan_out)
    return params

# ── Forward pass ──
def forward(X, params, num_layers):
    cache = {'A0': X}                  # cache all values for backprop
    A = X
    for l in range(1, num_layers + 1):
        W = params[f'W{l}']
        b = params[f'b{l}']
        Z = A @ W.T + b                # linear transformation
        if l < num_layers:
            A = np.maximum(0, Z)       # ReLU for hidden layers
        else:
            A = softmax(Z)             # softmax for output
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    return A, cache                    # A = ŷ (predictions)

# ── Example ──
params = init_params([784, 256, 128, 10])   # MNIST-like
X_batch = np.random.randn(32, 784)          # 32 images
probs, cache = forward(X_batch, params, num_layers=3)
print(probs.shape)                          # (32, 10)

Always cache intermediate values. Every Z^l and A^l computed during the forward pass must be stored — backpropagation needs them to compute gradients efficiently. This is the core memory/speed trade-off in neural network training.

🔁

Backpropagation

Computing gradients via the chain rule

Backpropagation ("backprop") is an efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus systematically, propagating error signals backwards from the output layer to the input layer.

The Chain Rule

The key insight: if L depends on A which depends on Z which depends on W, then:

Chain Rule applied to backprop
∂L/∂W = ∂L/∂A · ∂A/∂Z · ∂Z/∂W

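δ^l = (δ^(l+1) · W^(l+1)) ⊙ σ'(Z^l)    (error at layer l; ⊙ is the element-wise product)
∂L/∂W^l = (δ^l)ᵀ · A^(l-1)    (weight gradient, same shape as W^l)
∂L/∂b^l = sum(δ^l, axis=0)    (bias gradient, summed over the batch)
where δ^l = ∂L/∂Z^l has shape (batch, neurons_l), matching the row-per-sample convention Z^l = A^(l-1) · W^lᵀ + b^l used in the forward pass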
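A common way to sanity-check a hand-derived gradient is to compare it with a finite-difference estimate. The sketch below does this for one sigmoid layer with a squared-error loss, using the row-per-sample layout from the forward pass; the loss choice, shapes, and function names are assumptions made for the example.

```python
import numpy as np

def loss_fn(W, b, X, y):
    """Half mean squared error of a single sigmoid layer: A = sigmoid(X @ W.T + b)."""
    A = 1 / (1 + np.exp(-(X @ W.T + b)))
    return 0.5 * np.sum((A - y) ** 2) / X.shape[0]

def analytic_grad_W(W, b, X, y):
    A = 1 / (1 + np.exp(-(X @ W.T + b)))
    dA = (A - y) / X.shape[0]        # ∂L/∂A
    dZ = dA * A * (1 - A)            # ∂L/∂Z = ∂L/∂A ⊙ σ'(Z)
    return dZ.T @ X                  # ∂L/∂W, same shape as W

def numeric_grad_W(W, b, X, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus, b, X, y) - loss_fn(W_minus, b, X, y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.uniform(size=(8, 2))
W = rng.normal(size=(2, 3))
b = np.zeros(2)

max_diff = np.max(np.abs(analytic_grad_W(W, b, X, y) - numeric_grad_W(W, b, X, y)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```

If the two gradients disagree by more than a small tolerance, the analytic derivation (or its implementation) likely has a bug.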

The Four Steps of Backprop

📊 Step 1: Compute loss
📤 Step 2: Output gradient
🔙 Step 3: Back through layers
🔧 Step 4: Update weights
Python — manual backprop
import numpy as np

def cross_entropy_loss(probs, y_true):
    """y_true: one-hot encoded (batch, classes)"""
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(probs + 1e-9)) / m

def backward(probs, y_true, cache, params, num_layers):
    m = y_true.shape[0]
    grads = {}

    # ── Gradient at output layer (softmax + cross-entropy) ──
    # Combined gradient of cross-entropy + softmax = (ŷ - y) / m
    dZ = (probs - y_true) / m                  # (batch, classes)

    for l in reversed(range(1, num_layers + 1)):
        A_prev = cache[f'A{l-1}']

        # ── Gradients w.r.t. weights and biases ──
        grads[f'dW{l}'] = dZ.T @ A_prev        # (neurons_l, neurons_{l-1})
        grads[f'db{l}'] = dZ.sum(axis=0)       # (neurons_l,)

        if l > 1:
            # ── Propagate error to previous layer ──
            dA_prev = dZ @ params[f'W{l}']     # chain rule step
            Z_prev = cache[f'Z{l-1}']
            dZ = dA_prev * (Z_prev > 0)        # ⊙ ReLU derivative

    return grads

# ── Weight update (SGD) ──
def update_params(params, grads, lr=0.01):
    for key in params:
        params[key] -= lr * grads[f'd{key}']   # 'W1' pairs with 'dW1', 'b1' with 'db1'
    return params

Autograd in practice: In PyTorch, calling loss.backward() automatically runs the entire backward pass via dynamic autograd. You never implement backprop manually in production — but understanding it is essential for debugging, designing custom layers, and understanding why certain architectures work.

Computational Graph & Autograd

PyTorch — autograd
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

criterion = nn.CrossEntropyLoss()        # includes softmax + NLL
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(32, 784)                 # batch of 32
y = torch.randint(0, 10, (32,))          # class labels

# One training step
optimizer.zero_grad()                    # ① clear accumulated gradients
logits = model(X)                        # ② forward pass (builds graph)
loss = criterion(logits, y)              # ③ compute loss
loss.backward()                          # ④ backward pass (backprop)
optimizer.step()                         # ⑤ update weights

print(f'Loss: {loss.item():.4f}')
🔄

Epochs, Batches & Learning Rate

The three key controls of the training loop
Epoch

One complete pass through the entire training dataset. Training typically runs for multiple epochs until the model converges. Early stopping halts when validation loss stops improving.

Batch (Mini-batch)

A subset of the training data processed together in one forward/backward pass. Batch size controls the trade-off between gradient noise (small) and memory usage + speed (large).

Learning Rate

How much to scale each gradient update: W ← W - lr · ∇W. The single most important hyperparameter. Too high: diverges. Too low: trains extremely slowly.
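The three regimes can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose gradient is 2w; the function and the specific learning rates are illustrative assumptions, not recommendations.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2w) from w0 with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # W ← W - lr · ∇W
    return w

for lr in (1.1, 0.1, 0.0001):
    print(f"lr={lr:<7} final |w| = {abs(gradient_descent(lr)):.3e}")
```

With lr = 1.1 each step overshoots and |w| grows, with lr = 0.1 it converges quickly toward 0, and with lr = 0.0001 it has barely moved after 50 steps.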

Relationships & Terminology

Term | Definition | Typical value
Epoch | One full pass over all training samples | 10 – 300+ (until convergence)
Batch size | Samples per gradient update step | 32, 64, 128, 256
Iteration / Step | One forward + backward + update cycle | ceil(N / batch_size) per epoch
Learning rate | Step size multiplier for gradient updates | 1e-4 to 1e-2 (model dependent)
LR schedule | Policy for changing LR during training | Cosine decay, warmup + decay
Early stopping | Stop when val loss hasn't improved in N epochs | patience = 5–20 epochs
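As a quick worked example of how these quantities relate, the sketch below counts optimizer steps for a hypothetical dataset of 50,000 samples. The helper name and the numbers are made up for illustration.

```python
import math

def training_steps(num_samples, batch_size, epochs):
    """Optimizer steps per epoch and in total (the last batch may be partial)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical setup: 50,000 samples, batch size 128, 30 epochs
per_epoch, total = training_steps(50_000, 128, 30)
print(per_epoch, total)   # 391 11730
```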

Batch Size Trade-offs

Small Batch (8–64)
  • Noisy gradients act as regularization
  • Often generalizes better to unseen data
  • Slower wall-clock time per epoch
  • Lower memory requirement
  • Usually needs a smaller learning rate than large batches
Large Batch (512–8192)
  • Lower-variance gradient estimates → smoother, more stable updates
  • Faster wall-clock time with parallel hardware
  • May require learning rate scaling (linear scaling rule)
  • Can overfit or converge to sharp minima
  • Standard for large language models with gradient accumulation
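Two of the ideas in this list can be sketched in a few lines: gradient accumulation (averaging micro-batch gradients reproduces the large-batch gradient when the micro-batches have equal size) and the linear scaling rule. The linear model, the batch sizes, and the reference learning rate are hypothetical values chosen for the example.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

# One large batch of 64 samples
full_grad = mse_grad(w, X, y)

# Gradient accumulation: 4 micro-batches of 16, averaged before the update
micro_batches = np.split(np.arange(64), 4)
accumulated = np.zeros_like(w)
for idx in micro_batches:
    accumulated += mse_grad(w, X[idx], y[idx]) / len(micro_batches)

grads_match = np.allclose(full_grad, accumulated)
print(grads_match)                           # True

# Linear scaling rule: scale the learning rate in proportion to the batch size
base_lr, base_batch = 0.1, 256               # hypothetical reference setting
scaled_lr = base_lr * 1024 / base_batch
print(scaled_lr)                             # 0.4
```

In practice the linear scaling rule is a heuristic and is usually combined with a warmup phase.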

Learning Rate Schedules

PyTorch — LR schedules
import torch
from torch.optim import lr_scheduler

# Assumes model, train_loader, val_loader, num_epochs, train_one_epoch,
# and evaluate are defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Pick ONE of the schedulers below; each assignment replaces the previous one.

# ── Step decay: multiply LR by γ every N epochs ──
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# ── Cosine annealing: smoothly decays to near-0 ──
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ── Warm-up + cosine decay (commonly used for transformers) ──
scheduler = lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_loader),
    epochs=30,
    pct_start=0.3,             # 30% warmup
)

# ── Reduce on plateau: detect stagnation ──
scheduler = lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

# ── Step once per epoch (OneCycleLR instead steps after every batch) ──
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    scheduler.step(val_loss)   # for ReduceLROnPlateau
    # scheduler.step()         # for StepLR and CosineAnnealingLR

The Full Training Loop

PyTorch — complete training loop
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_loader, val_loader, epochs=30, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    patience_count = 0
    patience = 5

    for epoch in range(epochs):
        # ── Training phase ──
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()                  # clear old gradients
            logits = model(X_batch)                # forward pass
            loss = criterion(logits, y_batch)      # compute loss
            loss.backward()                        # backward pass
            torch.nn.utils.clip_grad_norm_(        # gradient clipping
                model.parameters(), max_norm=1.0)
            optimizer.step()                       # weight update
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        val_loss = 0
        with torch.no_grad():                      # disable autograd for eval
            for X_val, y_val in val_loader:
                logits = model(X_val)
                val_loss += criterion(logits, y_val).item()

        avg_train = train_loss / len(train_loader)
        avg_val = val_loss / len(val_loader)
        print(f"Epoch {epoch+1:3d} | train: {avg_train:.4f} | val: {avg_val:.4f}")

        # ── Early stopping ──
        if avg_val < best_val_loss:
            best_val_loss = avg_val
            patience_count = 0
            torch.save(model.state_dict(), 'best_model.pth')
        else:
            patience_count += 1
            if patience_count >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break
⚙️

Optimizers

How gradient information is turned into weight updates
SGD (Stochastic Gradient Descent)

The baseline. Update rule: W ← W - lr · ∇W. With momentum, it accumulates a velocity vector along directions of consistent gradient, which damps oscillation and speeds up progress. With careful tuning of the learning rate and schedule, SGD with momentum often generalizes well in deep learning.
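The effect of the velocity term can be sketched on a small ill-conditioned quadratic. The example uses exact gradients rather than mini-batch estimates, and one common formulation of momentum (velocity as a decaying sum of gradients); the objective, learning rate, and step count are assumptions chosen for illustration.

```python
import math

def minimize(momentum, lr=0.02, steps=100):
    """Run gradient descent (optionally with momentum) on f(w) = 0.5 * (w0**2 + 25 * w1**2)."""
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        grad = [w[0], 25.0 * w[1]]               # exact gradient of f
        for i in range(2):
            v[i] = momentum * v[i] + grad[i]     # velocity: decaying sum of past gradients
            w[i] -= lr * v[i]
    return math.hypot(w[0], w[1])                # distance from the minimum at (0, 0)

dist_plain = minimize(momentum=0.0)
dist_momentum = minimize(momentum=0.9)
print(f"plain gradient descent: {dist_plain:.4f}")
print(f"with momentum 0.9:      {dist_momentum:.4f}")
```

With the same learning rate, the momentum run ends much closer to the minimum because it keeps moving along the shallow direction where plain gradient descent crawls.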

Adam

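Adaptive per-parameter learning rates computed from running estimates of the first moment (mean) and second moment (uncentered variance) of the gradient. A strong default for most tasks. AdamW is a variant with decoupled weight decay; AdaGrad and RMSProp are earlier adaptive methods that Adam builds on.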

Adam update rule
m_t = β₁·m_(t-1) + (1-β₁)·∇W    (1st moment, mean)
v_t = β₂·v_(t-1) + (1-β₂)·∇W²    (2nd moment, uncentered variance; the square is element-wise)
m̂_t = m_t / (1 - β₁ᵗ)    (bias correction)
v̂_t = v_t / (1 - β₂ᵗ)
W ← W - lr · m̂_t / (√v̂_t + ε)    (β₁=0.9, β₂=0.999, ε=1e-8)
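The update rule above translates almost line for line into code. The sketch below applies it to a single scalar parameter; the function name `adam_step` and the test gradients are assumptions for the example.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# The very first update has magnitude ≈ lr regardless of the gradient's scale
for grad in (0.02, 2.0, 200.0):
    w_new, _, _ = adam_step(w=1.0, grad=grad, m=0.0, v=0.0, t=1)
    print(f"grad={grad:7.2f}  first update={1.0 - w_new:.6f}")
```

This scale invariance is why Adam tends to be less sensitive to the raw magnitude of gradients than plain SGD.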
📉

Loss Functions

Measuring how wrong the network's predictions are
Loss | Task | Formula | PyTorch
Cross-Entropy | Multi-class classification | -Σ yᵢ·log(ŷᵢ) | nn.CrossEntropyLoss() (expects raw logits)
Binary Cross-Entropy | Binary classification | -y·log(ŷ) - (1-y)·log(1-ŷ) | nn.BCEWithLogitsLoss() (expects raw logits)
MSE | Regression | 1/n · Σ(y - ŷ)² | nn.MSELoss()
MAE / L1 | Regression (robust to outliers) | 1/n · Σ abs(y - ŷ) | nn.L1Loss()
Huber | Regression (combines MSE + MAE) | ½·e² if abs(e) ≤ δ, else δ·abs(e) - δ²/2, with e = y - ŷ | nn.HuberLoss()
KL Divergence | Distribution matching (VAEs) | Σ P·log(P/Q) | nn.KLDivLoss() (input must be log-probabilities)
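Two of these losses are worth seeing in code: cross-entropy computed directly from logits (via a max-shifted log-softmax, so extreme logits do not overflow) and the piecewise Huber loss. This is a NumPy sketch with hypothetical function names, not the PyTorch implementation.

```python
import numpy as np

def cross_entropy_from_logits(logits, labels):
    """Mean cross-entropy for integer class labels, computed via log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)     # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def huber(error, delta=1.0):
    """Element-wise Huber loss: quadratic near zero, linear in the tails."""
    abs_e = np.abs(error)
    return np.where(abs_e <= delta, 0.5 * error ** 2, delta * abs_e - 0.5 * delta ** 2)

logits = np.array([[1000.0, 0.0, -1000.0],   # extreme values that would overflow a naive exp
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

ce = cross_entropy_from_logits(logits, labels)
print(ce)                                    # ≈ 0.5493, i.e. ln(3) / 2

h = huber(np.array([0.5, 3.0]))
print(h)                                     # values 0.125 and 2.5
```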
🛡️

Regularization

Preventing overfitting in neural networks
Dropout

Randomly zeros out neurons during training with probability p (typically 0.2–0.5). Forces the network to learn redundant representations. Disabled at inference time.
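A minimal NumPy sketch of the mechanism. It uses the "inverted" variant, which rescales the surviving activations by 1/(1-p) during training so that nothing needs to change at inference; that rescaling is a detail added here as an assumption, and the helper name is hypothetical.

```python
import numpy as np

def dropout(a, p, training, rng):
    """Inverted dropout: zero each unit with probability p and rescale the rest."""
    if not training or p == 0.0:
        return a                                  # identity at inference time
    keep_mask = rng.random(a.shape) >= p          # True for units that survive
    return a * keep_mask / (1.0 - p)              # keep the expected activation unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
train_out = dropout(a, p=0.3, training=True, rng=rng)
eval_out = dropout(a, p=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"mean activation in training: {train_out.mean():.3f}")
print(f"unchanged at inference: {np.array_equal(eval_out, a)}")
```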

Batch Normalization

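Normalizes each feature of a layer's input to zero mean and unit variance over the mini-batch, then applies a learnable scale and shift. At inference time it uses running statistics collected during training instead of batch statistics. Reduces sensitivity to initialization, enables higher learning rates, and acts as a mild regularizer.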
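The training-mode computation can be sketched as follows. The names `gamma` and `beta` stand for the learnable scale and shift of the standard formulation, and the running statistics used at inference are left out to keep the example short; both are assumptions of this sketch.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over the batch axis for input of shape (batch, features)."""
    mean = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                         # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta                 # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # inputs far from zero mean / unit variance
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(3))   # close to 0 for every feature
print(out.std(axis=0).round(3))    # close to 1 for every feature
```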

Weight Decay (L2)

Adds a penalty term λ·||W||² to the loss that encourages small weights. AdamW applies weight decay correctly (decoupled from the adaptive LR). Use weight_decay=1e-4 as a starting point.

Data Augmentation

Artificially expands the training set by applying random transforms (flips, crops, rotations, and color jitter for vision; token masking and paraphrasing for NLP). Often one of the most effective regularizers when data is scarce.

💻

Code Examples

End-to-end patterns in PyTorch and TensorFlow/Keras

PyTorch — Custom MLP with Best Practices

PyTorch
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(prev_dim, h),
                nn.BatchNorm1d(h),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            prev_dim = h
        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

        # He initialization for all linear layers
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

# ── Instantiate ──
model = MLP(
    input_dim=784,
    hidden_dims=[512, 256, 128],
    output_dim=10,
    dropout=0.3,
)
print(sum(p.numel() for p in model.parameters()), 'trainable parameters')

TensorFlow / Keras

TensorFlow 2 / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim, hidden_dims, output_dim, dropout=0.3):
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for h in hidden_dims:
        x = layers.Dense(h, kernel_initializer='he_normal')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('gelu')(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(output_dim)(x)      # raw logits (no softmax)
    return keras.Model(inputs, outputs)

model = build_mlp(784, [512, 256, 128], 10)
model.compile(
    # weight_decay on Adam assumes a recent TensorFlow/Keras release
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# x_train and y_train are assumed to be loaded elsewhere
model.fit(
    x_train, y_train,
    epochs=30,
    batch_size=128,
    validation_split=0.15,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    ]
)
📖

Glossary

Key terms and definitions
Term | Definition
Neuron / Node | A single computational unit: computes a weighted sum of inputs, adds a bias, applies an activation function.
Weight | A learnable scalar parameter that scales an input. Stored in matrices W for efficient computation.
Bias | A learnable offset added to the weighted sum. Allows neurons to activate even when all inputs are zero.
Activation | A non-linear function applied after the linear transformation. Introduces expressive power to the network.
Layer | A collection of neurons operating in parallel. Input, hidden, and output layers form the network's depth.
Parameters | All learnable values in a network: all weights and biases. The count determines model capacity.
Loss / Cost | A scalar measuring prediction error. The objective function minimized by gradient descent.
Gradient | The vector of partial derivatives of the loss w.r.t. each parameter. Points in the direction of steepest ascent.
Gradient descent | Iteratively move parameters opposite to the gradient to minimize loss. The fundamental optimization algorithm.
Mini-batch | A subset of training data. Stochastic gradient descent (SGD) uses batch_size=1; mini-batch uses 16–512.
Epoch | One complete pass over the entire training dataset (many mini-batch updates).
Learning rate | The step size multiplier for parameter updates. Controls how fast the model learns.
Overfitting | Model memorizes training data but fails to generalize. Low train loss, high val loss.
Underfitting | Model is too simple or undertrained. High train AND val loss.
Generalization | The ability to perform well on unseen data. The ultimate goal of training.
Inference | Running the trained model to make predictions on new data. No gradient computation needed.
Logit | The raw (pre-softmax/sigmoid) output of the final linear layer.
He initialization | Weight init scheme for ReLU networks: W ~ N(0, 2/fan_in). Prevents vanishing/exploding activations.
🗒️

Cheat Sheet

Quick-reference one-liners for daily neural network work
Key hyperparameter starting points
  • Learning rate: 3e-4 (Karpathy's constant for Adam)
  • Batch size: 32 or 64 to start
  • Weight decay: 1e-4 with AdamW
  • Dropout: 0.1–0.3 hidden, 0.0 on output
  • Gradient clipping: max_norm=1.0
Debugging checklist
  • Overfit a single batch first — loss should reach ~0
  • Check input normalization (zero mean, unit variance)
  • Print gradient norms: roughly 1e-3 to 1e-1 is typical; values near 0 or very large ones point to vanishing or exploding gradients
  • Confirm model.train() / model.eval() toggling
  • Use torch.no_grad() for all eval/inference
Architecture rules of thumb
  • Use BatchNorm before activation in deep networks
  • Skip connections (ResNet-style) for >10 layers
  • Start with fewer parameters, scale up gradually
  • He init for ReLU/GELU, Xavier for sigmoid/tanh
Output layer guide
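  • Binary classification: 1 neuron + sigmoid + BCE (in PyTorch: raw logit + nn.BCEWithLogitsLoss)
  • Multi-class: N neurons + softmax + CrossEntropy (in PyTorch: raw logits + nn.CrossEntropyLoss, which applies softmax internally)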
  • Regression: N neurons + no activation + MSE/Huber
  • Multi-label: N neurons + sigmoid + BCE per class
🔑

The fundamental trade-off: Every decision in neural network design balances model capacity (can it represent the target function?) against generalization (will it work on unseen data?). More parameters = more capacity = more risk of overfitting. Regularization (dropout, weight decay, data augmentation, early stopping) is the toolkit for managing this trade-off.