Neural Networks Handbook
A complete reference for understanding and building neural networks — from the mathematical foundations of the perceptron to training dynamics, backpropagation, and production-ready code patterns.
Introduction
A neural network is a computational graph of interconnected nodes arranged in layers. Each layer transforms its inputs into a richer representation, and by stacking many such transformations, a network can learn highly complex functions from raw data — without hand-engineering features.
A network with a single hidden layer and enough neurons can approximate any continuous function on a compact domain (Universal Approximation Theorem).
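Unlike classical ML pipelines that depend on hand-crafted features, networks learn their own feature extractors. In vision models, for example, early layers tend to detect edges while deeper layers respond to higher-level concepts. Domain expertise matters less for feature design, though it still helps with data preparation and evaluation.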
Every operation is differentiable (or differentiable almost everywhere, as with ReLU), so backprop can compute the gradient of a loss function with respect to every weight, and gradient descent can then update the weights automatically.
Anatomy of a Neural Network
Depth vs. Width: Adding more layers (depth) lets a network compose features hierarchically, which often helps on complex tasks. Adding more neurons per layer (width) increases a layer's representational capacity. Modern practice generally favors deeper, narrower networks over shallow, wide ones.
Perceptrons
The perceptron, introduced by Rosenblatt in 1958, is the building block of all modern neural networks. It computes a weighted sum of its inputs, adds a bias, then passes the result through an activation function.
y = σ(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)
Inputs (x): A vector of features from the previous layer (or raw data for the first layer). Each input xᵢ is a scalar value such as a pixel intensity, a word embedding dimension, or a measurement.
Weights (W) & Bias (b): Learnable parameters. Weights scale each input's contribution; the bias shifts the activation threshold. These are what gradient descent optimizes during training.
Weighted sum (z): The weighted sum of the inputs plus the bias, before the activation: z = W · x + b. Also called the "net input" or "logit". Unbounded in range: it can be any real number.
Activation (σ): A non-linear function applied to z. Non-linearity is essential: without it, stacking layers collapses to a single linear transformation. See the Activations section.
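The four pieces above fit together in a few lines. The following is a minimal sketch in plain Python, assuming a sigmoid activation; the function names and the input values are illustrative and not taken from the handbook.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x: list[float], w: list[float], b: float) -> float:
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # pre-activation (net input)
    return sigmoid(z)

# Illustrative values: z = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5
y = perceptron(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```

Raising the bias pushes the output toward 1 for the same inputs, which is the "threshold shift" described above.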
From Perceptron to Layer
In a fully-connected (dense) layer, every input is connected to every output neuron. This is computed efficiently as matrix multiplication:
Z = X · Wᵀ + b
A = σ(Z)
where X ∈ ℝ^(batch × features), W ∈ ℝ^(neurons × features), b ∈ ℝ^neurons
Why bias matters: The bias b shifts the activation function horizontally. Without it, the hyperplane defined by a neuron must always pass through the origin, severely limiting what patterns it can learn. Bias gives each neuron an independent threshold.
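To make the shapes concrete, here is a small dense-layer sketch in plain Python using nested lists, with X as batch × features and W as neurons × features. The helper name `dense_forward` and the numbers are assumptions chosen for illustration; a real implementation would use a tensor library.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def dense_forward(X, W, b, activation):
    """Compute A = activation(X · Wᵀ + b) for a batch.

    X: batch × features, W: neurons × features, b: one bias per neuron.
    """
    A = []
    for x in X:                         # one row per sample
        row = []
        for w_j, b_j in zip(W, b):      # one output per neuron
            z = sum(w * xi for w, xi in zip(w_j, x)) + b_j
            row.append(activation(z))
        A.append(row)
    return A

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]   # batch = 2, features = 3
W = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]   # neurons = 2, features = 3
b = [0.5, 0.0]
A = dense_forward(X, W, b, relu)          # shape: batch × neurons
```

The second sample's first input is zero, yet the first neuron still outputs 0.5: that offset comes entirely from the bias.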
Activation Functions
Without non-linear activations, any deep network collapses to a single matrix multiplication — and can only model linear relationships. Activation functions introduce the non-linearity that allows networks to learn arbitrarily complex mappings.
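The collapse can be checked numerically. In this sketch (plain Python, illustrative weights, biases omitted for brevity), two stacked linear layers give exactly the same output as one layer whose weight matrix is the product of the two, while inserting a ReLU between them breaks the equivalence.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, 1.0], [-1.0, 1.0]]   # 2 inputs -> 3 hidden units
W2 = [[1.0, -1.0, 2.0]]                      # 3 hidden units -> 1 output
x = [1.0, -1.0]

two_layers = matvec(W2, matvec(W1, x))       # no activation in between
one_layer = matvec(matmul(W2, W1), x)        # single merged weight matrix

# With a ReLU between the layers the result can no longer be merged this way
hidden = [max(0.0, h) for h in matvec(W1, x)]
with_relu = matvec(W2, hidden)
```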
Choosing an Activation Function
| Function | Use case | Problem to watch for |
|---|---|---|
| ReLU | Default choice for hidden layers in feedforward networks, CNNs | Dying ReLU — neurons stuck at 0 if input always negative |
| GELU | Transformers (GPT, BERT), modern architectures | Slower to compute than ReLU |
| Leaky ReLU | Deep networks where dying ReLU is a concern | α is a hyperparameter; or use PReLU to learn it |
| Sigmoid | Binary classification output only | Vanishing gradients in deep hidden layers |
| Softmax | Multi-class classification output (mutually exclusive) | Numerical instability — use log-softmax + NLLLoss |
| Tanh | RNNs, LSTMs (gates), some shallow networks | Vanishing gradients; mostly replaced by ReLU variants |
The vanishing gradient problem: Sigmoid and tanh saturate near their asymptotes — their gradients approach zero. In deep networks, multiplying near-zero gradients through many layers causes earlier layers to receive virtually no learning signal. This is why ReLU-family activations dominate hidden layers in modern architectures.
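A few of the activations from the table, written out in plain Python as a hedged sketch. The `alpha=0.01` default for Leaky ReLU is an assumed common choice, and the softmax subtracts the maximum logit first, which is one standard way to avoid the numerical instability mentioned in the table. The last line illustrates the vanishing gradient argument: the sigmoid's derivative is at most 0.25, so ten stacked sigmoids scale a gradient by at most 0.25 to the tenth power.

```python
import math

def relu(z: float) -> float:
    return max(0.0, z)

def leaky_relu(z: float, alpha: float = 0.01) -> float:
    return z if z > 0 else alpha * z

def sigmoid(z: float) -> float:
    # Branch on the sign so that exp() never receives a large positive argument
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)            # maximum value 0.25, reached at z = 0

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)                 # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Best case for a gradient passing through 10 sigmoid layers (weights ignored)
best_case_after_10_layers = sigmoid_grad(0.0) ** 10
```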
Forward Propagation
Forward propagation is the process of passing an input X through every layer of the network to produce a prediction ŷ. Each layer applies a linear transformation followed by an activation. The intermediate values Z^l and A^l are cached along the way, together with a record of the operations that produced them (the computation graph), and both are used during backpropagation.
Z^l = A^(l-1) · W^lᵀ + b^l (pre-activation)
A^l = σ^l(Z^l) (post-activation)
ŷ = A^L (output of final layer)
Always cache intermediate values. Every Z^l and A^l computed during the forward pass must be stored — backpropagation needs them to compute gradients efficiently. This is the core memory/speed trade-off in neural network training.
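The forward equations and the caching advice can be sketched together. This is an illustrative plain-Python version, assuming each layer is given as a hypothetical (W, b, activation) tuple and that the cache is a simple list of dictionaries; frameworks store this information differently.

```python
def linear(A_prev, W, b):
    """Z = A_prev · Wᵀ + b for a batch stored as a list of rows."""
    return [[sum(w * a for w, a in zip(w_j, row)) + b_j for w_j, b_j in zip(W, b)]
            for row in A_prev]

def apply_elementwise(fn, Z):
    return [[fn(z) for z in row] for row in Z]

def forward(X, layers):
    """Run X through every layer; return the prediction and the cached Z/A values."""
    cache = [{"A": X}]               # A^0 is the input itself
    A = X
    for W, b, activation in layers:
        Z = linear(A, W, b)          # pre-activation
        A = apply_elementwise(activation, Z)  # post-activation
        cache.append({"Z": Z, "A": A})
    return A, cache

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),   # 2 inputs -> 2 hidden units
    ([[1.0, 2.0]], [0.1], identity),                 # 2 hidden units -> 1 output
]
y_hat, cache = forward([[2.0, 1.0]], layers)
```

The cache grows with both depth and batch size, which is where the memory cost of training comes from.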
Backpropagation
Backpropagation ("backprop") is an efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus systematically, propagating error signals backwards from the output layer to the input layer.
The Chain Rule
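The key insight: if L depends on A which depends on Z which depends on W, then:
∂L/∂W = ∂L/∂A · ∂A/∂Z · ∂Z/∂W (chain rule)
Writing δ^l = ∂L/∂Z^l and using the same batch-as-rows convention as the forward pass (Z^l = A^(l-1) · W^lᵀ + b^l):
δ^l = (δ^(l+1) · W^(l+1)) ⊙ σ'(Z^l) (error at layer l)
∂L/∂W^l = δ^lᵀ · A^(l-1) (weight gradient)
∂L/∂b^l = sum(δ^l, axis=0) (bias gradient)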
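A hand-written backward pass is easiest to trust when it is checked against finite differences. The sketch below does this for a single training sample, so the batch sum in the bias gradient reduces to δ itself and the weight gradient is the outer product of δ and the previous layer's activations. The network (2 inputs, 2 tanh hidden units, 1 linear output), the squared-error loss, and all numeric values are assumptions chosen for illustration.

```python
import math

def forward(x, W1, b1, W2, b2):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    y_hat = sum(w * a for w, a in zip(W2, a1)) + b2    # linear output unit
    return a1, y_hat

def loss_fn(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def backward(x, y, W2, a1, y_hat):
    delta2 = y_hat - y                                  # dL/dz2 for a linear output
    dW2 = [delta2 * a for a in a1]
    db2 = delta2
    # Push the error back through W2, then through tanh'(z) = 1 - tanh(z)^2
    delta1 = [W2[j] * delta2 * (1.0 - a1[j] ** 2) for j in range(len(a1))]
    dW1 = [[d * xi for xi in x] for d in delta1]        # outer product of delta1 and x
    db1 = list(delta1)
    return dW1, db1, dW2, db2

x, y = [0.5, -1.0], 0.3
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [0.7, -0.5], 0.05

a1, y_hat = forward(x, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(x, y, W2, a1, y_hat)

def numeric_grad_W1(i, j, eps=1e-6):
    """Central finite difference of the loss with respect to W1[i][j]."""
    W_plus = [row[:] for row in W1]
    W_minus = [row[:] for row in W1]
    W_plus[i][j] += eps
    W_minus[i][j] -= eps
    loss_plus = loss_fn(forward(x, W_plus, b1, W2, b2)[1], y)
    loss_minus = loss_fn(forward(x, W_minus, b1, W2, b2)[1], y)
    return (loss_plus - loss_minus) / (2 * eps)

max_error = max(abs(dW1[i][j] - numeric_grad_W1(i, j))
                for i in range(2) for j in range(2))
```

If `max_error` is not tiny, the analytic gradient is wrong somewhere; this kind of check is a common first step when debugging a custom layer.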
The Four Steps of Backprop
Autograd in practice: In PyTorch, calling loss.backward() automatically runs the entire backward pass via dynamic autograd. You never implement backprop manually in production — but understanding it is essential for debugging, designing custom layers, and understanding why certain architectures work.
Computational Graph & Autograd
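To make the idea of a computational graph concrete, here is a toy reverse-mode autograd for scalars. It is a teaching sketch under simplifying assumptions (only addition and multiplication, scalar values), not a description of how PyTorch is implemented; the `Value` class and its field names are hypothetical.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None     # pushes this node's grad into its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Build a topological order so each node runs after everything that uses it
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

w, x, b = Value(2.0), Value(3.0), Value(1.0)
z = w * x + b          # z = 7
loss = z * z           # loss = z^2 = 49
loss.backward()        # fills in w.grad, x.grad, b.grad
```

By the chain rule, d(loss)/dw = 2·z·x = 42, d(loss)/dx = 2·z·w = 28, and d(loss)/db = 2·z = 14, which is what the backward pass accumulates.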
Epochs, Batches & Learning Rate
Epoch: One complete pass through the entire training dataset. Training typically runs for multiple epochs until the model converges. Early stopping halts training when validation loss stops improving.
Batch: A subset of the training data processed together in one forward/backward pass. Batch size controls the trade-off between gradient noise (small batches) and memory usage plus throughput (large batches).
Learning rate: How much to scale each gradient update: W ← W - lr · ∇W. Often the single most important hyperparameter. Too high and training diverges; too low and it trains extremely slowly.
Relationships & Terminology
| Term | Definition | Typical value |
|---|---|---|
| Epoch | One full pass over all training samples | 10 – 300+ (until convergence) |
| Batch size | Samples per gradient update step | 32, 64, 128, 256 |
| Iteration / Step | One forward + backward + update cycle | ceil(N / batch_size) per epoch |
| Learning rate | Step size multiplier for gradient updates | 1e-4 to 1e-2 (model dependent) |
| LR schedule | Policy for changing LR during training | Cosine decay, warmup + decay |
| Early stopping | Stop when val loss hasn't improved in N epochs | patience = 5–20 epochs |
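The "per epoch" arithmetic in the table is easy to check directly. A small sketch, using a hypothetical dataset of 50,000 samples:

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of gradient updates in one epoch; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)

def total_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    return steps_per_epoch(num_samples, batch_size) * epochs

# Hypothetical setup: 50,000 samples, batch size 128, 30 epochs
per_epoch = steps_per_epoch(50_000, 128)   # 390 full batches + 1 partial batch
overall = total_steps(50_000, 128, 30)
```

The total step count is what a learning rate schedule usually needs as its horizon.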
Batch Size Trade-offs
Small batches:
- Noisy gradients act as regularization
- Often generalizes better to unseen data
- Slower wall-clock time per epoch
- Lower memory requirement
- Can use higher learning rate effectively
Large batches:
- Accurate gradients → deterministic updates
- Faster wall-clock time with parallel hardware
- May require learning rate scaling (linear scaling rule)
- Can overfit or converge to sharp minima
- Standard for large language models with gradient accumulation
Learning Rate Schedules
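As one example of a schedule, the "warmup + decay" policy listed in the terminology table can be sketched as linear warmup followed by cosine decay. The function name, the (step + 1) warmup convention, and the `min_lr` default are assumptions made for illustration; libraries differ in these details.

```python
import math

def lr_at_step(step: int, base_lr: float, warmup_steps: int,
               total_steps: int, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; reaches base_lr on the last warmup step
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative run: 100 steps total, 10 of them warmup, peak learning rate 3e-4
schedule = [lr_at_step(s, 3e-4, warmup_steps=10, total_steps=100) for s in range(101)]
```

Warmup keeps early updates small while the optimizer statistics are still noisy; the cosine tail lowers the step size smoothly toward the end of training.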
The Full Training Loop
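The pieces from this section (epochs, shuffled mini-batches, the update W ← W - lr · ∇W, and early stopping on validation loss) can be combined into one loop. The sketch below uses plain Python and a one-parameter-pair linear model on synthetic data so that it runs without any framework; the data generator, hyperparameter values, and function names are all assumptions for illustration, and a PyTorch loop would follow the same outline with a model, an optimizer, and loss.backward().

```python
import random

def make_data(n, rng):
    """Synthetic 1-D regression data: y = 2x + 1 plus small Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)) for x in xs]

def mse(data, w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(train_set, val_set, lr=0.1, batch_size=16,
          max_epochs=200, patience=10, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    best_val, best_params, bad_epochs = float("inf"), (w, b), 0
    epochs_run = 0
    for _ in range(max_epochs):
        epochs_run += 1
        rng.shuffle(train_set)                               # new batch order each epoch
        for start in range(0, len(train_set), batch_size):   # one iteration per mini-batch
            batch = train_set[start:start + batch_size]
            # Forward and backward: gradients of the batch MSE w.r.t. w and b
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w                                  # W <- W - lr * grad
            b -= lr * grad_b
        val_loss = mse(val_set, w, b)
        if val_loss < best_val - 1e-6:                        # meaningful improvement
            best_val, best_params, bad_epochs = val_loss, (w, b), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                        # early stopping
                break
    return best_params, best_val, epochs_run

data = make_data(200, random.Random(42))
(w, b), val_loss, epochs_run = train(data[:160], data[160:])
```

Returning the best parameters rather than the last ones mirrors the usual practice of checkpointing on validation loss.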
Optimizers
SGD (with momentum): The baseline. Update rule: W ← W - lr · ∇W. Momentum adds a velocity vector that accumulates along directions where the gradient is consistent. Works well on convex problems and, with careful tuning, often generalizes well in deep learning.
Adam: Adaptive per-parameter learning rates using estimates of the first moment (mean) and second moment (uncentered variance) of the gradient. An excellent default for most tasks. AdamW is a variant with decoupled weight decay; AdaGrad and RMSProp are earlier adaptive methods that Adam builds on.
m_t = β₁·m_(t-1) + (1-β₁)·∇W (1st moment, mean)
v_t = β₂·v_(t-1) + (1-β₂)·∇W² (2nd moment, variance)
m̂_t = m_t / (1 - β₁ᵗ) (bias correction)
v̂_t = v_t / (1 - β₂ᵗ)
W ← W - lr · m̂_t / (√v̂_t + ε) (β₁=0.9, β₂=0.999, ε=1e-8)
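The update equations translate almost line for line into code. This is a minimal scalar sketch, assuming the standard first-moment update m_t = β₁·m_(t-1) + (1-β₁)·∇W alongside the equations above; the function name, the test objective f(w) = (w - 3)², and the learning rate of 0.1 are illustrative choices.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a scalar function given its gradient, using the Adam update."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g          # 1st moment (running mean)
        v = beta2 * v + (1 - beta2) * g * g      # 2nd moment (running mean of squares)
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Illustrative objective: f(w) = (w - 3)^2, so the gradient is 2 * (w - 3)
w_star = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
w_after_one_step = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, steps=1)
```

After bias correction the very first step has magnitude close to lr regardless of the gradient's scale, which is one reason Adam is less sensitive to gradient magnitude than plain SGD.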
Loss Functions
| Loss | Task | Formula | PyTorch |
|---|---|---|---|
| Cross-Entropy | Multi-class classification | -Σ yᵢ·log(ŷᵢ) | nn.CrossEntropyLoss() |
| Binary Cross-Entropy | Binary classification | -y·log(ŷ) - (1-y)·log(1-ŷ) | nn.BCEWithLogitsLoss() |
| MSE | Regression | 1/n · Σ(y - ŷ)² | nn.MSELoss() |
| MAE / L1 | Regression (robust to outliers) | 1/n · Σ\|y - ŷ\| | nn.L1Loss() |
| Huber | Regression (combines MSE+MAE) | ½·e² if \|e\| ≤ δ, else δ·\|e\| - δ²/2, with e = y - ŷ | nn.HuberLoss() |
| KL Divergence | Distribution matching (VAEs) | Σ P·log(P/Q) | nn.KLDivLoss() |
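Three of the losses in the table, written out per sample in plain Python as a hedged sketch. The cross-entropy and binary cross-entropy versions take raw logits and use numerically stable forms (the log-sum-exp trick, and the max(z, 0) - z·y + log(1 + e^(-|z|)) rearrangement); the function names are illustrative and are not the PyTorch API.

```python
import math

def cross_entropy_from_logits(logits: list[float], target: int) -> float:
    """-log(softmax(logits)[target]), computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def bce_with_logits(logit: float, y: float) -> float:
    """Binary cross-entropy on a raw logit, stable for large |logit|."""
    return max(logit, 0.0) - logit * y + math.log1p(math.exp(-abs(logit)))

def huber(y: float, y_hat: float, delta: float = 1.0) -> float:
    """Quadratic for small errors, linear for large ones."""
    e = abs(y - y_hat)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)

# A naive softmax would overflow on a logit of 1000; the stable form does not
large_logit_loss = cross_entropy_from_logits([1000.0, 0.0], target=1)
```

Working from logits rather than probabilities is the reason the stable forms exist: taking log of a probability that has already underflowed to zero cannot be repaired afterwards.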
Regularization
Dropout: Randomly zeros out neurons during training with probability p (typically 0.2–0.5). Forces the network to learn redundant representations. Disabled at inference time.
Batch normalization: Normalizes layer inputs to zero mean and unit variance per mini-batch, then applies a learnable scale and shift. Reduces sensitivity to initialization, enables higher learning rates, and acts as a mild regularizer. At inference time it uses running statistics collected during training.
Weight decay (L2): Adds a penalty term λ·||W||² to the loss that encourages small weights. AdamW decouples weight decay from the adaptive learning rate, which is why it is preferred over plain Adam with an L2 penalty. Use weight_decay=1e-4 as a starting point.
Data augmentation: Artificially expands the training set by applying random transforms (flips, crops, rotations, color jitter for vision; token masking, paraphrase for NLP). Often one of the most effective regularizers when data is scarce.
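The train/inference asymmetry of dropout is worth seeing in code. The sketch below assumes the "inverted dropout" formulation, in which surviving activations are scaled by 1/(1-p) during training so that nothing needs to change at inference; the function signature and the explicit `rng` argument are illustrative choices.

```python
import random

def dropout(activations: list[float], p: float, training: bool,
            rng: random.Random) -> list[float]:
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)        # inference: identity, no scaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10_000
train_out = dropout(acts, p=0.5, training=True, rng=rng)    # roughly half zeros, rest 2.0
eval_out = dropout(acts, p=0.5, training=False, rng=rng)    # unchanged
train_mean = sum(train_out) / len(train_out)                 # stays close to 1.0
```

Because of the rescaling, the expected activation is the same in both modes, which is why forgetting to switch modes produces noisy rather than obviously broken predictions.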
Code Examples
PyTorch — Custom MLP with Best Practices
TensorFlow / Keras
Glossary
| Term | Definition |
|---|---|
| Neuron / Node | A single computational unit: computes a weighted sum of inputs, adds a bias, applies an activation function. |
| Weight | A learnable scalar parameter that scales an input. Stored in matrices W for efficient computation. |
| Bias | A learnable offset added to the weighted sum. Allows neurons to activate even when all inputs are zero. |
| Activation | A non-linear function applied after the linear transformation. Introduces expressive power to the network. |
| Layer | A collection of neurons operating in parallel. Input, hidden, and output layers form the network's depth. |
| Parameters | All learnable values in a network — all weights and biases. The count determines model capacity. |
| Loss / Cost | A scalar measuring prediction error. The objective function minimized by gradient descent. |
| Gradient | The vector of partial derivatives of the loss w.r.t. each parameter. Points in the direction of steepest ascent. |
| Gradient descent | Iteratively move parameters opposite to the gradient to minimize loss. The fundamental optimization algorithm. |
| Mini-batch | A subset of training data. Stochastic gradient descent (SGD) uses batch_size=1; mini-batch uses 16–512. |
| Epoch | One complete pass over the entire training dataset (many mini-batch updates). |
| Learning rate | The step size multiplier for parameter updates. Controls how fast the model learns. |
| Overfitting | Model memorizes training data but fails to generalize. Low train loss, high val loss. |
| Underfitting | Model is too simple or undertrained. High train AND val loss. |
| Generalization | The ability to perform well on unseen data. The ultimate goal of training. |
| Inference | Running the trained model to make predictions on new data. No gradient computation needed. |
| Logit | The raw (pre-softmax/sigmoid) output of the final linear layer. |
| He initialization | Weight init scheme for ReLU networks: W ~ N(0, σ²) with variance σ² = 2/fan_in. Prevents vanishing/exploding activations. |
Cheat Sheet
- Learning rate: 3e-4 (Karpathy's constant for Adam)
- Batch size: 32 or 64 to start
- Weight decay: 1e-4 with AdamW
- Dropout: 0.1–0.3 hidden, 0.0 on output
- Gradient clipping: max_norm=1.0
- Overfit a single batch first — loss should reach ~0
- Check input normalization (zero mean, unit variance)
- Print gradient norms — should be ~1e-1 to 1e-3
- Confirm model.train() / model.eval() toggling
- Use torch.no_grad() for all eval/inference
- Use BatchNorm before activation in deep networks
- Skip connections (ResNet-style) for >10 layers
- Start with fewer parameters, scale up gradually
- He init for ReLU/GELU, Xavier for sigmoid/tanh
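- Binary classification: 1 neuron + sigmoid + BCE (nn.BCEWithLogitsLoss applies the sigmoid internally, so feed it raw logits)
- Multi-class: N neurons + softmax + CrossEntropy (nn.CrossEntropyLoss applies log-softmax internally, so feed it raw logits)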
- Regression: N neurons + no activation + MSE/Huber
- Multi-label: N neurons + sigmoid + BCE per class
The fundamental trade-off: Every decision in neural network design balances model capacity (can it represent the target function?) against generalization (will it work on unseen data?). More parameters = more capacity = more risk of overfitting. Regularization (dropout, weight decay, data augmentation, early stopping) is the toolkit for managing this trade-off.