Neural Networks Handbook

🧠

Introduction

What neural networks are and why they work

A neural network is a computational graph of interconnected nodes arranged in layers. Each layer transforms its inputs into a richer representation, and by stacking many such transformations, a network can learn highly complex functions from raw data — without hand-engineering features.

Universal Approximators

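A feedforward network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a compact domain to arbitrary accuracy (Universal Approximation Theorem). The theorem guarantees that such a network exists; it does not say how many neurons are needed or that training will find it.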
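As a rough illustration of this claim (not a proof), the sketch below fits sin(x) on [-π, π] with one hidden layer of tanh units. The hidden weights are random and fixed, and only the output weights are solved by least squares; that shortcut, the width of 50, and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a compact interval
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)   # (200, 1)
y = np.sin(x)                                        # (200, 1)

# One hidden layer with 50 tanh units; hidden weights are random and fixed
n_hidden = 50
W1 = rng.normal(0.0, 2.0, size=(1, n_hidden))
b1 = rng.normal(0.0, 2.0, size=n_hidden)
H = np.tanh(x @ W1 + b1)                             # (200, 50)

# Solve only the output layer by least squares (no gradient descent here)
H_aug = np.hstack([H, np.ones((len(x), 1))])         # extra column = output bias
w_out, *_ = np.linalg.lstsq(H_aug, y, rcond=None)

y_hat = H_aug @ w_out
max_err = float(np.max(np.abs(y_hat - y)))
print(f"max abs error with {n_hidden} hidden units: {max_err:.2e}")
```

Shrinking `n_hidden` to 2 or 3 should make the fit visibly worse, which is the "enough neurons" condition at work.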

Learned Representations

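Unlike classical ML pipelines built on hand-crafted features, networks learn their own feature extractors. In vision models, for example, early layers tend to detect edges while deeper layers respond to higher-level concepts. This reduces the need for manual feature engineering, though domain knowledge still helps with data preparation and architecture choice.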

Differentiable Programs

Every operation is differentiable (or, like ReLU, differentiable almost everywhere), so we can compute the gradient of a loss function with respect to every weight and update them automatically via backprop.

Anatomy of a Neural Network

📥 Input layer → 🔘 Hidden layer 1 → 🔘 Hidden layer 2 → … → 🔘 Hidden layer n → 📤 Output layer

ℹ

Depth vs. Width: Adding more layers (depth) lets a network compose features hierarchically, which often helps on complex tasks. Adding more neurons per layer (width) increases that layer's representational capacity. Modern practice generally favors deeper networks over a single very wide layer, although the best balance depends on the task and architecture.

⚡

Perceptrons

The atomic unit of every neural network

The perceptron, introduced by Rosenblatt in 1958, is the building block of all modern neural networks. It computes a weighted sum of its inputs, adds a bias, then passes the result through an activation function.

Single Neuron Output

y = σ(W · x + b)
= σ(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)

Inputs (x)

A vector of features from the previous layer (or raw data for the first layer). Each input xᵢ is a scalar value — a pixel intensity, a word embedding dimension, a measurement.

Weights (W) & Bias (b)

Learnable parameters. Weights scale each input's contribution; the bias shifts the activation threshold. These are what gradient descent optimizes during training.

Pre-activation (z)

The raw weighted sum before the activation: z = W · x + b. Also called the "net input"; at the output layer this value is usually called the "logit". Unbounded in range: it can be any real number.

Activation (σ)

A non-linear function applied to z. Non-linearity is essential — without it, stacking layers collapses to a single linear transformation. See the Activations section.
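To make the collapse concrete, here is a small NumPy sketch, with arbitrary shapes and random values chosen only for illustration. It checks that two stacked layers without an activation equal one linear layer with weights W₂W₁ and bias W₂b₁ + b₂, and that inserting a ReLU breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features

# Two stacked layers with NO activation in between
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
two_layers = (x @ W1.T + b1) @ W2.T + b2

# The same mapping expressed as a single linear layer
W_eq = W2 @ W1                       # (2, 3)
b_eq = W2 @ b1 + b2                  # (2,)
one_layer = x @ W_eq.T + b_eq

collapses = np.allclose(two_layers, one_layer)

# With a ReLU in between, the single-layer shortcut no longer matches
with_relu = np.maximum(0, x @ W1.T + b1) @ W2.T + b2
relu_differs = not np.allclose(with_relu, one_layer)
print(collapses, relu_differs)
```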

From Perceptron to Layer

In a fully-connected (dense) layer, every input is connected to every output neuron. This is computed efficiently as matrix multiplication:

Dense Layer (vectorized over N neurons)

Z = X · Wᵀ + b
A = σ(Z)
where X ∈ ℝ^(batch × features), W ∈ ℝ^(neurons × features), b ∈ ℝ^neurons

Python — NumPy

import numpy as np

# ── Single perceptron (manual) ──
def perceptron(x, w, b):
    z = np.dot(w, x) + b          # weighted sum + bias
    a = sigmoid(z)                # activation
    return a

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# ── Dense layer (vectorized over batch) ──
def dense_layer(X, W, b, activation):
    Z = X @ W.T + b              # (batch, features) @ (neurons, features)ᵀ
    return activation(Z)          # (batch, neurons)

# Example: batch of 4 samples, 3 features → 5 neurons
X = np.random.randn(4, 3)      # 4 samples, 3 features
W = np.random.randn(5, 3)      # 5 neurons, each with 3 weights
b = np.zeros(5)
output = dense_layer(X, W, b, sigmoid)
print(output.shape)              # (4, 5)

🔑

Why bias matters: The bias b shifts the activation function horizontally. Without it, the hyperplane defined by a neuron must always pass through the origin, severely limiting what patterns it can learn. Bias gives each neuron an independent threshold.

⚡

Activation Functions

Non-linearities that give networks their expressive power

Without non-linear activations, any deep network collapses to a single linear (affine) transformation and can only model linear relationships. Activation functions introduce the non-linearity that allows networks to learn arbitrarily complex mappings.

Recommended

ReLU

Range: [0, ∞)

f(x) = max(0, x)

Recommended

GELU

Range: (-∞, ∞)

x · Φ(x)

Deep nets

Leaky ReLU

Range: (-∞, ∞)

max(αx, x), α≈0.01

Output layer

Sigmoid

Range: (0, 1)

1 / (1 + e⁻ˣ)

Output layer

Softmax

Range: (0,1), sums to 1

eˣⁱ / Σ eˣʲ

Legacy

Tanh

Range: (-1, 1)

(eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

Choosing an Activation Function

Function	Use case	Problem to watch for
ReLU	Default choice for hidden layers in feedforward networks, CNNs	Dying ReLU — neurons stuck at 0 if input always negative
GELU	Transformers (GPT, BERT), modern architectures	Slower to compute than ReLU
Leaky ReLU	Deep networks where dying ReLU is a concern	α is a hyperparameter; or use PReLU to learn it
Sigmoid	Binary classification output only	Vanishing gradients in deep hidden layers
Softmax	Multi-class classification output (mutually exclusive)	Numerical instability — use log-softmax + NLLLoss
Tanh	RNNs, LSTMs (cell-state and output activations; the gates use sigmoid), some shallow networks	Vanishing gradients; mostly replaced by ReLU variants in feedforward networks

⚠

The vanishing gradient problem: Sigmoid and tanh saturate near their asymptotes — their gradients approach zero. In deep networks, multiplying near-zero gradients through many layers causes earlier layers to receive virtually no learning signal. This is why ReLU-family activations dominate hidden layers in modern architectures.
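A back-of-the-envelope sketch of the effect, under a deliberately generous assumption: every sigmoid sits at z = 0, where its derivative peaks at 0.25, and the weights are ignored (treated as 1). Even in this best case the product of derivatives shrinks geometrically with depth.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Best case for sigmoid: the derivative is at its maximum (0.25) in every layer.
# Weights are ignored here, which is an assumption made for simplicity.
for depth in (1, 5, 10, 20):
    sigmoid_factor = sigmoid_grad(0.0) ** depth   # product of `depth` derivatives
    relu_factor = 1.0 ** depth                    # ReLU derivative is 1 for z > 0
    print(f"depth {depth:2d}: sigmoid {sigmoid_factor:.2e}   relu {relu_factor:.0f}")
```

Saturated units (large |z|) have derivatives far below 0.25, so real networks can do worse than this bound suggests.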

Python — all activations

import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype('float32')    # 1 where z>0, else 0

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                   # σ(z)(1 - σ(z))

def tanh(z):
    return np.tanh(z)

def tanh_grad(z):
    return 1 - np.tanh(z)**2

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # stable
    return e / e.sum(axis=-1, keepdims=True)

def gelu(z):
    from scipy.special import erf
    return 0.5 * z * (1 + erf(z / np.sqrt(2)))

➡️

Forward Propagation

Passing data through the network to make a prediction

Forward propagation passes an input X through every layer of the network to produce a prediction ŷ. Each layer applies a linear transformation followed by an activation. The intermediate values (each Z^l and A^l) are stored so that backpropagation can reuse them; the record of operations that produced them is called the computation graph.

Layer-wise forward pass (L layers)

A⁰ = X    (input)
Z^l = A^(l-1) · W^lᵀ + b^l    (pre-activation)
A^l = σ^l(Z^l)    (post-activation)
ŷ = A^L    (output of final layer)

Python — 3-layer network forward pass

import numpy as np

# ── Initialize parameters ──
def init_params(layer_sizes):
    """layer_sizes: [input_dim, h1, h2, output_dim]"""
    params = {}
    for i in range(1, len(layer_sizes)):
        fan_in = layer_sizes[i-1]
        fan_out = layer_sizes[i]
        # He initialization — optimal for ReLU
        params[f'W{i}'] = np.random.randn(fan_out, fan_in) * np.sqrt(2 / fan_in)
        params[f'b{i}'] = np.zeros(fan_out)
    return params

# ── Forward pass ──
def forward(X, params, num_layers):
    cache = {'A0': X}     # cache all values for backprop
    A = X

    for l in range(1, num_layers + 1):
        W = params[f'W{l}']
        b = params[f'b{l}']
        Z = A @ W.T + b          # linear transformation

        if l < num_layers:
            A = np.maximum(0, Z) # ReLU for hidden layers
        else:
            A = softmax(Z)       # softmax for output (defined in the activations listing)

        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A

    return A, cache              # A = ŷ (predictions)

# ── Example ──
params = init_params([784, 256, 128, 10])  # MNIST-like
X_batch = np.random.randn(32, 784)           # 32 images
probs, cache = forward(X_batch, params, num_layers=3)
print(probs.shape)                             # (32, 10)

✅

Always cache intermediate values. Every Z^l and A^l computed during the forward pass must be stored — backpropagation needs them to compute gradients efficiently. This is the core memory/speed trade-off in neural network training.

🔁

Backpropagation

Computing gradients via the chain rule

Backpropagation ("backprop") is an efficient algorithm for computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus systematically, propagating error signals backwards from the output layer to the input layer.

The Chain Rule

The key insight: if L depends on A which depends on Z which depends on W, then:

Chain Rule applied to backprop

∂L/∂W = ∂L/∂A · ∂A/∂Z · ∂Z/∂W

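δ^l = (δ^(l+1) · W^(l+1)) ⊙ σ'(Z^l)    (error at layer l)
∂L/∂W^l = δ^lᵀ · A^(l-1)    (weight gradient)
∂L/∂b^l = sum(δ^l, axis=0)    (bias gradient)
where δ^l = ∂L/∂Z^l ∈ ℝ^(batch × neurons_l), using the same row-per-sample convention as the forward pass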

The Four Steps of Backprop

📊 Step 1: Compute loss → 📤 Step 2: Output gradient → 🔙 Step 3: Back through layers → 🔧 Step 4: Update weights

Python — manual backprop

def cross_entropy_loss(probs, y_true):
    """y_true: one-hot encoded (batch, classes)"""
    m = y_true.shape[0]
    return -np.sum(y_true * np.log(probs + 1e-9)) / m

def backward(probs, y_true, cache, params, num_layers):
    m = y_true.shape[0]
    grads = {}

    # ── Step 1: gradient at output layer (softmax + cross-entropy) ──
    # Combined gradient of cross-entropy + softmax = (ŷ - y) / m
    dZ = (probs - y_true) / m            # (batch, classes)

    for l in reversed(range(1, num_layers + 1)):
        A_prev = cache[f'A{l-1}']

        # ── Gradients w.r.t. weights and biases ──
        grads[f'dW{l}'] = dZ.T @ A_prev           # (neurons_l, neurons_{l-1})
        grads[f'db{l}'] = dZ.sum(axis=0)          # (neurons_l,)

        if l > 1:
            # ── Propagate error to previous layer ──
            dA_prev = dZ @ params[f'W{l}']         # chain rule step
            Z_prev = cache[f'Z{l-1}']
            dZ = dA_prev * (Z_prev > 0)            # ⊙ ReLU derivative

    return grads

# ── Weight update (SGD) ──
def update_params(params, grads, lr=0.01):
    for key in params:
        params[key] -= lr * grads[f'd{key}']
    return params
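One common way to sanity-check a manual backward pass like the one above is a finite-difference gradient check. The sketch below does this for a single softmax layer rather than the full network, to stay short; the function names, shapes, and the step size `eps` are assumptions for this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def loss_fn(W, b, X, Y):
    """Mean cross-entropy of a single softmax layer; Y is one-hot."""
    probs = softmax(X @ W.T + b)
    return -np.sum(Y * np.log(probs + 1e-12)) / X.shape[0]

def analytic_grad_W(W, b, X, Y):
    """Backprop result: dZ = (probs - Y) / m, then dW = dZ^T @ X."""
    probs = softmax(X @ W.T + b)
    dZ = (probs - Y) / X.shape[0]
    return dZ.T @ X

def numeric_grad_W(W, b, X, Y, eps=1e-5):
    """Central finite differences, one weight at a time (slow; for tests only)."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += eps
        W_minus[idx] -= eps
        grad[idx] = (loss_fn(W_plus, b, X, Y) - loss_fn(W_minus, b, X, Y)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))                   # 8 samples, 4 features
Y = np.eye(3)[rng.integers(0, 3, size=8)]     # one-hot labels, 3 classes
W = rng.normal(size=(3, 4)) * 0.1
b = np.zeros(3)

g_analytic = analytic_grad_W(W, b, X, Y)
g_numeric = numeric_grad_W(W, b, X, Y)
rel_err = np.linalg.norm(g_analytic - g_numeric) / (
    np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric))
print(f"relative error: {rel_err:.2e}")
```

A relative error many orders of magnitude below 1 suggests the analytic gradient is right; an error near 1 usually points to a sign, transpose, or scaling bug.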

ℹ

Autograd in practice: In PyTorch, calling loss.backward() automatically runs the entire backward pass via dynamic autograd. You never implement backprop manually in production — but understanding it is essential for debugging, designing custom layers, and understanding why certain architectures work.

Computational Graph & Autograd

PyTorch — autograd

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

criterion = nn.CrossEntropyLoss()       # applies log-softmax + NLL to raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(32, 784)               # batch of 32
y = torch.randint(0, 10, (32,))          # class labels

# One training step
optimizer.zero_grad()                    # ① clear accumulated gradients
logits = model(X)                        # ② forward pass (builds graph)
loss = criterion(logits, y)              # ③ compute loss
loss.backward()                           # ④ backward pass (backprop)
optimizer.step()                          # ⑤ update weights

print(f'Loss: {loss.item():.4f}')

🔄

Epochs, Batches & Learning Rate

The three key controls of the training loop

Epoch

One complete pass through the entire training dataset. Training typically runs for multiple epochs until the model converges. Early stopping halts when validation loss stops improving.

Batch (Mini-batch)

A subset of the training data processed together in one forward/backward pass. Batch size controls the trade-off between gradient noise (small) and memory usage + speed (large).

Learning Rate

How much to scale each gradient update: W ← W - lr · ∇W. Often the single most important hyperparameter. Too high: training diverges. Too low: training is extremely slow.
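The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w², chosen purely for illustration, with three fixed learning rates: one too low, one reasonable, one too high.

```python
def gradient_descent(lr, w0=1.0, steps=50):
    """Minimize f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w        # W <- W - lr * grad
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.3e}")
```

Each step multiplies w by (1 - 2·lr), so lr = 0.001 barely moves, lr = 0.1 converges quickly, and lr = 1.1 overshoots further on every step.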

Relationships & Terminology

Term	Definition	Typical value
Epoch	One full pass over all training samples	10 – 300+ (until convergence)
Batch size	Samples per gradient update step	32, 64, 128, 256
Iteration / Step	One forward + backward + update cycle	ceil(N / batch_size) per epoch
Learning rate	Step size multiplier for gradient updates	1e-4 to 1e-2 (model dependent)
LR schedule	Policy for changing LR during training	Cosine decay, warmup + decay
Early stopping	Stop when val loss hasn't improved in N epochs	patience = 5–20 epochs
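The relationship between samples, batch size, and steps in the table can be computed directly. The dataset size, batch size, and epoch count below are hypothetical values chosen for the example, and `steps_per_epoch` is an illustrative helper, not a library function.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of gradient updates in one epoch."""
    if drop_last:                              # discard the final partial batch
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

num_samples, batch_size, epochs = 50_000, 128, 30   # hypothetical values
per_epoch = steps_per_epoch(num_samples, batch_size)
total_steps = per_epoch * epochs
print(per_epoch, total_steps)
```

The total step count is what per-step schedulers need to know in advance.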

Batch Size Trade-offs

Small Batch (8–64)

Noisy gradients act as regularization
Often generalizes better to unseen data
Slower wall-clock time per epoch
Lower memory requirement
Usually needs a lower learning rate than large batches

Large Batch (512–8192)

Lower-variance gradient estimates → more stable updates
Faster wall-clock time with parallel hardware
May require learning rate scaling (linear scaling rule)
May generalize worse (tendency to converge to sharp minima)
Standard for large language models with gradient accumulation
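Gradient accumulation, mentioned in the last item, simulates a large batch by summing gradients over several smaller micro-batches before a single update. The NumPy sketch below checks the equivalence for a linear model with mean squared error; the model, sizes, and names are assumptions for this example, and the equivalence as written relies on equal-sized micro-batches.

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean squared error for a linear model y_hat = X @ w."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = rng.normal(size=64)
w = rng.normal(size=5)

# Gradient computed on the full batch of 64 in one pass
full_grad = mse_grad(w, X, y)

# Same batch split into 4 micro-batches of 16; gradients are accumulated
# and divided by the number of micro-batches before the single update
accum_steps = 4
accum_grad = np.zeros_like(w)
for X_mb, y_mb in zip(np.split(X, accum_steps), np.split(y, accum_steps)):
    accum_grad += mse_grad(w, X_mb, y_mb) / accum_steps

print(np.allclose(full_grad, accum_grad))
```

Layers that depend on batch statistics, such as batch normalization, do not satisfy this equivalence exactly.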

Learning Rate Schedules

PyTorch — LR schedules

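import torch
from torch.optim import lr_scheduler

# Assumes `model`, `train_loader`, `num_epochs`, `train_one_epoch` and
# `evaluate` are defined elsewhere. Pick ONE scheduler; each assignment
# below replaces the previous one.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)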

# ── Step decay: multiply LR by γ every N epochs ──
scheduler = lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# ── Cosine annealing: smoothly decays to near-0 ──
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

# ── Warm-up + cosine decay (preferred for transformers) ──
scheduler = lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=1e-2,
    steps_per_epoch=len(train_loader),
    epochs=30,
    pct_start=0.3,        # 30% warmup
)

# ── Reduce on plateau: detect stagnation ──
scheduler = lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=5
)

# ── Call once per epoch (OneCycleLR instead steps after every batch) ──
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    scheduler.step(val_loss)      # ReduceLROnPlateau only: needs the metric
    # scheduler.step()             # StepLR / CosineAnnealingLR

The Full Training Loop

PyTorch — complete training loop

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model, train_loader, val_loader, epochs=30, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    best_val_loss = float('inf')
    patience_count = 0
    patience = 5

    for epoch in range(epochs):

        # ── Training phase ──
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()          # clear old gradients
            logits = model(X_batch)          # forward pass
            loss = criterion(logits, y_batch) # compute loss
            loss.backward()                   # backward pass
            torch.nn.utils.clip_grad_norm_(  # gradient clipping
                model.parameters(), max_norm=1.0)
            optimizer.step()                  # weight update
            train_loss += loss.item()

        # ── Validation phase ──
        model.eval()
        val_loss = 0
        with torch.no_grad():               # disable autograd for eval
            for X_val, y_val in val_loader:
                logits = model(X_val)
                val_loss += criterion(logits, y_val).item()

        avg_train = train_loss / len(train_loader)
        avg_val   = val_loss   / len(val_loader)
        print(f"Epoch {epoch+1:3d} | train: {avg_train:.4f} | val: {avg_val:.4f}")

        # ── Early stopping ──
        if avg_val < best_val_loss:
            best_val_loss = avg_val
            patience_count = 0
            torch.save(model.state_dict(), 'best_model.pth')
        else:
            patience_count += 1
            if patience_count >= patience:
                print(f"Early stopping at epoch {epoch+1}")
                break

⚙️

Optimizers

How gradient information is turned into weight updates

SGD (Stochastic Gradient Descent)

The baseline. Update rule: W ← W - lr · ∇W. With momentum, it accumulates a velocity vector along directions where gradients are consistent. Works well on convex problems and, with careful tuning, often generalizes well in deep learning.
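A minimal sketch of the momentum update, assuming the formulation v ← μ·v + ∇W followed by W ← W - lr·v (other references fold the learning rate into the velocity instead). The toy objective and the function name are illustrative.

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One update: v <- momentum * v + grad;  w <- w - lr * v."""
    velocity = momentum * velocity + grad
    w = w - lr * velocity
    return w, velocity

# Minimize f(w) = 0.5 * ||w||^2, whose gradient is simply w
w = np.array([1.0, -2.0])
velocity = np.zeros_like(w)

first_w, first_v = sgd_momentum_step(w, grad=w, velocity=velocity, lr=0.1)

for _ in range(300):
    w, velocity = sgd_momentum_step(w, grad=w, velocity=velocity, lr=0.1)
print(np.linalg.norm(w))
```

With momentum set to 0 this reduces to the plain SGD rule.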

Adam

Adaptive learning rates per parameter, using estimates of the first moment (mean) and second moment (uncentered variance) of the gradients. Excellent default for most tasks. Variant: AdamW (decoupled weight decay). Related adaptive methods: AdaGrad, RMSProp.

Adam update rule

m_t = β₁·m_(t-1) + (1-β₁)·∇W    (1st moment, mean)
v_t = β₂·v_(t-1) + (1-β₂)·∇W²    (2nd moment, uncentered variance; ∇W² is elementwise)
m̂_t = m_t / (1 - β₁ᵗ)    (bias correction)
v̂_t = v_t / (1 - β₂ᵗ)
W ← W - lr · m̂_t / (√v̂_t + ε)    (β₁=0.9, β₂=0.999, ε=1e-8)
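The update rule above translates almost line for line into NumPy. This is a sketch for a single parameter vector, not a replacement for a library optimizer; the function name and example values are assumptions.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # 1st moment estimate
    v = beta2 * v + (1 - beta2) * grad**2         # 2nd moment estimate (elementwise)
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0, 3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.5, -10.0, 1e-3])              # very different magnitudes

w_new, m, v = adam_step(w, grad, m, v, t=1)
print(w - w_new)
```

On the first step every parameter moves by roughly lr in the direction opposite its gradient, regardless of the gradient's magnitude. That is the per-parameter scaling at work.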

📉

Loss Functions

Measuring how wrong the network's predictions are

Loss	Task	Formula	PyTorch
Cross-Entropy	Multi-class classification	-Σ yᵢ·log(ŷᵢ)	`nn.CrossEntropyLoss()`
Binary Cross-Entropy	Binary classification	-y·log(ŷ) - (1-y)·log(1-ŷ)	`nn.BCEWithLogitsLoss()`
MSE	Regression	1/n · Σ(y - ŷ)²	`nn.MSELoss()`
MAE / L1	Regression (robust to outliers)	1/n · Σ|y - ŷ|	`nn.L1Loss()`
Huber	Regression (combines MSE+MAE)	½·e² if |e| ≤ δ, else δ·|e| - δ²/2	`nn.HuberLoss()`
KL Divergence	Distribution matching (VAEs)	Σ P·log(P/Q)	`nn.KLDivLoss()`
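To show how the regression losses in the table respond to an outlier, here is a small NumPy sketch of the Huber loss next to MSE and MAE. The data are made up for illustration, and `huber_loss` is an illustrative helper following the piecewise definition (½e² for |e| ≤ δ, δ·|e| - δ²/2 otherwise).

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |e| <= delta, linear beyond it."""
    e = np.abs(y_true - y_pred)
    quadratic = 0.5 * e**2
    linear = delta * (e - 0.5 * delta)
    return float(np.mean(np.where(e <= delta, quadratic, linear)))

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 10.0])          # last prediction is an outlier

mse = float(np.mean((y_true - y_pred) ** 2))
mae = float(np.mean(np.abs(y_true - y_pred)))
huber = huber_loss(y_true, y_pred)
print(f"MSE={mse:.3f}  MAE={mae:.3f}  Huber={huber:.3f}")
```

The single outlier dominates MSE because its error is squared, while Huber grows only linearly past δ.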

🛡️

Regularization

Preventing overfitting in neural networks

Dropout

Randomly zeros out neurons during training with probability p (typically 0.2–0.5). Forces the network to learn redundant representations. Disabled at inference time.
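A minimal NumPy sketch of this behaviour, assuming the "inverted dropout" variant, which rescales the surviving activations by 1/(1-p) during training so that nothing needs to change at inference. The function name and test values are illustrative.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x                                  # identity at inference time
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= p          # True with probability 1 - p
    return x * keep_mask / (1.0 - p)              # rescale so E[output] = x

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, p=0.3, training=True, rng=rng)
eval_out = dropout(x, p=0.3, training=False)

print(f"fraction zeroed: {np.mean(train_out == 0):.3f}")
print(f"mean activation (train): {train_out.mean():.3f}")
```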

Batch Normalization

Normalizes layer inputs to zero mean and unit variance per mini-batch. Reduces sensitivity to initialization, enables higher learning rates, acts as a regularizer.
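The training-mode computation can be sketched in a few lines of NumPy. This version omits the running statistics that real implementations keep for inference; the learnable scale `gamma` and shift `beta` and the `eps` value follow common convention but are assumptions here.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm over the batch axis of a (batch, features) array."""
    mean = x.mean(axis=0)                         # per-feature batch mean
    var = x.var(axis=0)                           # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)       # zero mean, unit variance
    return gamma * x_hat + beta                   # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # badly scaled activations
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```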

Weight Decay (L2)

Adds a penalty term λ·||W||² to the loss that encourages small weights. AdamW applies weight decay correctly (decoupled from the adaptive LR). Use weight_decay=1e-4 as a starting point.

Data Augmentation

Artificially expands the training set by applying random transforms (flips, crops, rotations, color jitter for vision; token masking, paraphrase for NLP). Often one of the most effective regularizers when data is scarce.

💻

Code Examples

End-to-end patterns in PyTorch and TensorFlow/Keras

PyTorch — Custom MLP with Best Practices

PyTorch

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim, dropout=0.3):
        super().__init__()
        layers = []
        prev_dim = input_dim

        for h in hidden_dims:
            layers += [
                nn.Linear(prev_dim, h),
                nn.BatchNorm1d(h),
                nn.GELU(),
                nn.Dropout(dropout),
            ]
            prev_dim = h

        layers.append(nn.Linear(prev_dim, output_dim))
        self.net = nn.Sequential(*layers)

        # He initialization for all linear layers
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                nn.init.zeros_(m.bias)

    def forward(self, x):
        return self.net(x)

# ── Instantiate ──
model = MLP(
    input_dim=784,
    hidden_dims=[512, 256, 128],
    output_dim=10,
    dropout=0.3,
)
print(sum(p.numel() for p in model.parameters()),
      'trainable parameters')

TensorFlow / Keras

TensorFlow 2 / Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim, hidden_dims, output_dim, dropout=0.3):
    inputs = keras.Input(shape=(input_dim,))
    x = inputs
    for h in hidden_dims:
        x = layers.Dense(h, kernel_initializer='he_normal')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('gelu')(x)
        x = layers.Dropout(dropout)(x)
    outputs = layers.Dense(output_dim)(x)
    return keras.Model(inputs, outputs)

model = build_mlp(784, [512, 256, 128], 10)
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3, weight_decay=1e-4),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
model.fit(
    x_train, y_train,
    epochs=30,
    batch_size=128,
    validation_split=0.15,
    callbacks=[
        keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
        keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),
    ]
)

📖

Glossary

Key terms and definitions

Term	Definition
Neuron / Node	A single computational unit: computes a weighted sum of inputs, adds a bias, applies an activation function.
Weight	A learnable scalar parameter that scales an input. Stored in matrices W for efficient computation.
Bias	A learnable offset added to the weighted sum. Allows neurons to activate even when all inputs are zero.
Activation	A non-linear function applied after the linear transformation. Introduces expressive power to the network.
Layer	A collection of neurons operating in parallel. Input, hidden, and output layers form the network's depth.
Parameters	All learnable values in a network — all weights and biases. The count determines model capacity.
Loss / Cost	A scalar measuring prediction error. The objective function minimized by gradient descent.
Gradient	The vector of partial derivatives of the loss w.r.t. each parameter. Points in the direction of steepest ascent.
Gradient descent	Iteratively move parameters opposite to the gradient to minimize loss. The fundamental optimization algorithm.
Mini-batch	A subset of training data. Strictly, stochastic gradient descent (SGD) uses batch_size=1 and mini-batch uses 16–512; in practice "SGD" usually refers to the mini-batch version.
Epoch	One complete pass over the entire training dataset (many mini-batch updates).
Learning rate	The step size multiplier for parameter updates. Controls how fast the model learns.
Overfitting	Model memorizes training data but fails to generalize. Low train loss, high val loss.
Underfitting	Model is too simple or undertrained. High train AND val loss.
Generalization	The ability to perform well on unseen data. The ultimate goal of training.
Inference	Running the trained model to make predictions on new data. No gradient computation needed.
Logit	The raw (pre-softmax/sigmoid) output of the final linear layer.
He initialization	Weight init scheme for ReLU networks: W ~ N(0, 2/fan_in), i.e. standard deviation √(2/fan_in). Prevents vanishing/exploding activations.

🗒️

Cheat Sheet

Quick-reference one-liners for daily neural network work

Key hyperparameter starting points

Learning rate: 3e-4 (Karpathy's constant for Adam)
Batch size: 32 or 64 to start
Weight decay: 1e-4 with AdamW
Dropout: 0.1–0.3 hidden, 0.0 on output
Gradient clipping: max_norm=1.0

Debugging checklist

Overfit a single batch first — loss should reach ~0
Check input normalization (zero mean, unit variance)
Print gradient norms: roughly 1e-3 to 1e-1 is a reasonable range; values near zero or growing without bound signal trouble
Confirm model.train() / model.eval() toggling
Use torch.no_grad() for all eval/inference

Architecture rules of thumb

Use BatchNorm before activation in deep networks
Skip connections (ResNet-style) for >10 layers
Start with fewer parameters, scale up gradually
He init for ReLU/GELU, Xavier for sigmoid/tanh

Output layer guide

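Binary classification: 1 neuron + sigmoid + BCE (with nn.BCEWithLogitsLoss, output raw logits; the sigmoid is built in)
Multi-class: N neurons + softmax + CrossEntropy (with nn.CrossEntropyLoss, output raw logits; the softmax is built in)
Regression: N neurons + no activation + MSE/Huber
Multi-label: N neurons + sigmoid + BCE per class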

🔑

The fundamental trade-off: Every decision in neural network design balances model capacity (can it represent the target function?) against generalization (will it work on unseen data?). More parameters = more capacity = more risk of overfitting. Regularization (dropout, weight decay, data augmentation, early stopping) is the toolkit for managing this trade-off.