ML Fundamentals Handbook
From first principles to production — every concept explained with the why, not just the what. Built for practitioners who want intuition, not just definitions.
Supervised · Unsupervised · Deep Learning · Optimisation · Statistics
01 What Is Machine Learning?
Traditional programming gives a computer explicit rules to produce outputs from inputs. Machine learning inverts this: you give the computer inputs and desired outputs, and it figures out the rules itself. The "learning" is the process of finding those rules — usually expressed as the parameters of a mathematical model.
Arthur Samuel's 1959 definition remains the clearest: "A field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell's 1997 formalism makes it precise: a program learns from experience E with respect to task T and performance measure P, if its performance on T improves with E.
💡 The key insight: ML is useful when the rules are too complex, too numerous, or simply unknown — spam filters, image recognition, stock prediction. If you can write down the rules explicitly, you probably don't need ML.
Data (examples) + Algorithm + Training → Model → Predictions
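That flow — data plus an algorithm, trained into a model that makes predictions — can be sketched in a few lines. This is an illustrative toy, assuming scikit-learn is available; the dataset (study/sleep hours predicting a pass) is invented for the example.

```python
# Data (examples) + algorithm + training -> model -> predictions,
# sketched with scikit-learn. The toy data is illustrative only.
from sklearn.linear_model import LogisticRegression

# Inputs (hours studied, hours slept) and desired outputs (passed exam?)
X = [[2, 9], [1, 5], [5, 1], [8, 8], [3, 4], [9, 7]]
y = [1, 0, 0, 1, 0, 1]

model = LogisticRegression()    # the algorithm
model.fit(X, y)                 # training: find the rules (parameters)
print(model.predict([[6, 6]]))  # prediction on a new, unseen input
```

Note that we never wrote a rule like "pass if hours studied > 5"; the model found its own rule as learned parameter values.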
02 Types of Learning
Supervised Learning
Train on labelled examples — each input has a known correct output. The model learns a mapping from inputs to outputs by minimising prediction error on training data. This is the most common type in practice. Classification (discrete outputs: spam/not-spam) and regression (continuous outputs: house price) are both supervised.
You need labels, which are expensive to acquire. The model can only generalise to patterns it has seen a version of in training.
Unsupervised Learning
No labels — find hidden structure in raw data. The model must discover patterns, groupings, or representations without being told what to look for. Used for clustering (group customers by behaviour), dimensionality reduction (compress data while preserving structure), and density estimation (model the data distribution).
Harder to evaluate because there's no ground truth. "Is this clustering good?" depends on what you want to do with it.
Semi-Supervised Learning
Uses a small amount of labelled data and a large amount of unlabelled data. The unlabelled data provides structural information about the input space; the labels provide signal about which structures matter. Works well when labelling is expensive but data collection is cheap — medical imaging is a classic case.
The underlying assumption is that data points near each other in input space are likely to share the same label (smoothness assumption).
Reinforcement Learning (RL)
An agent takes actions in an environment, receives rewards or penalties, and learns to maximise cumulative reward over time. Unlike supervised learning, there's no correct answer provided — the agent must discover it through trial and error. AlphaGo, game-playing AIs, and robot control are RL problems.
The key challenge is the credit assignment problem: which of the many actions in a sequence actually caused the final reward to be high or low?
Self-Supervised Learning
A form of unsupervised learning where labels are generated from the data itself. Predict the next word in a sentence, predict a masked image patch, predict if two views of the same image came from the same source. The task is artificial, but the representations learned are rich and transfer to real tasks. This is how GPT, BERT, and most modern LLMs are pretrained.
Self-supervision scales: you can train on the entire internet without human annotation. This is why LLMs are so powerful.
03 Core Terminology
Model
A mathematical function that maps inputs to outputs. Defined by its architecture (the structure of the function — e.g., a decision tree, a neural network with 12 layers) and its parameters (the numbers that get adjusted during training — weights, biases).
The architecture is chosen by the designer; the parameters are learned from data. A model with the wrong architecture can't learn even with infinite data.
Parameters · aka weights, coefficients
The numbers inside the model that are adjusted during training. In a linear model y = wx + b, w and b are parameters. A large modern neural network can have billions of parameters. The entire act of training is finding good parameter values.
More parameters = more expressive power, but also more data needed and more risk of overfitting.
Hyperparameters
Settings that control the training process itself — not learned from data, but set by the practitioner before training. Examples: learning rate, number of layers, number of trees in a forest, batch size, regularisation strength. Getting hyperparameters wrong is one of the most common causes of poor performance.
Hyperparameters sit "above" (hyper) the model parameters. You tune them using a validation set, not the test set — otherwise you're cheating.
Training, Validation & Test Sets
The dataset is split into three parts. Training set: used to fit model parameters. Validation set: used to tune hyperparameters and select the best model — the model never learns from this directly. Test set: used once, at the very end, to estimate real-world performance. If you evaluate on the test set multiple times and pick the best result, you have leaked information.
A common ratio is 70/15/15 or 80/10/10. For very large datasets, even 1% for validation may be thousands of examples.
Inference
Using a trained model to make predictions on new, unseen data. During inference, parameters are fixed — no learning happens. This is what happens when an ML system runs in production: image classifiers, recommendation engines, and LLMs all run inference.
Training is expensive (days/weeks, large compute). Inference must be fast (milliseconds, at scale). This tension shapes system design decisions.
Generalisation
A model generalises when it performs well on data it has never seen before — not just the training set. This is the whole point of ML. A model that only memorises training examples is useless. Good generalisation means the model has learned the underlying pattern, not the noise.
The gap between training performance and generalisation performance tells you almost everything about what's wrong with a model.
Gradient
A vector that points in the direction of steepest increase of a function. In ML, we compute the gradient of the loss function with respect to each parameter. Moving parameters in the opposite direction (gradient descent) reduces the loss. Gradients are calculated using backpropagation in neural networks.
Intuition: imagine you're blindfolded on a hilly terrain trying to reach the lowest point. The gradient tells you which direction is "uphill" so you can step downhill.
04 Data Fundamentals
Feature · aka input variable, predictor, attribute
An individual measurable property of an observation. Height, age, pixel value, word frequency — these are features. A dataset with n observations and p features has shape (n, p). The model learns how features relate to the target output.
Label · aka target, output variable, y
The ground-truth answer the model is trained to predict. In supervised learning, every training example has a label. In binary classification, labels are 0 or 1. In regression, they're continuous numbers. In multi-class classification, they're one of K categories.
Data Imbalance
When one class is far more common than others in a classification problem. Fraud detection: 99.9% legitimate transactions, 0.1% fraud. A model that predicts "legitimate" for everything achieves 99.9% accuracy but is completely useless. Solutions: oversample the minority class (SMOTE), undersample the majority, use class-weighted loss, or evaluate with precision/recall/F1 rather than accuracy.
Accuracy is a misleading metric on imbalanced datasets. Always check the class distribution before choosing your metric.
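The fraud example above can be made concrete in pure Python — a model that always predicts "legitimate" scores 99.9% accuracy and 0% recall. The numbers are illustrative.

```python
# Why accuracy misleads on imbalanced data: a "model" that always
# predicts the majority class. Illustrative numbers.
y_true = [0] * 999 + [1] * 1   # 99.9% legitimate, 0.1% fraud
y_pred = [0] * 1000            # predict "legitimate" for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)        # recall for the fraud class

print(accuracy)  # 0.999: looks great
print(recall)    # 0.0: catches no fraud at all
```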
Data Leakage
When information from outside the training data boundary inadvertently enters the model's training. Examples: including future data in features, preprocessing the entire dataset (including test) before splitting, normalising using statistics computed on the full dataset. Leakage causes misleadingly optimistic performance that evaporates in production.
Leakage is often subtle. "My model got 99% accuracy in testing but failed completely in production" is almost always leakage.
| Data Type | Description | ML Treatment |
| --- | --- | --- |
| Continuous / Numerical | Real-valued: age, temperature, price | Normalise or standardise; works directly with most algorithms |
| Categorical / Nominal | Unordered categories: colour, city, species | One-hot encoding; embedding (for high cardinality) |
| Ordinal | Ordered categories: small/medium/large, rating 1–5 | Label encoding preserving order; or treat as continuous |
| Binary | Two values: yes/no, true/false | 0/1 encoding; works as-is in most algorithms |
| Text | Strings: reviews, documents, code | Tokenise → embed; TF-IDF; language model features |
| Image | Pixel arrays | Convolutional networks; normalise pixels to [0,1] |
| Time Series | Sequences with temporal order | Respect temporal split; lag features; RNN/Transformer |
05 Feature Engineering
The process of using domain knowledge to transform raw data into features that better represent the underlying problem to the model. Feature engineering is often the highest-leverage activity in applied ML — better features beat better algorithms most of the time.
One-Hot Encoding
Convert a categorical variable with K categories into K binary columns. "City: London/Paris/Tokyo" becomes three columns: is_London, is_Paris, is_Tokyo. This prevents the model from incorrectly treating categories as ordered (treating "Tokyo" as 3× "London" just because it's category 3).
Curse: if a category has 10,000 values (postcodes), you get 10,000 columns. Use embeddings or target encoding for high-cardinality categoricals.
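A minimal sketch of one-hot encoding with pandas (`get_dummies`); scikit-learn's `OneHotEncoder` is the other common route. The city column is the example from the text.

```python
# One-hot encode a 3-category feature: one binary column per category.
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "Tokyo", "London"]})
encoded = pd.get_dummies(df, columns=["city"])

print(encoded.columns.tolist())
# ['city_London', 'city_Paris', 'city_Tokyo']
```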
Feature Scaling
Standardisation (Z-score): subtract mean, divide by std → zero mean, unit variance. Min-max normalisation: scale to [0,1]. Many algorithms (SVMs, neural nets, kNN) are sensitive to feature scale — a feature ranging 0–1,000,000 will dominate a feature ranging 0–1. Decision trees and random forests are scale-invariant.
Always fit the scaler on training data only, then apply to validation and test. If you compute mean/std on the full dataset, you've leaked test statistics into training.
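The fit-on-train-only discipline looks like this with scikit-learn's `StandardScaler` (toy numbers for illustration):

```python
# Learn mean/std from the training split only, then reuse those exact
# statistics on held-out data. This is what prevents leakage.
from sklearn.preprocessing import StandardScaler

X_train = [[1.0], [2.0], [3.0], [4.0]]
X_test = [[10.0]]   # illustrative held-out point

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train
X_test_scaled = scaler.transform(X_test)        # apply the SAME stats

print(scaler.mean_)    # [2.5], computed from training data only
print(X_test_scaled)   # 10 maps far outside the training range
```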
Embeddings
Dense vector representations of discrete objects (words, users, products). A word embedding maps "king" to a 300-dimensional vector where similar words are nearby in vector space. Learned jointly with the model or pre-trained separately. Key property: algebraic meaning — king − man + woman ≈ queen.
Embeddings are how ML handles high-cardinality categorical data and non-numeric data (text, graphs) without exploding dimensionality.
06 Preprocessing
Missing Values
Deletion: drop rows or columns with missing data (loses information; okay if <1% missing). Mean/median imputation: fill with column mean/median (distorts variance; simple). Model imputation: predict missing values using other features (KNN imputation, iterative imputation). Indicator flag: add a binary "was_missing" column (tells model that missingness itself carries signal).
Missing data is often non-random (MNAR: Missing Not At Random). A patient refusing to answer a health question may itself be informative.
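Two of the strategies above — median imputation plus an indicator flag — can be combined in one step with scikit-learn's `SimpleImputer`; the tiny array is illustrative.

```python
# Median imputation with a "was_missing" indicator column appended.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# column 0: NaN replaced by the median of the observed values (2.0)
# column 1: 1.0 where the value was missing, else 0.0
```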
Outliers
Extreme values far from the typical range. Can be genuine (billionaire in income data) or errors (sensor malfunction). Detection: Z-score >3, IQR method, isolation forests. Treatment: clip (Winsorisation), log-transform skewed distributions, remove if certain errors, or use robust algorithms (median-based, tree models).
Outliers in the target variable (y) are especially dangerous for regression — one extreme point can completely skew the learned line.
07 Bias–Variance Tradeoff
The most fundamental tension in machine learning. Total expected error = Bias² + Variance + Irreducible Noise.
Bias
Error from incorrect assumptions about the data. A high-bias model is too simple — it can't capture the real pattern. Underfitting: poor performance on both training and test data. Example: fitting a straight line to data that's fundamentally curved. The model makes systematic, predictable errors regardless of how much data you give it.
High bias models fail not because they've learned the wrong thing, but because they can't learn enough. More data doesn't help — you need a more expressive model.
Variance
Error from sensitivity to small fluctuations in training data. A high-variance model is too complex — it memorises training data including noise. Overfitting: excellent performance on training, poor on test. If you retrained on a slightly different sample, you'd get a very different model. Deep neural networks without regularisation are notorious for high variance.
High variance models fail because they've learned the training data, not the underlying pattern. More data usually helps. Regularisation is the other fix.
⚖ The tradeoff: Reducing bias usually increases variance (use a more complex model → it can fit more patterns but is more sensitive to noise). Reducing variance usually increases bias (constrain/simplify the model → more robust but less expressive). The art of ML is finding the sweet spot for your dataset size and problem complexity.
08 Loss Functions
The loss (or cost) function measures how wrong the model's predictions are. Training = minimising the loss. Your choice of loss function shapes what the model optimises for, which directly affects behaviour.
Mean Squared Error (MSE) · regression
Average of squared prediction errors. Squaring has two effects: errors are always positive (so negative and positive errors don't cancel), and large errors are penalised disproportionately (a 10× larger error becomes 100× larger in the loss). This sensitivity to outliers can be a feature or a bug depending on the problem.
If you care more about large errors than small ones, MSE is natural. If outliers shouldn't dominate, use MAE (Mean Absolute Error) instead.
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Cross-Entropy Loss · classification, aka log loss
Measures the difference between two probability distributions — the model's predicted probabilities and the true labels. When the model is confident and correct, loss is near zero. When the model is confident and wrong, loss is very large. This asymmetry is what makes cross-entropy work so well: it severely punishes confident wrong predictions.
Binary cross-entropy is used for two classes; categorical cross-entropy for multiple classes. If your model outputs a raw score (not a probability), you need a softmax layer first.
CE = −Σ yᵢ · log(ŷᵢ)
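Both formulas are short enough to compute directly — a numpy sketch with illustrative numbers, showing MSE's quadratic penalty and cross-entropy's punishment of confident wrong predictions:

```python
import numpy as np

# Regression: MSE penalises the one large error quadratically.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 10.0])
mse = np.mean((y_true - y_pred) ** 2)   # (0.25 + 0 + 9) / 3
print(mse)

# Classification: per-example cross-entropy is -log(p of the true class).
print(-np.log(0.99))  # confident and correct: loss near zero
print(-np.log(0.01))  # confident and wrong: loss ≈ 4.6, huge
```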
Hinge Loss · SVMs, classification
Used in Support Vector Machines. Zero loss when the prediction is correct with sufficient margin (confidence). Positive loss when the prediction is wrong or correct but too close to the boundary. This encourages a decision boundary that sits far from any training examples — the maximum-margin classifier.
09 Optimisation
Gradient Descent
The workhorse of ML. Repeatedly move parameters in the direction of the negative gradient of the loss — i.e., in the direction that most reduces loss. Full-batch GD: use all data to compute each gradient (accurate but slow). Stochastic GD (SGD): use one example at a time (fast but noisy). Mini-batch GD: use a batch of 32–256 examples (the practical default).
The loss landscape in deep networks is a high-dimensional surface with many local minima, saddle points, and flat regions. SGD's noise often helps escape bad local minima.
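The update rule itself is one line. A pure-Python sketch on a 1-D loss L(w) = (w − 3)², whose gradient is 2(w − 3); the loss and learning rate are illustrative.

```python
# Gradient descent: repeatedly step opposite the gradient.
w = 0.0     # initial parameter value
lr = 0.1    # learning rate (step size)

for step in range(100):
    grad = 2 * (w - 3)   # gradient of L(w) = (w - 3)^2 at the current w
    w = w - lr * grad    # step in the negative gradient direction

print(w)  # converges toward the minimum at w = 3
```

Swapping the exact gradient for a gradient estimated from one example (SGD) or a mini-batch gives the stochastic variants described above; the update rule is unchanged.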
Learning Rate
Controls how big a step to take in the direction of the gradient. Too high → oscillate wildly, overshoot minima, training diverges. Too low → training is glacially slow, gets stuck in local minima. The most important hyperparameter in deep learning. Learning rate schedules (decay, cosine annealing, warmup) adapt it during training.
A common debugging technique: plot the loss vs. learning rate for a short training run ("LR range test"). Use the rate where the loss drops fastest, before it explodes.
Momentum
Adds a fraction of the previous gradient update to the current one — like a ball rolling downhill that keeps its velocity. Smooths the optimisation path, reduces oscillation in "ravines" (narrow valleys in the loss landscape), and helps escape shallow local minima. Used in SGD+Momentum and is a core component of Adam.
Adam Optimiser
The default optimiser for most deep learning. Combines momentum (first moment — tracks the running gradient direction) with adaptive learning rates per parameter (second moment — scales steps down for parameters whose gradients are consistently large, up for those whose gradients are small). Self-tunes, requires minimal hyperparameter search, fast convergence. Almost always a good starting point.
Adam sometimes generalises slightly worse than SGD+Momentum — the adaptive rates can cause it to "memorise" training data. Some practitioners switch to SGD late in training.
Backpropagation
The algorithm for computing gradients in neural networks efficiently. Uses the chain rule of calculus to propagate error signals backwards from the output layer through all hidden layers to the input. Without backprop, training deep networks would be computationally intractable. It's the mathematical engine that makes deep learning possible.
Backprop doesn't change the network architecture or choose what to learn — it's purely a gradient computation algorithm. The learning comes from what gradient descent does with those gradients.
Batch Size
Number of training examples used in one gradient update. Large batches → more accurate gradient estimate, faster parallelism on GPUs, but may generalise worse (sharp minima). Small batches → noisy gradients, slower per-step, but often better generalisation (noise acts as implicit regularisation). Common values: 32, 64, 128, 256.
A practical rule: if your model doesn't train, try a smaller batch size. Large-batch training often requires a corresponding increase in learning rate (linear scaling rule).
10 Regularisation
Techniques that reduce overfitting by constraining model complexity — penalising complexity, adding noise, or averaging multiple models.
L1 Regularisation · Lasso
Adds sum of absolute values of weights to the loss (λ·Σ|w|). Encourages sparsity — many weights go to exactly zero, effectively eliminating features. The model performs automatic feature selection. When you have many features and expect most to be irrelevant, L1 is a good choice.
L2 Regularisation · Ridge, Weight Decay
Adds sum of squared weights to the loss (λ·Σw²). Penalises large weights, distributing the influence across all features. Weights shrink toward zero but rarely reach exactly zero. Works well when many features each have small predictive power. Standard default for neural networks — the λ term is called "weight decay".
Dropout
During training, randomly set a fraction of neurons' outputs to zero on each forward pass (typically 20–50% of neurons). Forces the network to learn redundant representations — any neuron can be absent, so every neuron must be useful independently. At inference, all neurons are active but outputs are scaled to compensate. One of the most effective regularisers for deep networks.
Dropout is equivalent to training an exponential number of thinned networks and averaging them at inference — an implicit ensemble.
Early Stopping
Monitor validation loss during training. Stop when it stops improving (or starts increasing) even if training loss is still decreasing. This is the point where the model begins to overfit. The "best" checkpoint is saved and used as the final model. Simple, free regularisation that should almost always be used.
Early stopping is a form of hyperparameter optimisation — the number of training epochs is implicitly chosen by the validation curve.
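The logic is simple enough to sketch as a plain loop: keep the best checkpoint, stop after `patience` epochs without improvement. The `val_losses` sequence here is fake data standing in for a real training run.

```python
# Early stopping with patience. Illustrative validation-loss curve:
# improves until epoch 4, then begins to overfit.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]

patience = 3
best_loss = float("inf")
best_epoch = 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # "save checkpoint"
    elif epoch - best_epoch >= patience:
        print(f"stopping at epoch {epoch}")   # no improvement for 3 epochs
        break

print(best_epoch, best_loss)  # best checkpoint: epoch 4, loss 0.58
```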
11 Supervised Algorithms
Supervised · Regression & Classification
Linear / Logistic Regression
Linear regression: predict continuous output as weighted sum of inputs. Logistic regression: apply sigmoid to linear output for probability of class membership. Interpretable, fast, great baseline. Assumes linear relationship between features and output.
Use when: interpretability matters, few features, linear separability
Supervised · Classification
k-Nearest Neighbours (kNN)
Predict by finding the k most similar training examples and taking a majority vote (classification) or average (regression). No training phase — the entire dataset is the "model." Distance-based, so feature scaling is critical.
Use when: small dataset, non-linear boundary, need simple baseline
Supervised · Classification
Support Vector Machine (SVM)
Find the hyperplane that maximises the margin between classes. The "kernel trick" maps data to higher dimensions where it becomes linearly separable. Very effective for high-dimensional data (text). Sensitive to scaling.
Use when: text classification, high-dimensional sparse features
Supervised · Both
Decision Tree
Recursively split data on the feature that best separates classes at each node. Perfectly interpretable (literally a flowchart). High variance — tends to overfit. The atomic unit of ensemble methods (Random Forest, Gradient Boosting).
Use when: interpretability is key, mixed feature types, missing values
Supervised · Both
Naïve Bayes
Apply Bayes' theorem, assuming features are conditionally independent given the class. "Naïve" because the independence assumption is usually violated. Despite this, works remarkably well for text classification. Extremely fast to train and predict.
Use when: text/spam filtering, tiny datasets, real-time prediction
12 Unsupervised Algorithms
Unsupervised · Clustering
K-Means
Partition data into K clusters by iteratively assigning points to the nearest centroid, then updating centroids. K must be chosen in advance. Assumes spherical clusters. Fast but sensitive to initialisation and outliers. Pick K using the elbow method or silhouette score.
Use when: customer segmentation, image compression, data exploration
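A minimal K-Means run with scikit-learn on two obvious blobs (toy coordinates, invented for the example), with K=2 chosen in advance as described:

```python
# K-Means on two well-separated blobs.
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # blob near the origin
     [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]   # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # points in the same blob share a label
print(km.cluster_centers_)  # centroids land near each blob's centre
```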
Unsupervised · Dimensionality Reduction
PCA (Principal Component Analysis)
Find orthogonal directions of maximum variance in the data (principal components). Project data onto the top K components, discarding the rest. Linear, interpretable, fast. Used for visualisation (reduce to 2D), compression, and as a preprocessing step.
Use when: visualise high-dimensional data, remove correlated features, noise reduction
Unsupervised · Clustering
DBSCAN
Density-Based Spatial Clustering. Groups points that are closely packed, marks outliers as noise. Doesn't require K to be specified. Finds arbitrary-shaped clusters. Two parameters: epsilon (neighbourhood radius) and min_samples.
Use when: unknown number of clusters, need outlier detection, non-spherical shapes
Unsupervised · Dimensionality Reduction
t-SNE / UMAP
Non-linear dimensionality reduction for visualisation. Preserves local structure — nearby points in high-D stay nearby in 2D. t-SNE is slower; UMAP is faster and better preserves global structure. Used almost exclusively for visualisation, not as a preprocessing step for models.
Use when: visualising embeddings, understanding cluster structure, exploring data
13 Ensemble Methods
Combine multiple models to get better performance than any single model. The key insight: diverse models that fail independently can be combined to cancel each other's errors.
BaggingBootstrap Aggregating
Train multiple models on different random subsets of the training data (with replacement). Average their predictions (regression) or take a vote (classification). Reduces variance. Works best with high-variance base learners like deep decision trees. Random Forest extends this by also randomly selecting features at each split.
Random Forest is one of the most reliable algorithms in practical ML — good out of the box, robust to overfitting, handles missing values and mixed types well.
Boosting
Train models sequentially — each model focuses on the examples the previous model got wrong by upweighting those errors. The final prediction is a weighted sum of all models. Reduces bias. The current dominant method for tabular data. Key implementations: AdaBoost (original), Gradient Boosting, XGBoost, LightGBM, CatBoost.
XGBoost and LightGBM win most tabular-data competitions and routinely outperform neural networks on structured data. If you're ignoring gradient boosting for a tabular problem, you're leaving performance on the table.
Stacking
Train a "meta-model" that takes the predictions of several base models as its input features. The base models learn different aspects of the data; the meta-model learns how to best combine their knowledge. Most powerful ensemble approach but expensive to train and tune.
14 Neural Networks
Neuron (Perceptron)
The basic unit. Takes multiple inputs, multiplies each by a weight, sums them, adds a bias, then passes through an activation function. The activation introduces non-linearity — without it, a stack of linear layers is still just linear. One neuron can only learn a linear boundary; a network of neurons can approximate any continuous function to arbitrary accuracy (Universal Approximation Theorem).
Activation Functions
ReLU (max(0,x)): fast, avoids vanishing gradient, default choice. Sigmoid: squashes to (0,1), good for output probability in binary classification, bad in hidden layers (vanishing gradient). Tanh: squashes to (-1,1), better than sigmoid for hidden layers. GELU/SiLU: smooth ReLU variants used in Transformers. Softmax: normalises K outputs to a probability distribution — always the final layer in multi-class classification.
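The common activations written out directly from their formulas, as a numpy sketch (the max-subtraction in softmax is the standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # clip negatives to zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squash into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract max for stability
    return e / e.sum()                   # normalise to a distribution

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # negatives clipped, positives pass through
print(sigmoid(0.0))  # 0.5 at the midpoint
print(softmax(x))    # three probabilities summing to 1
```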
Vanishing / Exploding Gradients
In deep networks, gradients are multiplied through many layers during backprop. If each multiplication is <1 (vanishing), gradients become exponentially small in early layers — those layers stop learning. If >1 (exploding), gradients become exponentially large — training diverges. Solutions: ReLU activations, batch normalisation, skip connections (ResNets), gradient clipping.
Vanishing gradients are why training networks with >10 layers was nearly impossible before 2015. ResNets solved this for images; Transformers use layer norm and residual connections for the same reason.
Convolutional Neural Network (CNN)
Designed for grid-structured data (images, time series). Uses convolutional filters — small learnable patterns that slide across the input detecting local features (edges, textures, shapes). Key properties: local connectivity (a neuron sees only a small region), weight sharing (the same filter is applied everywhere), spatial hierarchy (early layers detect simple features, later layers detect complex compositions).
Recurrent Neural Network (RNN)
Processes sequences by maintaining a hidden state that carries information from previous steps. Can theoretically learn long-term dependencies. In practice, vanilla RNNs forget distant context due to vanishing gradients. LSTM and GRU architectures use gating mechanisms to selectively remember or forget. Largely superseded by Transformers for NLP.
Batch Normalisation
Normalise the activations of each layer to have zero mean and unit variance across a mini-batch, then learn per-layer scale and shift parameters. Reduces internal covariate shift, allows higher learning rates, reduces sensitivity to weight initialisation, provides slight regularisation. A standard component in modern deep networks.
Without batch norm, tuning deep networks was extremely finicky. With it, training became more stable and faster. Transformers use Layer Norm instead (normalise across features rather than batch).
15 Evaluation Metrics
| Metric | Formula | When to Use | Pitfall |
| --- | --- | --- | --- |
| Accuracy | Correct / Total | Balanced classes, simple reporting | Useless on imbalanced datasets |
| Precision | TP / (TP + FP) | Cost of false positive is high (spam filter) | Ignores false negatives |
| Recall (Sensitivity) | TP / (TP + FN) | Cost of false negative is high (cancer detection) | Ignores false positives |
| F1 Score | 2·(P·R)/(P+R) | Balance precision & recall; imbalanced data | Doesn't distinguish between P and R |
| AUC-ROC | Area under TPR vs FPR curve | Ranking quality, threshold-invariant comparison | Misleading on highly imbalanced data |
| RMSE | √(MSE) | Regression, same units as target | Sensitive to outliers (like MSE) |
| MAE | Mean \|y − ŷ\| | Regression, outlier-robust reporting | Less sensitive to large errors than RMSE |
| R² (R-squared) | 1 − SS_res/SS_tot | Variance explained, relative comparison | Can be negative; always increases with features |
Confusion Matrix
A 2×2 (or K×K) table showing prediction outcomes. True Positives (TP): correctly predicted positive. True Negatives (TN): correctly predicted negative. False Positives (FP): predicted positive, actually negative (Type I error). False Negatives (FN): predicted negative, actually positive (Type II error). Every classification metric derives from this table.
Always look at the confusion matrix, not just the summary metric. A model might have 95% accuracy but never correctly predict the minority class.
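Every metric in the table above falls out of the four confusion-matrix counts — a pure-Python sketch with illustrative numbers for an imbalanced problem:

```python
# Deriving classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 30, 10, 20, 940   # illustrative imbalanced counts

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.97: looks strong
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60: misses 40%
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, round(f1, 3))
```

Note how 97% accuracy coexists with a recall of only 0.60 — exactly the gap the summary metric hides.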
16 Validation Strategies
K-Fold Cross-Validation
Split data into K equal folds. Train on K-1 folds, validate on the remaining fold. Repeat K times so every fold serves as validation once. Report average performance across folds. Gives a more reliable estimate than a single train/validation split — especially for small datasets. Common values: K=5 or K=10.
For time series data, use TimeSeriesSplit — always validate on data that comes after training data in time. Never shuffle time series data before splitting.
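With scikit-learn, K-fold cross-validation is one call — here on the bundled iris dataset with a logistic-regression baseline, both chosen purely for illustration:

```python
# 5-fold CV: five fits, five held-out scores, one averaged estimate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged performance estimate
```

Passing an integer `cv` to `cross_val_score` uses stratified folds for classifiers, so each fold keeps the class proportions described under Stratified Sampling below.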
Leave-One-Out CV (LOOCV)
K-fold where K = n (number of examples). Train on n-1 examples, validate on the one left out. Repeat n times. Nearly unbiased estimate of true error but computationally expensive (n full training runs). Practical only for small datasets or very fast models.
Stratified Sampling
Ensure each fold contains approximately the same proportion of each class as the full dataset. Critical for imbalanced problems — random splits might put all rare examples in one fold. Stratified K-fold is the default for classification; regular K-fold works for regression.
17 The ML Pipeline
1. Define Problem
Classification, regression, clustering? Success metric? Business goal?
↓
2. Collect Data
Volume, quality, bias. Check distributions and label quality.
↓
3. Explore (EDA)
Distributions, correlations, outliers, missing values, class balance.
↓
4. Preprocess
Clean, encode, scale, split into train/val/test. Fit on train only.
↓
5. Baseline Model
Start simple: logistic regression, random forest. Know what you're beating.
↓
6. Feature Engineering
Domain knowledge. Interaction terms. Transformations. Often highest leverage.
↓
7. Model Selection
Try multiple architectures. Compare on validation set.
↓
8. Hyperparameter Tuning
Grid search, random search, Bayesian optimisation. Use val set.
↓
9. Final Evaluation
Evaluate best model on test set. Once. Report this number.
↓
10. Deploy & Monitor
Data drift, model drift, latency, feedback loops.
18 Common Pitfalls
| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Data leakage | Suspiciously high accuracy; fails in production | Ensure test set is never used during feature engineering or normalisation |
| Overfitting | High train accuracy, low val accuracy | Regularisation, more data, simpler model, dropout, early stopping |
| Underfitting | Low accuracy on both train and val | More complex model, more features, reduce regularisation |
| Wrong metric | Model looks great but business goal not met | Define the metric before training; align with stakeholders |
| Class imbalance ignored | Predicts majority class always; high accuracy, zero recall for minority | Stratified sampling, class weights, appropriate metric (F1, AUC) |
| Not scaling features | SVMs, neural nets, kNN perform poorly | Standardise/normalise; fit scaler on train only |
| Test set peeking | Optimistic test score; real performance worse | Lock test set; use validation set for all development decisions |
| Ignoring baselines | "ML model" that doesn't beat a simple heuristic | Always compare against majority class, mean prediction, or rule-based system |