ML Fundamentals Handbook
From first principles to production — every concept explained with the why, not just the what. Built for practitioners who want intuition, not just definitions.
Supervised · Unsupervised · Deep Learning · Optimisation · Statistics
01 What Is Machine Learning?
Traditional programming gives a computer explicit rules to produce outputs from inputs. Machine learning inverts this: you give the computer inputs and desired outputs, and it figures out the rules itself. The "learning" is the process of finding those rules — usually expressed as the parameters of a mathematical model.
Arthur Samuel's 1959 definition remains the clearest: "A field of study that gives computers the ability to learn without being explicitly programmed." Tom Mitchell's 1997 formalism makes it precise: a program learns from experience E with respect to task T and performance measure P, if its performance on T improves with E.
💡 The key insight: ML is useful when the rules are too complex, too numerous, or simply unknown — spam filters, image recognition, stock prediction. If you can write down the rules explicitly, you probably don't need ML.
Data (examples) + Algorithm + Training → Model → Predictions
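That flow — data plus an algorithm, trained into a model that makes predictions — can be sketched in a few lines. This is an illustrative toy, assuming scikit-learn is available; the dataset (study/sleep hours predicting a pass) is invented for the example.

```python
# Data (examples) + algorithm + training -> model -> predictions,
# sketched with scikit-learn. The toy data is illustrative only.
from sklearn.linear_model import LogisticRegression

# Inputs (hours studied, hours slept) and desired outputs (passed exam?)
X = [[2, 9], [1, 5], [5, 1], [8, 8], [3, 4], [9, 7]]
y = [1, 0, 0, 1, 0, 1]

model = LogisticRegression()    # the algorithm
model.fit(X, y)                 # training: find the rules (parameters)
print(model.predict([[6, 6]]))  # prediction on a new, unseen input
```

Note that we never wrote a rule like "pass if hours studied > 5"; the model found its own rule as learned parameter values.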
02 Types of Learning
Supervised Learning
Train on labelled examples — each input has a known correct output. The model learns a mapping from inputs to outputs by minimising prediction error on training data. This is the most common type in practice. Classification (discrete outputs: spam/not-spam) and regression (continuous outputs: house price) are both supervised.
You need labels, which are expensive to acquire. The model can only generalise to patterns it has seen a version of in training.
Unsupervised Learning
No labels — find hidden structure in raw data. The model must discover patterns, groupings, or representations without being told what to look for. Used for clustering (group customers by behaviour), dimensionality reduction (compress data while preserving structure), and density estimation (model the data distribution).
Harder to evaluate because there's no ground truth. "Is this clustering good?" depends on what you want to do with it.
Semi-Supervised Learning
Uses a small amount of labelled data and a large amount of unlabelled data. The unlabelled data provides structural information about the input space; the labels provide signal about which structures matter. Works well when labelling is expensive but data collection is cheap — medical imaging is a classic case.
The underlying assumption is that data points near each other in input space are likely to share the same label (smoothness assumption).
Reinforcement Learning (RL)
An agent takes actions in an environment, receives rewards or penalties, and learns to maximise cumulative reward over time. Unlike supervised learning, there's no correct answer provided — the agent must discover it through trial and error. AlphaGo, game-playing AIs, and robot control are RL problems.
The key challenge is the credit assignment problem: which of the many actions in a sequence actually caused the final reward to be high or low?
Self-Supervised Learning
A form of unsupervised learning where labels are generated from the data itself. Predict the next word in a sentence, predict a masked image patch, predict if two views of the same image came from the same source. The task is artificial, but the representations learned are rich and transfer to real tasks. This is how GPT, BERT, and most modern LLMs are pretrained.
Self-supervision scales: you can train on the entire internet without human annotation. This is why LLMs are so powerful.
03 Core Terminology
Model
A mathematical function that maps inputs to outputs. Defined by its architecture (the structure of the function — e.g., a decision tree, a neural network with 12 layers) and its parameters (the numbers that get adjusted during training — weights, biases).
The architecture is chosen by the designer; the parameters are learned from data. A model with the wrong architecture can't learn even with infinite data.
Parameters · aka weights, coefficients
The numbers inside the model that are adjusted during training. In a linear model y = wx + b, w and b are parameters. A large modern neural network can have billions of parameters. The entire act of training is finding good parameter values.
More parameters = more expressive power, but also more data needed and more risk of overfitting.
Hyperparameters
Settings that control the training process itself — not learned from data, but set by the practitioner before training. Examples: learning rate, number of layers, number of trees in a forest, batch size, regularisation strength. Getting hyperparameters wrong is one of the most common causes of poor performance.
Hyperparameters sit "above" (hyper) the model parameters. You tune them using a validation set, not the test set — otherwise you're cheating.
Training, Validation & Test Sets
The dataset is split into three parts. Training set: used to fit model parameters. Validation set: used to tune hyperparameters and select the best model — the model never learns from this directly. Test set: used once, at the very end, to estimate real-world performance. If you evaluate on the test set multiple times and pick the best result, you have leaked information.
A common ratio is 70/15/15 or 80/10/10. For very large datasets, even 1% for validation may be thousands of examples.
Inference
Using a trained model to make predictions on new, unseen data. During inference, parameters are fixed — no learning happens. This is what happens when an ML system runs in production: image classifiers, recommendation engines, and LLMs all run inference.
Training is expensive (days/weeks, large compute). Inference must be fast (milliseconds, at scale). This tension shapes system design decisions.
Generalisation
A model generalises when it performs well on data it has never seen before — not just the training set. This is the whole point of ML. A model that only memorises training examples is useless. Good generalisation means the model has learned the underlying pattern, not the noise.
The gap between training performance and generalisation performance tells you almost everything about what's wrong with a model.
Gradient
A vector that points in the direction of steepest increase of a function. In ML, we compute the gradient of the loss function with respect to each parameter. Moving parameters in the opposite direction (gradient descent) reduces the loss. Gradients are calculated using backpropagation in neural networks.
Intuition: imagine you're blindfolded on a hilly terrain trying to reach the lowest point. The gradient tells you which direction is "uphill" so you can step downhill.
04 Data Fundamentals
Feature · aka input variable, predictor, attribute
An individual measurable property of an observation. Height, age, pixel value, word frequency — these are features. A dataset with n observations and p features has shape (n, p). The model learns how features relate to the target output.
Label · aka target, output variable, y
The ground-truth answer the model is trained to predict. In supervised learning, every training example has a label. In binary classification, labels are 0 or 1. In regression, they're continuous numbers. In multi-class classification, they're one of K categories.
Data Imbalance
When one class is far more common than others in a classification problem. Fraud detection: 99.9% legitimate transactions, 0.1% fraud. A model that predicts "legitimate" for everything achieves 99.9% accuracy but is completely useless. Solutions: oversample the minority class (SMOTE), undersample the majority, use class-weighted loss, or evaluate with precision/recall/F1 rather than accuracy.
Accuracy is a misleading metric on imbalanced datasets. Always check the class distribution before choosing your metric.
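The fraud example above can be made concrete in pure Python — a model that always predicts "legitimate" scores 99.9% accuracy and 0% recall. The numbers are illustrative.

```python
# Why accuracy misleads on imbalanced data: a "model" that always
# predicts the majority class. Illustrative numbers.
y_true = [0] * 999 + [1] * 1   # 99.9% legitimate, 0.1% fraud
y_pred = [0] * 1000            # predict "legitimate" for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn)        # recall for the fraud class

print(accuracy)  # 0.999: looks great
print(recall)    # 0.0: catches no fraud at all
```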
Data Leakage
When information from outside the training data boundary inadvertently enters the model's training. Examples: including future data in features, preprocessing the entire dataset (including test) before splitting, normalising using statistics computed on the full dataset. Leakage causes misleadingly optimistic performance that evaporates in production.
Leakage is often subtle. "My model got 99% accuracy in testing but failed completely in production" is almost always leakage.
| Data Type | Description | ML Treatment |
| --- | --- | --- |
| Continuous / Numerical | Real-valued: age, temperature, price | Normalise or standardise; works directly with most algorithms |
| Categorical / Nominal | Unordered categories: colour, city, species | One-hot encoding; embedding (for high cardinality) |
| Ordinal | Ordered categories: small/medium/large, rating 1–5 | Label encoding preserving order; or treat as continuous |
| Binary | Two values: yes/no, true/false | 0/1 encoding; works as-is in most algorithms |
| Text | Strings: reviews, documents, code | Tokenise → embed; TF-IDF; language model features |
| Image | Pixel arrays | Convolutional networks; normalise pixels to [0,1] |
| Time Series | Sequences with temporal order | Respect temporal split; lag features; RNN/Transformer |
05 Feature Engineering
The process of using domain knowledge to transform raw data into features that better represent the underlying problem to the model. Feature engineering is often the highest-leverage activity in applied ML — better features beat better algorithms most of the time.
One-Hot Encoding
Convert a categorical variable with K categories into K binary columns. "City: London/Paris/Tokyo" becomes three columns: is_London, is_Paris, is_Tokyo. This prevents the model from incorrectly treating categories as ordered (treating "Tokyo" as 3× "London" just because it's category 3).
Curse: if a category has 10,000 values (postcodes), you get 10,000 columns. Use embeddings or target encoding for high-cardinality categoricals.
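A minimal sketch of one-hot encoding with pandas (`get_dummies`); scikit-learn's `OneHotEncoder` is the other common route. The city column is the example from the text.

```python
# One-hot encode a 3-category feature: one binary column per category.
import pandas as pd

df = pd.DataFrame({"city": ["London", "Paris", "Tokyo", "London"]})
encoded = pd.get_dummies(df, columns=["city"])

print(encoded.columns.tolist())
# ['city_London', 'city_Paris', 'city_Tokyo']
```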
Feature Scaling
Standardisation (Z-score): subtract mean, divide by std → zero mean, unit variance. Min-max normalisation: scale to [0,1]. Many algorithms (SVMs, neural nets, kNN) are sensitive to feature scale — a feature ranging 0–1,000,000 will dominate a feature ranging 0–1. Decision trees and random forests are scale-invariant.
Always fit the scaler on training data only, then apply to validation and test. If you compute mean/std on the full dataset, you've leaked test statistics into training.
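The fit-on-train-only discipline looks like this with scikit-learn's `StandardScaler` (toy numbers for illustration):

```python
# Learn mean/std from the training split only, then reuse those exact
# statistics on held-out data. This is what prevents leakage.
from sklearn.preprocessing import StandardScaler

X_train = [[1.0], [2.0], [3.0], [4.0]]
X_test = [[10.0]]   # illustrative held-out point

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train
X_test_scaled = scaler.transform(X_test)        # apply the SAME stats

print(scaler.mean_)    # [2.5], computed from training data only
print(X_test_scaled)   # 10 maps far outside the training range
```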
Embeddings
Dense vector representations of discrete objects (words, users, products). A word embedding maps "king" to a 300-dimensional vector where similar words are nearby in vector space. Learned jointly with the model or pre-trained separately. Key property: algebraic meaning — king − man + woman ≈ queen.
Embeddings are how ML handles high-cardinality categorical data and non-numeric data (text, graphs) without exploding dimensionality.
06 Preprocessing
Missing Values
Deletion: drop rows or columns with missing data (loses information; okay if <1% missing). Mean/median imputation: fill with column mean/median (distorts variance; simple). Model imputation: predict missing values using other features (KNN imputation, iterative imputation). Indicator flag: add a binary "was_missing" column (tells model that missingness itself carries signal).
Missing data is often non-random (MNAR: Missing Not At Random). A patient refusing to answer a health question may itself be informative.
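Two of the strategies above — median imputation plus an indicator flag — can be combined in one step with scikit-learn's `SimpleImputer`; the tiny array is illustrative.

```python
# Median imputation with a "was_missing" indicator column appended.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [2.0], [np.nan], [100.0]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# column 0: NaN replaced by the median of the observed values (2.0)
# column 1: 1.0 where the value was missing, else 0.0
```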
Outliers
Extreme values far from the typical range. Can be genuine (billionaire in income data) or errors (sensor malfunction). Detection: Z-score >3, IQR method, isolation forests. Treatment: clip (Winsorisation), log-transform skewed distributions, remove if certain errors, or use robust algorithms (median-based, tree models).
Outliers in the target variable (y) are especially dangerous for regression — one extreme point can completely skew the learned line.
07 Bias–Variance Tradeoff
The most fundamental tension in machine learning. Total expected error = Bias² + Variance + Irreducible Noise.
Bias
Error from incorrect assumptions about the data. A high-bias model is too simple — it can't capture the real pattern. Underfitting: poor performance on both training and test data. Example: fitting a straight line to data that's fundamentally curved. The model makes systematic, predictable errors regardless of how much data you give it.
High bias models fail not because they've learned the wrong thing, but because they can't learn enough. More data doesn't help — you need a more expressive model.
Variance
Error from sensitivity to small fluctuations in training data. A high-variance model is too complex — it memorises training data including noise. Overfitting: excellent performance on training, poor on test. If you retrained on a slightly different sample, you'd get a very different model. Deep neural networks without regularisation are notorious for high variance.
High variance models fail because they've learned the training data, not the underlying pattern. More data usually helps. Regularisation is the other fix.
⚖ The tradeoff: Reducing bias usually increases variance (use a more complex model → it can fit more patterns but is more sensitive to noise). Reducing variance usually increases bias (constrain/simplify the model → more robust but less expressive). The art of ML is finding the sweet spot for your dataset size and problem complexity.
08 Loss Functions
The loss (or cost) function measures how wrong the model's predictions are. Training = minimising the loss. Your choice of loss function shapes what the model optimises for, which directly affects behaviour.
Mean Squared Error (MSE) · regression
Average of squared prediction errors. Squaring has two effects: errors are always positive (so negative and positive errors don't cancel), and large errors are penalised disproportionately (a 10× larger error becomes 100× larger in the loss). This sensitivity to outliers can be a feature or a bug depending on the problem.
If you care more about large errors than small ones, MSE is natural. If outliers shouldn't dominate, use MAE (Mean Absolute Error) instead.
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Cross-Entropy Loss · classification, aka log loss
Measures the difference between two probability distributions — the model's predicted probabilities and the true labels. When the model is confident and correct, loss is near zero. When the model is confident and wrong, loss is very large. This asymmetry is what makes cross-entropy work so well: it severely punishes confident wrong predictions.
Binary cross-entropy is used for two classes; categorical cross-entropy for multiple classes. If your model outputs a raw score (not a probability), you need a softmax layer first.
CE = −Σ yᵢ · log(ŷᵢ)
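Both formulas are short enough to compute directly — a numpy sketch with illustrative numbers, showing MSE's quadratic penalty and cross-entropy's punishment of confident wrong predictions:

```python
import numpy as np

# Regression: MSE penalises the one large error quadratically.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 10.0])
mse = np.mean((y_true - y_pred) ** 2)   # (0.25 + 0 + 9) / 3
print(mse)

# Classification: per-example cross-entropy is -log(p of the true class).
print(-np.log(0.99))  # confident and correct: loss near zero
print(-np.log(0.01))  # confident and wrong: loss ≈ 4.6, huge
```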
Hinge Loss · SVMs, classification
Used in Support Vector Machines. Zero loss when the prediction is correct with sufficient margin (confidence). Positive loss when the prediction is wrong or correct but too close to the boundary. This encourages a decision boundary that sits far from any training examples — the maximum-margin classifier.
09 Optimisation
Gradient Descent
The workhorse of ML. Repeatedly move parameters in the direction of the negative gradient of the loss — i.e., in the direction that most reduces loss. Full-batch GD: use all data to compute each gradient (accurate but slow). Stochastic GD (SGD): use one example at a time (fast but noisy). Mini-batch GD: use a batch of 32–256 examples (the practical default).
The loss landscape in deep networks is a high-dimensional surface with many local minima, saddle points, and flat regions. SGD's noise often helps escape bad local minima.
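The update rule itself is one line. A pure-Python sketch on a 1-D loss L(w) = (w − 3)², whose gradient is 2(w − 3); the loss and learning rate are illustrative.

```python
# Gradient descent: repeatedly step opposite the gradient.
w = 0.0     # initial parameter value
lr = 0.1    # learning rate (step size)

for step in range(100):
    grad = 2 * (w - 3)   # gradient of L(w) = (w - 3)^2 at the current w
    w = w - lr * grad    # step in the negative gradient direction

print(w)  # converges toward the minimum at w = 3
```

Swapping the exact gradient for a gradient estimated from one example (SGD) or a mini-batch gives the stochastic variants described above; the update rule is unchanged.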
Learning Rate
Controls how big a step to take in the direction of the gradient. Too high → oscillate wildly, overshoot minima, training diverges. Too low → training is glacially slow, gets stuck in local minima. The most important hyperparameter in deep learning. Learning rate schedules (decay, cosine annealing, warmup) adapt it during training.
A common debugging technique: plot the loss vs. learning rate for a short training run ("LR range test"). Use the rate where the loss drops fastest, before it explodes.
Momentum
Adds a fraction of the previous gradient update to the current one — like a ball rolling downhill that keeps its velocity. Smooths the optimisation path, reduces oscillation in "ravines" (narrow valleys in the loss landscape), and helps escape shallow local minima. Used in SGD+Momentum and is a core component of Adam.
Adam Optimiser
The default optimiser for most deep learning. Combines momentum (first moment — tracks the running gradient direction) with adaptive learning rates per parameter (second moment — scales steps down for parameters whose gradients are consistently large, up for those whose gradients are small). Self-tunes, requires minimal hyperparameter search, fast convergence. Almost always a good starting point.
Adam sometimes generalises slightly worse than SGD+Momentum — the adaptive rates can cause it to "memorise" training data. Some practitioners switch to SGD late in training.
Backpropagation
The algorithm for computing gradients in neural networks efficiently. Uses the chain rule of calculus to propagate error signals backwards from the output layer through all hidden layers to the input. Without backprop, training deep networks would be computationally intractable. It's the mathematical engine that makes deep learning possible.
Backprop doesn't change the network architecture or choose what to learn — it's purely a gradient computation algorithm. The learning comes from what gradient descent does with those gradients.
Batch Size
Number of training examples used in one gradient update. Large batches → more accurate gradient estimate, faster parallelism on GPUs, but may generalise worse (sharp minima). Small batches → noisy gradients, slower per-step, but often better generalisation (noise acts as implicit regularisation). Common values: 32, 64, 128, 256.
A practical rule: if your model doesn't train, try a smaller batch size. Large-batch training often requires a corresponding increase in learning rate (linear scaling rule).
10 Regularisation
Techniques that reduce overfitting by constraining model complexity — penalising complexity, adding noise, or averaging multiple models.
L1 Regularisation · Lasso
Adds sum of absolute values of weights to the loss (λ·Σ|w|). Encourages sparsity — many weights go to exactly zero, effectively eliminating features. The model performs automatic feature selection. When you have many features and expect most to be irrelevant, L1 is a good choice.
L2 Regularisation · Ridge, Weight Decay
Adds sum of squared weights to the loss (λ·Σw²). Penalises large weights, distributing the influence across all features. Weights shrink toward zero but rarely reach exactly zero. Works well when many features each have small predictive power. Standard default for neural networks — the λ term is called "weight decay".
Dropout
During training, randomly set a fraction of neurons' outputs to zero on each forward pass (typically 20–50% of neurons). Forces the network to learn redundant representations — any neuron can be absent, so every neuron must be useful independently. At inference, all neurons are active but outputs are scaled to compensate. One of the most effective regularisers for deep networks.
Dropout is equivalent to training an exponential number of thinned networks and averaging them at inference — an implicit ensemble.
Early Stopping
Monitor validation loss during training. Stop when it stops improving (or starts increasing) even if training loss is still decreasing. This is the point where the model begins to overfit. The "best" checkpoint is saved and used as the final model. Simple, free regularisation that should almost always be used.
Early stopping is a form of hyperparameter optimisation — the number of training epochs is implicitly chosen by the validation curve.
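The logic is simple enough to sketch as a plain loop: keep the best checkpoint, stop after `patience` epochs without improvement. The `val_losses` sequence here is fake data standing in for a real training run.

```python
# Early stopping with patience. Illustrative validation-loss curve:
# improves until epoch 4, then begins to overfit.
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]

patience = 3
best_loss = float("inf")
best_epoch = 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch = loss, epoch   # "save checkpoint"
    elif epoch - best_epoch >= patience:
        print(f"stopping at epoch {epoch}")   # no improvement for 3 epochs
        break

print(best_epoch, best_loss)  # best checkpoint: epoch 4, loss 0.58
```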
11 Supervised Algorithms
Supervised · Regression & Classification
Linear / Logistic Regression
Linear regression: predict continuous output as weighted sum of inputs. Logistic regression: apply sigmoid to linear output for probability of class membership. Interpretable, fast, great baseline. Assumes linear relationship between features and output.
Use when: interpretability matters, few features, linear separability
Supervised · Classification
k-Nearest Neighbours (kNN)
Predict by finding the k most similar training examples and taking a majority vote (classification) or average (regression). No training phase — the entire dataset is the "model." Distance-based, so feature scaling is critical.
Use when: small dataset, non-linear boundary, need simple baseline
Supervised · Classification
Support Vector Machine (SVM)
Find the hyperplane that maximises the margin between classes. The "kernel trick" maps data to higher dimensions where it becomes linearly separable. Very effective for high-dimensional data (text). Sensitive to scaling.
Use when: text classification, high-dimensional sparse features
Supervised · Both
Decision Tree
Recursively split data on the feature that best separates classes at each node. Perfectly interpretable (literally a flowchart). High variance — tends to overfit. The atomic unit of ensemble methods (Random Forest, Gradient Boosting).
Use when: interpretability is key, mixed feature types, missing values
Supervised · Both
Naïve Bayes
Apply Bayes' theorem, assuming features are conditionally independent given the class. "Naïve" because the independence assumption is usually violated. Despite this, works remarkably well for text classification. Extremely fast to train and predict.
Use when: text/spam filtering, tiny datasets, real-time prediction
12 Unsupervised Algorithms
Unsupervised · Clustering
K-Means
Partition data into K clusters by iteratively assigning points to the nearest centroid, then updating centroids. K must be chosen in advance. Assumes spherical clusters. Fast but sensitive to initialisation and outliers. Pick K using the elbow method or silhouette score.
Use when: customer segmentation, image compression, data exploration
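A minimal K-Means run with scikit-learn on two obvious blobs (toy coordinates, invented for the example), with K=2 chosen in advance as described:

```python
# K-Means on two well-separated blobs.
from sklearn.cluster import KMeans

X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # blob near the origin
     [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]   # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # points in the same blob share a label
print(km.cluster_centers_)  # centroids land near each blob's centre
```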
Unsupervised · Dimensionality Reduction
PCA (Principal Component Analysis)
Find orthogonal directions of maximum variance in the data (principal components). Project data onto the top K components, discarding the rest. Linear, interpretable, fast. Used for visualisation (reduce to 2D), compression, and as a preprocessing step.
Use when: visualise high-dimensional data, remove correlated features, noise reduction
Unsupervised · Clustering
DBSCAN
Density-Based Spatial Clustering. Groups points that are closely packed, marks outliers as noise. Doesn't require K to be specified. Finds arbitrary-shaped clusters. Two parameters: epsilon (neighbourhood radius) and min_samples.
Use when: unknown number of clusters, need outlier detection, non-spherical shapes
Unsupervised · Dimensionality Reduction
t-SNE / UMAP
Non-linear dimensionality reduction for visualisation. Preserves local structure — nearby points in high-D stay nearby in 2D. t-SNE is slower; UMAP is faster and better preserves global structure. Used almost exclusively for visualisation, not as a preprocessing step for models.
Use when: visualising embeddings, understanding cluster structure, exploring data
13 Ensemble Methods
Combine multiple models to get better performance than any single model. The key insight: diverse models that fail independently can be combined to cancel each other's errors.
BaggingBootstrap Aggregating
Train multiple models on different random subsets of the training data (with replacement). Average their predictions (regression) or take a vote (classification). Reduces variance. Works best with high-variance base learners like deep decision trees. Random Forest extends this by also randomly selecting features at each split.
Random Forest is one of the most reliable algorithms in practical ML — good out of the box, robust to overfitting, handles missing values and mixed types well.
Boosting
Train models sequentially — each model focuses on the examples the previous model got wrong by upweighting those errors. The final prediction is a weighted sum of all models. Reduces bias. The current dominant method for tabular data. Key implementations: AdaBoost (original), Gradient Boosting, XGBoost, LightGBM, CatBoost.
XGBoost and LightGBM win most tabular-data competitions and routinely outperform neural networks on structured data. If you're ignoring gradient boosting for a tabular problem, you're leaving performance on the table.
Stacking
Train a "meta-model" that takes the predictions of several base models as its input features. The base models learn different aspects of the data; the meta-model learns how to best combine their knowledge. Most powerful ensemble approach but expensive to train and tune.
14 Neural Networks
Neuron (Perceptron)
The basic unit. Takes multiple inputs, multiplies each by a weight, sums them, adds a bias, then passes through an activation function. The activation introduces non-linearity — without it, a stack of linear layers is still just linear. One neuron can only learn a linear boundary; a network of neurons can approximate any continuous function to arbitrary accuracy (Universal Approximation Theorem).
Activation Functions
ReLU (max(0,x)): fast, avoids vanishing gradient, default choice. Sigmoid: squashes to (0,1), good for output probability in binary classification, bad in hidden layers (vanishing gradient). Tanh: squashes to (-1,1), better than sigmoid for hidden layers. GELU/SiLU: smooth ReLU variants used in Transformers. Softmax: normalises K outputs to a probability distribution — always the final layer in multi-class classification.
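The common activations written out directly from their formulas, as a numpy sketch (the max-subtraction in softmax is the standard numerical-stability trick):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # clip negatives to zero

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squash into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract max for stability
    return e / e.sum()                   # normalise to a distribution

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # negatives clipped, positives pass through
print(sigmoid(0.0))  # 0.5 at the midpoint
print(softmax(x))    # three probabilities summing to 1
```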
Vanishing / Exploding Gradients
In deep networks, gradients are multiplied through many layers during backprop. If each multiplication is <1 (vanishing), gradients become exponentially small in early layers — those layers stop learning. If >1 (exploding), gradients become exponentially large — training diverges. Solutions: ReLU activations, batch normalisation, skip connections (ResNets), gradient clipping.
Vanishing gradients are why training networks with >10 layers was nearly impossible before 2015. ResNets solved this for images; Transformers use layer norm and residual connections for the same reason.
Convolutional Neural Network (CNN)
Designed for grid-structured data (images, time series). Uses convolutional filters — small learnable patterns that slide across the input detecting local features (edges, textures, shapes). Key properties: local connectivity (a neuron sees only a small region), weight sharing (the same filter is applied everywhere), spatial hierarchy (early layers detect simple features, later layers detect complex compositions).
Recurrent Neural Network (RNN)
Processes sequences by maintaining a hidden state that carries information from previous steps. Can theoretically learn long-term dependencies. In practice, vanilla RNNs forget distant context due to vanishing gradients. LSTM and GRU architectures use gating mechanisms to selectively remember or forget. Largely superseded by Transformers for NLP.
Batch Normalisation
Normalise the activations of each layer to have zero mean and unit variance across a mini-batch, then learn per-layer scale and shift parameters. Reduces internal covariate shift, allows higher learning rates, reduces sensitivity to weight initialisation, provides slight regularisation. A standard component in modern deep networks.
Without batch norm, tuning deep networks was extremely finicky. With it, training became more stable and faster. Transformers use Layer Norm instead (normalise across features rather than batch).
15 Evaluation Metrics
| Metric | Formula | When to Use | Pitfall |
| --- | --- | --- | --- |
| Accuracy | Correct / Total | Balanced classes, simple reporting | Useless on imbalanced datasets |
| Precision | TP / (TP + FP) | Cost of false positive is high (spam filter) | Ignores false negatives |
| Recall (Sensitivity) | TP / (TP + FN) | Cost of false negative is high (cancer detection) | Ignores false positives |
| F1 Score | 2·(P·R)/(P+R) | Balance precision & recall; imbalanced data | Doesn't distinguish between P and R |
| AUC-ROC | Area under TPR vs FPR curve | Ranking quality, threshold-invariant comparison | Misleading on highly imbalanced data |
| RMSE | √(MSE) | Regression, same units as target | Sensitive to outliers (like MSE) |
| MAE | Mean \|y − ŷ\| | Regression, outlier-robust reporting | Less sensitive to large errors than RMSE |
| R² (R-squared) | 1 − SS_res/SS_tot | Variance explained, relative comparison | Can be negative; always increases with features |
Confusion Matrix
A 2×2 (or K×K) table showing prediction outcomes. True Positives (TP): correctly predicted positive. True Negatives (TN): correctly predicted negative. False Positives (FP): predicted positive, actually negative (Type I error). False Negatives (FN): predicted negative, actually positive (Type II error). Every classification metric derives from this table.
Always look at the confusion matrix, not just the summary metric. A model might have 95% accuracy but never correctly predict the minority class.
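Every metric in the table above falls out of the four confusion-matrix counts — a pure-Python sketch with illustrative numbers for an imbalanced problem:

```python
# Deriving classification metrics from confusion-matrix counts.
tp, fp, fn, tn = 30, 10, 20, 940   # illustrative imbalanced counts

accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.97: looks strong
precision = tp / (tp + fp)                          # 0.75
recall = tp / (tp + fn)                             # 0.60: misses 40%
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, round(f1, 3))
```

Note how 97% accuracy coexists with a recall of only 0.60 — exactly the gap the summary metric hides.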
16 Validation Strategies
K-Fold Cross-Validation
Split data into K equal folds. Train on K-1 folds, validate on the remaining fold. Repeat K times so every fold serves as validation once. Report average performance across folds. Gives a more reliable estimate than a single train/validation split — especially for small datasets. Common values: K=5 or K=10.
For time series data, use TimeSeriesSplit — always validate on data that comes after training data in time. Never shuffle time series data before splitting.
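With scikit-learn, K-fold cross-validation is one call — here on the bundled iris dataset with a logistic-regression baseline, both chosen purely for illustration:

```python
# 5-fold CV: five fits, five held-out scores, one averaged estimate.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy score per fold
print(scores.mean())  # the averaged performance estimate
```

Passing an integer `cv` to `cross_val_score` uses stratified folds for classifiers, so each fold keeps the class proportions described under Stratified Sampling below.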
Leave-One-Out CV (LOOCV)
K-fold where K = n (number of examples). Train on n-1 examples, validate on the one left out. Repeat n times. Nearly unbiased estimate of true error but computationally expensive (n full training runs). Practical only for small datasets or very fast models.
Stratified Sampling
Ensure each fold contains approximately the same proportion of each class as the full dataset. Critical for imbalanced problems — random splits might put all rare examples in one fold. Stratified K-fold is the default for classification; regular K-fold works for regression.
17 The ML Pipeline
1. Define Problem
Classification, regression, clustering? Success metric? Business goal?
↓
2. Collect Data
Volume, quality, bias. Check distributions and label quality.
↓
3. Explore (EDA)
Distributions, correlations, outliers, missing values, class balance.
↓
4. Preprocess
Clean, encode, scale, split into train/val/test. Fit on train only.
↓
5. Baseline Model
Start simple: logistic regression, random forest. Know what you're beating.
↓
6. Feature Engineering
Domain knowledge. Interaction terms. Transformations. Often highest leverage.
↓
7. Model Selection
Try multiple architectures. Compare on validation set.
↓
8. Hyperparameter Tuning
Grid search, random search, Bayesian optimisation. Use val set.
↓
9. Final Evaluation
Evaluate best model on test set. Once. Report this number.
↓
10. Deploy & Monitor
Data drift, model drift, latency, feedback loops.
18 Common Pitfalls
| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Data leakage | Suspiciously high accuracy; fails in production | Ensure test set is never used during feature engineering or normalisation |
| Overfitting | High train accuracy, low val accuracy | Regularisation, more data, simpler model, dropout, early stopping |
| Underfitting | Low accuracy on both train and val | More complex model, more features, reduce regularisation |
| Wrong metric | Model looks great but business goal not met | Define the metric before training; align with stakeholders |
| Class imbalance ignored | Predicts majority class always; high accuracy, zero recall for minority | Stratified sampling, class weights, appropriate metric (F1, AUC) |
| Not scaling features | SVMs, neural nets, kNN perform poorly | Standardise/normalise; fit scaler on train only |
| Test set peeking | Optimistic test score; real performance worse | Lock test set; use validation set for all development decisions |
| Ignoring baselines | "ML model" that doesn't beat a simple heuristic | Always compare against majority class, mean prediction, or rule-based system |