Field Handbook · Classic ML · v2 June 2026

Machine Learning Essentials

The foundational reference for classical machine learning — algorithms, evaluation metrics, model workflow, and the statistics and mathematics powering it all. From linear regression to eigenvalues, covered precisely.

Supervised Learning · Evaluation Metrics · Ensemble Methods · Clustering · Statistics · Linear Algebra

📈

01 // ALGORITHM

Linear Regression

// CONTINUOUS OUTPUT · SUPERVISED

Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a hyperplane that minimizes prediction error. It's the bedrock of supervised regression and the conceptual foundation for many advanced methods including neural networks.

Ordinary Least Squares (OLS)

Supervised · Regression

Predicts a continuous output by learning the best-fit hyperplane through training data. "Best fit" is defined as minimizing the sum of squared residuals. The result is a weight vector — one coefficient per feature plus a bias term. Closed-form solution: β = (XᵀX)⁻¹Xᵀy.

Hypothesis

ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ = Xβ

Cost Function (MSE)

J(β) = (1/2m) Σᵢ (ŷᵢ − yᵢ)² = (1/2m) ‖Xβ − y‖²

When to use

Continuous numeric output; linear relationship suspected

Assumptions

Linearity, independent errors, homoscedasticity, no multicollinearity, normally distributed residuals (needed for inference, not for fitting)

Complexity

Train O(np² + p³) via the normal equations, Predict O(p) per sample; n = samples, p = features
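As a minimal sketch of the closed-form solution and the cost function above, the snippet below fits OLS with NumPy on synthetic data. The data, seed, and variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the bias β₀ is learned like any other coefficient
X_b = np.column_stack([np.ones(len(X)), X])

# Normal equations: solve (XᵀX) β = Xᵀy rather than forming the inverse explicitly
beta_normal = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

# SVD-based least squares: numerically safer when XᵀX is ill-conditioned
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Cost J(β) = (1/2m) ‖Xβ − y‖²
m = len(y)
cost = np.sum((X_b @ beta_normal - y) ** 2) / (2 * m)
print(beta_normal, cost)
```

Both solvers should agree here; on nearly collinear features the least-squares routine is usually the safer choice.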

Regularization · Critical

Ridge (L2): Adds λ·Σβ² penalty — shrinks all coefficients, none to exactly zero. Preferred when all features contribute.
Lasso (L1): Adds λ·Σ|β| penalty — drives some to exactly zero. Built-in feature selection for sparse problems.
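ElasticNet: Combines both penalties, α·(L1) + (1−α)·(L2), scaled by an overall strength λ (in sklearn, alpha is the strength and l1_ratio is the mix). Best for correlated features with sparsity.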
Choosing λ: Cross-validate. RidgeCV and LassoCV do this automatically.

Key Metrics · Regression

MSE: Mean Squared Error — penalizes large errors heavily. Scale-dependent.
RMSE: √MSE — same units as target. Most interpretable.
MAE: Mean Absolute Error — more robust to outliers than MSE.
R²: Proportion of variance explained (1 = perfect; 0 = no better than predicting the mean; can be negative on held-out data). Never use alone.
Adj. R²: R² penalized for number of features. Use for model comparison.
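To make the definitions above concrete, here is a small worked example computing each metric by hand. The four target and prediction values are made up, and p = 1 feature is an assumption needed for adjusted R².

```python
import numpy as np

# Toy values (assumed for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])
n, p = len(y_true), 1  # p = number of features in the hypothetical model

mse = np.mean((y_true - y_pred) ** 2)       # mean squared error
rmse = np.sqrt(mse)                         # same units as the target
mae = np.mean(np.abs(y_true - y_pred))      # mean absolute error

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalizes extra features

print(mse, rmse, mae, r2, adj_r2)
```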

Python — sklearn

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Fit OLS
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}  R²: {r2:.4f}")

# Regularized variants — always scale features first
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1, max_iter=10000))

ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

🎯

02 // ALGORITHM

Logistic Regression

// CLASSIFICATION · PROBABILISTIC OUTPUT

Despite the name, logistic regression is a classification algorithm. It passes the linear combination of inputs through a sigmoid function to output a probability between 0 and 1. The decision boundary is a hyperplane, making it a linear classifier. Because it is fit by minimizing log loss, its probabilities tend to be better calibrated than those of many other classifiers, though calibration is not guaranteed and is worth checking.

Sigmoid Function

σ(z) = 1 / (1 + e⁻ᶻ) where z = β₀ + β₁x₁ + … + βₙxₙ

Binary Cross-Entropy Loss (Negative Log-Likelihood)

L = −(1/m) Σ [ yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ) ]

Binary Classification

Two classes: output probability ≥ 0.5 → Class 1. The threshold is tunable: lower it for higher recall (catching more positives), raise it for higher precision. Coefficients are log-odds: exp(βᵢ) is the odds ratio for a one-unit increase in feature i, holding the other features fixed.

Multi-class Extensions

One-vs-Rest (OvR): Train k binary classifiers, pick class with highest probability
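Softmax Regression: Generalizes using softmax activation; outputs sum to 1 across all k classes. Historically selected with multi_class='multinomial'; that parameter is deprecated since sklearn 1.5, where the multinomial formulation is the default for 3+ classes with the default solver.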
Regularization: C parameter in sklearn = 1/λ. Smaller C = stronger regularization.
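The threshold and odds-ratio ideas above can be sketched as follows. The synthetic dataset, the 0.3 threshold, and the variable names are assumptions for illustration; because features are standardized, each odds ratio refers to a one standard deviation increase.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (assumed for illustration)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Smaller C means stronger regularization (C = 1/λ)
clf = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X_tr, y_tr)

# Predicted probability of class 1, then two different decision thresholds
proba = clf.predict_proba(X_te)[:, 1]
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_low = recall_score(y_te, (proba >= 0.3).astype(int))

# exp(coefficient) = odds ratio per one-unit change in the (scaled) feature
odds_ratios = np.exp(clf[-1].coef_.ravel())
print(recall_default, recall_low, odds_ratios)
```

Lowering the threshold can only add predicted positives, so recall never decreases; precision usually pays for it.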

◈

Key assumptions: Logistic regression assumes little or no multicollinearity, linearity of log-odds with continuous features, and independence of observations. Violation degrades calibration but not catastrophically. Always scale features; use C for regularization in production.

🌿

03 // ALGORITHM

Decision Trees

// NON-LINEAR · INTERPRETABLE · RECURSIVE SPLITTING

Decision trees recursively partition the feature space. At each node, the best feature and threshold are chosen to maximally reduce impurity. The result is a flowchart — highly interpretable but prone to overfitting on deep, unconstrained trees. They are the base learner for Random Forests and Gradient Boosting.

Gini Impurity

G = 1 − Σ pᵢ². Measures misclassification probability. Ranges 0 (pure) to 0.5 (max impure binary). Default criterion in sklearn. Computationally cheaper than entropy.

Entropy / Info Gain

H = −Σ pᵢ log₂(pᵢ). Information Gain = parent H − weighted child H. Splits that create pure children maximize gain. Equivalent to maximizing mutual information between feature and label.
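A minimal sketch of the two impurity measures and information gain, computed by hand on a toy label array. The helper names and the example split are hypothetical.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 − Σ pᵢ²."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: −Σ pᵢ log₂(pᵢ)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Perfectly mixed binary node split into two pure children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

print(gini(parent), entropy(parent), information_gain(parent, left, right))
```

A 50/50 binary node sits at the maximum of both measures (Gini 0.5, entropy 1 bit), and a split into pure children recovers the full bit as gain.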

Pruning

Pre-pruning: max_depth, min_samples_split, min_samples_leaf — set before training. Post-pruning: cost-complexity pruning (ccp_alpha in sklearn) — remove branches that don't improve generalization.

⚡

Overfitting risk: An unconstrained tree memorizes training data — achieving 100% training accuracy while failing on test data. Always set max_depth or min_samples_leaf. Use CV to find optimal depth. The real value of trees is as interpretable baselines and building blocks for ensembles.

🌳

04 // ALGORITHM

Random Forests

// ENSEMBLE · BAGGING · VARIANCE REDUCTION

Random Forests aggregate many decision trees trained on random data and feature subsets. By averaging diverse trees, they dramatically reduce variance while maintaining low bias. One of the most reliable off-the-shelf algorithms — high performance, robust to outliers, and provides feature importance out of the box.

How It Works

Bootstrap sampling: Each tree trains on a random sample with replacement (~63% unique samples)
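Feature randomness: At each split, only a random subset of features is considered, commonly √p for classification and p/3 for regression (sklearn's RandomForestRegressor defaults to all features, so set max_features explicitly); this forces diversity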
Aggregation: Classification → majority vote; Regression → mean prediction
Out-of-bag (OOB): The ~37% unused samples form a free validation set. oob_score=True
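The ~63% / ~37% figures above come from sampling n items with replacement: the expected unique fraction is 1 − (1 − 1/n)ⁿ, which tends to 1 − 1/e. A quick simulation, with an arbitrary n and seed chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# One bootstrap sample: n row indices drawn with replacement
idx = rng.integers(0, n, size=n)

unique_frac = np.unique(idx).size / n   # share of rows seen by this tree
oob_frac = 1 - unique_frac              # share left out-of-bag
expected = 1 - (1 - 1 / n) ** n         # analytic value, close to 1 - 1/e

print(unique_frac, oob_frac, expected)
```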

Feature Importance & Tuning

MDI importance: Mean decrease in impurity across all trees. Fast but biased toward high-cardinality features.
Permutation importance: Shuffle each feature, measure score drop. Slower but more reliable — use on val set.
SHAP values: Tree-SHAP is exact and fast — gold standard for attribution.
Key hyperparameters: n_estimators (≥100), max_features, max_depth, min_samples_leaf

Python — sklearn

from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',      # √p features per split
    max_depth=None,           # let trees grow (pruned by min_samples_leaf)
    min_samples_leaf=2,
    oob_score=True,           # free validation estimate
    n_jobs=-1, random_state=42
)
rf.fit(X_train, y_train)
print(f"OOB score: {rf.oob_score_:.4f}")

# Permutation importance (more reliable than .feature_importances_)
result = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=42)
# result.importances_mean — mean drop in score per feature

# For tabular data with NaN, use HistGradientBoosting (sklearn's LightGBM-style)
hgb = HistGradientBoostingClassifier(max_iter=300, learning_rate=0.05, max_depth=5)
hgb.fit(X_train, y_train)

🔍

05 // ALGORITHM

K-Nearest Neighbors

// LAZY LEARNER · NON-PARAMETRIC · DISTANCE-BASED

KNN makes predictions by finding the k most similar training examples (by distance) and aggregating their labels. It's a lazy learner — no training step, no parameters to learn. All computation happens at prediction time. Simple, interpretable, and surprisingly effective on low-dimensional data with meaningful distance metrics.

How It Works

Classification: Majority vote of k nearest neighbors
Regression: Mean (or weighted mean) of k nearest neighbor values
Distance metric: Euclidean by default. Manhattan for high-dim; Cosine for text/sparse
Weighting: weights='distance' gives closer neighbors more influence — usually better

Choosing k

Small k = low bias, high variance (can overfit)
Large k = high bias, low variance (can underfit, ignores local structure)
Rule of thumb: k = √n, then tune with CV
Always use odd k for binary classification (avoids ties)
Curse of dimensionality: Distance becomes meaningless in high-dim space. Use PCA or feature selection first.

⚡

Scale features first: KNN is fully distance-based, so features on larger scales dominate. Always StandardScaler or MinMaxScaler before KNN. Prediction complexity O(n·p) per query — slow for large datasets; use algorithm='ball_tree' or 'kd_tree' for speedup.
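A minimal sketch of the advice above: scale inside a pipeline, then tune k and the weighting scheme with cross-validation. The Iris dataset and the candidate grid are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Candidate k values and weighting schemes (illustrative grid)
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15],
    "knn__weights": ["uniform", "distance"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```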

⚔️

06 // ALGORITHM

Support Vector Machines

// MAXIMUM MARGIN · KERNEL TRICK · ROBUST

SVMs find the hyperplane that maximizes the margin between classes. Only the training points closest to the boundary — support vectors — determine the hyperplane. This makes SVMs robust to outliers far from the boundary. The kernel trick maps data to higher-dimensional spaces implicitly, enabling non-linear classification without computing the transformation.

Objective (Hard Margin)

Maximize 2/‖w‖ subject to: yᵢ(wᵀxᵢ + b) ≥ 1 for all i

Soft Margin (C-SVM) — practical version

Minimize ½‖w‖² + C·Σ ξᵢ — C controls bias-variance tradeoff

Kernel Trick

Maps data to higher-dimensional space implicitly — never computes the transformation
Linear: Use for high-dim/text data where classes are linearly separable
RBF (Gaussian): Default — works for most non-linear problems. Tune C and γ
Polynomial: For polynomial feature interactions

C and γ Tuning

C (regularization): Low C = wide margin (may misclassify); High C = narrow margin (may overfit)
γ (RBF bandwidth): High γ = tight fit (overfit); Low γ = smooth boundary (underfit)
Search C in [0.001, 0.01, 0.1, 1, 10, 100]; γ in [0.001, 0.01, 0.1, 1]
Use GridSearchCV or Bayesian optimization

Kernel	Formula	Use When
Linear	`K(x,z) = xᵀz`	High-dim, text/NLP, linearly separable
RBF / Gaussian	`K(x,z) = exp(−γ‖x−z‖²)`	General non-linear; most common default
Polynomial	`K(x,z) = (γxᵀz + r)ᵈ`	Known polynomial feature relationships
Sigmoid	`K(x,z) = tanh(γxᵀz + r)`	Rarely used; neural-net analogies
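The C and γ search described above might look like the sketch below for an RBF kernel. The two-moons dataset and the reduced grid are assumptions to keep runtime short; a real search would use the wider ranges listed earlier.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-linearly separable toy data (assumed for illustration)
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)

# SVMs are distance-based, so scale inside the pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf")),
])

param_grid = {
    "svc__C": [0.1, 1, 10, 100],     # low C = wider margin, high C = tighter fit
    "svc__gamma": [0.01, 0.1, 1],    # high gamma = more local, wigglier boundary
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# Only the support vectors define the fitted boundary
n_support = int(search.best_estimator_["svc"].n_support_.sum())
print(search.best_params_, round(search.best_score_, 3), n_support)
```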

🔮

07 // ALGORITHM

Clustering — K-Means

// UNSUPERVISED · PARTITIONAL · ITERATIVE

K-Means partitions data into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids as cluster means. It's unsupervised — no labels needed. Convergence is guaranteed but may find a local minimum. Always run multiple initializations (n_init=10+) and use k-means++ initialization.

🎲

Step 01 · Initialize: k centroids (k-means++)
Step 02 · Assign: each point → nearest centroid
Step 03 · Update: centroid = cluster mean
Step 04 · Repeat: until convergence
Step 05 · Converged: k cluster labels
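The assign/update loop can be sketched in a few lines of NumPy. This is an illustrative implementation, not sklearn's: it uses plain random initialization rather than k-means++, and the function name, data, and seeds are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with random initialization (not k-means++)."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: centroid = mean of its points (kept unchanged if cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Converged: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two synthetic blobs (assumed for illustration)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5])])

labels, centroids = kmeans(X, k=2)
inertia = np.sum((X - centroids[labels]) ** 2)  # within-cluster sum of squares
print(centroids, inertia)
```

On well-separated blobs like these a single run typically recovers both groups, but random initialization can land in a poor local minimum, which is why multiple restarts are recommended.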

Choosing K · Key Decision

Elbow method: Plot inertia vs k — pick the "elbow" where adding clusters yields diminishing returns
Silhouette score: How similar a point is to its own cluster vs neighboring clusters. Range −1 to 1; maximize this.
Gap statistic: Compares inertia to a null reference distribution — statistically principled
Domain knowledge: Often the best guide for k
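A minimal sketch of the elbow and silhouette checks, assuming synthetic blob data with four true centers; the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 true centers (assumed for illustration)
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    # inertia_ feeds the elbow plot; the silhouette score is maximized directly
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("k with highest silhouette:", best_k)
```

On data like this the silhouette usually peaks near the true number of blobs, while inertia keeps falling as k grows, which is why inertia alone cannot pick k.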

Limitations & Alternatives

Assumes spherical, equal-sized clusters — fails on elongated or ring shapes
Sensitive to outliers — consider removing before clustering
Must specify k in advance
DBSCAN: Density-based, arbitrary shapes, finds outliers automatically
Gaussian Mixture Models: Soft probabilistic cluster assignments
Hierarchical: No k needed; produces a dendrogram

✅

08 // METRICS

Accuracy

// OVERALL CORRECTNESS · BASELINE METRIC

Accuracy measures the fraction of predictions the model got right. It's the most intuitive metric but can be deeply misleading on imbalanced datasets. A model that always predicts "Not Fraud" on 99% non-fraud data achieves 99% accuracy while being completely useless.

Accuracy

Correct predictions divided by total. Works well only when classes are balanced and all error types carry equal cost. Otherwise: a lie.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

⚡

Accuracy Paradox: On imbalanced data (95:5 class split), predicting only the majority class achieves 95% accuracy. Always accompany accuracy with precision, recall, and F1. For severely imbalanced tasks, use balanced accuracy: (Sensitivity + Specificity) / 2, or the Matthews Correlation Coefficient (MCC).
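The paradox is easy to reproduce. The sketch below builds the 95:5 split described above and scores a "model" that always predicts the majority class; the arrays are constructed purely for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, matthews_corrcoef

# 95 negatives, 5 positives; the "model" always predicts the majority class
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)            # looks excellent
bal = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall
mcc = matthews_corrcoef(y_true, y_pred)         # 0 when predictions carry no information

print(acc, bal, mcc)
```

Accuracy comes out at 0.95, while balanced accuracy (0.5) and MCC (0.0) both expose the model as uninformative.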

⚖️

09 // METRICS

Precision / Recall

// THE FUNDAMENTAL TRADEOFF

Precision and Recall measure complementary aspects of a classifier and exist in a fundamental tradeoff. Adjusting the decision threshold moves you along the precision-recall curve. The business cost of FP vs FN determines which you prioritize.

Precision

Of all predicted positives, what fraction were actually positive? High precision = few false alarms. Use when false positives are costly (spam filter, recommendation).

Precision = TP / (TP + FP)

Recall (Sensitivity)

Of all actual positives, what fraction were caught? High recall = few misses. Use when false negatives are costly (cancer screening, fraud detection).

Recall = TP / (TP + FN)

Optimize for Precision when…

False alarms are expensive (irrelevant ads, spam blocking)
Each positive action triggers significant cost
User trust is fragile (a wrong email must never be sent)

Optimize for Recall when…

Missing a positive is dangerous (disease, fraud, intrusion)
Cost of false negative >> cost of false positive
You can tolerate more false alarms to catch everything real

🎯

10 // METRICS

F1 Score

// HARMONIC MEAN · BALANCED METRIC

The F1 score is the harmonic mean of Precision and Recall. It rewards models that are good at both — penalizing heavily when either is low. It's the go-to metric for imbalanced classification when you want a single number.

F1 Score

Harmonic mean of Precision and Recall. Range 0–1; 1 = perfect. The harmonic mean is used because it penalizes extreme imbalances between Precision and Recall more than an arithmetic mean would.

F1 = 2 · (Precision · Recall) / (Precision + Recall) = 2TP / (2TP + FP + FN)

Fβ Score — Weighted Tradeoff

Fβ weights recall β times more than precision. β=1 (standard F1), β=2 (recall matters more), β=0.5 (precision matters more). Use Fβ when the Precision-Recall tradeoff is asymmetric in business value.

Fβ = (1+β²) · Precision·Recall / (β²·Precision + Recall)
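A short worked example of the Fβ formula, using made-up counts in which precision (0.80) exceeds recall (about 0.67). The helper name `f_beta` is hypothetical.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """Fβ from raw counts: (1+β²)·P·R / (β²·P + R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: precision = 80/100, recall = 80/120
tp, fp, fn = 80, 20, 40

f1 = f_beta(tp, fp, fn, beta=1.0)    # balanced
f2 = f_beta(tp, fp, fn, beta=2.0)    # leans toward recall
f05 = f_beta(tp, fp, fn, beta=0.5)   # leans toward precision

print(round(f1, 4), round(f2, 4), round(f05, 4))
```

Because recall is the weaker of the two here, the recall-weighted F2 is the lowest score and the precision-weighted F0.5 the highest.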

Multi-class Averaging

Macro: Average F1 per class — treats each class equally. Use when all classes matter equally.
Weighted: Average weighted by class frequency. Use for imbalanced multi-class.
Micro: Global TP/FP/FN across all classes. Equals accuracy in single-label multi-class problems, so it is dominated by the frequent classes.
f1_score(y, ŷ, average='weighted')

🗂️

11 // METRICS

Confusion Matrix

// FULL ERROR BREAKDOWN · CLASSIFICATION

The confusion matrix shows the complete breakdown of correct and incorrect predictions by class. It's the foundation for all classification metrics — precision, recall, F1, and specificity all derive from its four cells. Always examine the confusion matrix before reporting a single aggregate metric.

Actual \ Predicted	Predicted Positive	Predicted Negative
Actual Positive	TP — True Positive ✓	FN — False Negative ✗
Actual Negative	FP — False Positive ✗	TN — True Negative ✓

Derived Metrics

Sensitivity/Recall: TP / (TP+FN) — "How many positives did we catch?"
Specificity: TN / (TN+FP) — "How many negatives did we correctly reject?"
FPR (Fall-out): FP / (FP+TN) — used for ROC curve x-axis
MCC: Matthews Correlation Coefficient — robust for imbalanced classes. Range [−1, 1]; 1 = perfect.

Reading Multi-class Matrices

Diagonal cells = correct predictions — want high values
Off-diagonal = misclassifications — inspect which classes confuse the model
Row = actual class; Column = predicted class
Normalize by row (normalize='true') to see per-class recall rates regardless of class size
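A minimal sketch with sklearn, using ten hand-written labels. One caveat worth seeing in code: with `labels=[0, 1]` sklearn orders the binary matrix as [[TN, FP], [FN, TP]], which differs from the positive-first table layout; passing `labels=[1, 0]` reproduces that layout.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (assumed for illustration): 4 positives, 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# sklearn layout: rows = actual, columns = predicted, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # recall
specificity = tn / (tn + fp)
fpr = fp / (fp + tn)           # 1 - specificity

# Positive class first: [[TP, FN], [FP, TN]]
cm_pos_first = confusion_matrix(y_true, y_pred, labels=[1, 0])

# Row-normalized: the diagonal holds per-class recall
cm_norm = confusion_matrix(y_true, y_pred, labels=[0, 1], normalize="true")

print(cm, sensitivity, specificity, fpr)
```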

📊

12 // METRICS

ROC-AUC

// THRESHOLD-INDEPENDENT · RANKING QUALITY

The ROC curve plots True Positive Rate vs False Positive Rate across all decision thresholds. The Area Under the Curve (AUC) summarizes this into a single number representing the probability that the model ranks a random positive example higher than a random negative one. It's threshold-independent — a measure of ranking quality, not absolute prediction.

AUC-ROC

Probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance. Random classifier = 0.5. Perfect classifier = 1.0. Interpretation: AUC = 0.85 means the model correctly ranks 85% of positive-negative pairs.

AUC = P(score(positive) > score(negative)) [Wilcoxon-Mann-Whitney statistic]

When to Use AUC-ROC

Binary classification with probabilistic output
Comparing classifiers independent of threshold choice
Dataset is balanced (or weights compensate)
You care about ranking quality, not absolute predictions

AUC-PR for Imbalanced Data

On severely imbalanced data, ROC-AUC can be optimistically misleading
Precision-Recall AUC (Average Precision) is more informative when positives are rare
PR curve: Precision (y) vs Recall (x) — area measures how well model ranks relevant items
average_precision_score in sklearn
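The pairwise-ranking interpretation of AUC can be checked directly: count the positive-negative pairs the model orders correctly and compare with sklearn. The eight scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy labels and model scores (assumed for illustration)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Every (positive, negative) pair: a win if the positive scores higher, half for a tie
diff = pos[:, None] - neg[None, :]
auc_pairs = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_sklearn = roc_auc_score(y_true, scores)
ap = average_precision_score(y_true, scores)   # area under the PR curve

print(auc_pairs, auc_sklearn, ap)
```

Here 12 of the 16 pairs are ordered correctly, so both computations give 0.75.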

⚙️

13 // WORKFLOW

Training Pipeline

// END-TO-END · PREPROCESSING TO PREDICTION

A production ML pipeline chains data preprocessing and modeling into a single estimator. This prevents data leakage, enables proper cross-validation, and makes deployment a single object — not a sequence of manual steps. Always use Pipelines in production code.

🗄️

Step 01 · Load Data: raw features + labels
Step 02 · Split: Train / Val / Test
Step 03 · Preprocess: impute, scale, encode (fit on train only)
Step 04 · Train: fit on train set only
Step 05 · Evaluate: CV metrics, test set
Step 06 · Deploy: joblib.dump(pipe)

Python — Full Pipeline

from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute           import SimpleImputer
from sklearn.ensemble         import HistGradientBoostingClassifier
import joblib

num_cols = ['age', 'salary', 'tenure']
cat_cols = ['department', 'region']

num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
])

full_pipe = Pipeline([
    ('prep',  preprocessor),
    ('model', HistGradientBoostingClassifier(max_iter=300)),
])

full_pipe.fit(X_train, y_train)
joblib.dump(full_pipe, 'model.pkl')   # saves entire fitted pipeline

# ⚠  Critical: fit_transform() on train; transform() only on test
# Pipeline enforces this automatically inside cross_validate()

🔬

14 // WORKFLOW

Validation Strategy

// BIAS-VARIANCE · SPLIT DESIGN · LEAKAGE

Validation is how you measure generalization. The test set must remain unseen until final evaluation — it's your one honest estimate. The validation (dev) set is used for model selection and hyperparameter tuning. Getting these boundaries wrong is the root cause of most "good in dev, bad in production" failures.

Bias-Variance Tradeoff

High bias (underfitting): Model too simple — both train and val error are high. Fix: more features, more complex model, less regularization.
High variance (overfitting): Train error low, val error high. Fix: more data, regularization, simpler model, dropout.
Target: Low val error and small train/val gap. The Goldilocks zone.
Total Error = Bias² + Variance + Irreducible Noise

Data Leakage — The Silent Killer

Temporal leakage: Future information in training data (e.g., next-day price predicting today)
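Target leakage: Features that are proxies for the target or are only known after the outcome occurs (e.g., a "refund issued" flag when predicting fraud), so they are unavailable at prediction time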
Preprocessing leakage: Fitting StandardScaler on all data before splitting — test statistics bleed into training
Fix: Always fit transformers on train set only. Use sklearn Pipelines inside cross-validation.

🎛️

15 // WORKFLOW

Hyperparameter Tuning

// GRID · RANDOM · BAYESIAN OPTIMIZATION

Hyperparameters control model structure and training — they're set before fitting, not learned from data. Tuning finds the combination that maximizes cross-validated score. The method you choose depends on the search space size and compute budget.

Grid Search · Exhaustive

Tries all combinations in a defined grid. Guarantees finding the best within the grid. Exponentially expensive — only practical with 2-3 hyperparameters and narrow ranges. Use GridSearchCV.

Random Search · Efficient

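Samples randomly from distributions over each hyperparameter. Often finds near-optimal configurations with far fewer evaluations than grid search, because only a few hyperparameters usually matter and random sampling covers each of them more densely. The standard approach for medium search spaces. Use RandomizedSearchCV with scipy distributions.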

Bayesian Optimization · Smart

Builds a probabilistic surrogate model of the objective function and uses it to select the next most promising configuration. Significantly fewer evaluations needed. Use Optuna or scikit-optimize. Best for expensive-to-evaluate models.
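For the random-search option, a minimal sketch with RandomizedSearchCV and scipy distributions might look like this. The dataset, parameter ranges, and the small n_iter are assumptions chosen to keep the run short.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary problem (assumed for illustration)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Distributions to sample from; randint excludes the upper bound
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.1, 0.9),   # uniform(loc, scale) covers [0.1, 1.0]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=10,          # number of sampled configurations
    cv=3,
    scoring="f1",
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```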

Python — Optuna Bayesian Tuning

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators':    trial.suggest_int('n_estimators', 50, 500),
        'max_depth':       trial.suggest_int('max_depth', 3, 30),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
        'max_features':    trial.suggest_float('max_features', 0.1, 1.0),
    }
    model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    return cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)   # trials run sequentially: the forest already uses all cores (n_jobs=-1)
print(f"Best F1: {study.best_value:.4f}")
print(study.best_params)

🔁

16 // WORKFLOW

Cross-Validation

// ROBUST ESTIMATION · EVERY SAMPLE AS VALIDATION

Cross-validation provides a more reliable estimate of generalization than a single train/val split by rotating which data is used for validation. Every sample is validated exactly once. The result — mean ± standard deviation across folds — quantifies both performance and stability.

CV Strategies

k-Fold: k equal folds, rotate validation. k=5 or k=10 standard. General purpose.
Stratified k-Fold: Preserves class proportions in each fold. Always use for classification.
LOO: k=n. Lowest bias, highest variance, very slow. Only for tiny datasets.
Time-Series CV: Train on past, validate on future. Never shuffle. Use TimeSeriesSplit.
Group k-Fold: Ensures same patient/user/entity is not in both train and val — prevents leakage.
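Group k-Fold is the strategy most often skipped, so here is a minimal sketch showing that no group appears on both sides of any split. The patient IDs and data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = np.array([0, 1] * 6)
groups = np.repeat([101, 102, 103, 104], 3)   # hypothetical patient IDs, 3 rows each

gkf = GroupKFold(n_splits=4)
overlaps = []
for train_idx, val_idx in gkf.split(X, y, groups):
    # Groups present in both train and validation would indicate leakage
    shared = set(groups[train_idx]) & set(groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)
```

The same `groups` array can be passed to `cross_val_score(..., groups=groups, cv=gkf)`.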

Interpreting CV Results

Mean score: Expected performance on unseen data
Std deviation: Model stability. High std = unstable; consider more data or simpler model
Report: "F1 = 0.87 ± 0.03 (5-fold stratified CV)" — complete, honest result
Nested CV: Outer loop evaluates generalization; inner loop selects hyperparameters — the only unbiased approach for combined selection + evaluation
Final model: After CV, retrain on ALL training data. Evaluate once on held-out test set.

Python — Cross-Validation Patterns

from sklearn.model_selection import (
    cross_val_score, cross_validate, StratifiedKFold, TimeSeriesSplit
)
import numpy as np

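# Assumes clf (an unfitted classifier), param_dist, X_train, y_train, X, and y are defined elsewhere
# Stratified 5-fold: standard for classification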
skf    = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=skf, scoring='f1', n_jobs=-1)
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Multiple metrics in one pass
cv_results = cross_validate(clf, X_train, y_train, cv=skf,
    scoring=['f1', 'roc_auc', 'precision', 'recall'],
    return_train_score=True   # detect overfitting: train >> val?
)

# Time-series: always train past, validate future
tscv     = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(clf, X, y, cv=tscv, scoring='roc_auc')

# Nested CV for unbiased hyperparameter search + evaluation
from sklearn.model_selection import RandomizedSearchCV
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=5)
search   = RandomizedSearchCV(clf, param_dist, cv=inner_cv, n_iter=20)
nested   = cross_val_score(search, X_train, y_train, cv=outer_cv, scoring='f1')
print(f"Nested CV F1: {nested.mean():.3f} ± {nested.std():.3f}")

▸

Comparing models statistically: If Model A achieves F1 = 0.85 ± 0.08 and Model B achieves F1 = 0.84 ± 0.02, Model B may be the better production choice despite the lower mean, because its performance is more stable across folds. When differences are small, use the Wilcoxon signed-rank test on paired fold scores, or McNemar's test on the two models' predictions for the same test set.

📐

17 // FOUNDATIONS

Statistics in Machine Learning

// DESCRIPTIVE · INFERENTIAL · PROBABILISTIC

Machine learning is applied statistics at scale. Before you can understand why a model works — or why it fails — you need fluency in descriptive statistics, probability distributions, hypothesis testing, and Bayesian thinking. These aren't optional extras; they're the formal language of ML.

//Descriptive Statistics

Descriptive statistics summarize and describe a dataset's properties. Every EDA (Exploratory Data Analysis) begins here — before you feed data to any model, you must understand its shape, center, spread, and outliers.

Mean

μ = Σxᵢ/n. The arithmetic average. Sensitive to outliers — a single extreme value can skew it significantly.

Median

Middle value when sorted. Robust to outliers. Preferred for skewed distributions (income, house prices).

σ²

Variance

σ² = Σ(xᵢ−μ)²/n for a population; divide by n−1 for the unbiased sample estimate. Average squared deviation from the mean. Measures spread: larger means more dispersed data.

Std Dev

σ = √σ². Same units as the data. 68% of normal data falls within μ ± σ; 95% within μ ± 2σ.

γ₁

Skewness

Measures asymmetry. Positive skew = right tail (income). Negative = left tail. Zero = symmetric. Alerts you to non-normal distributions.

γ₂

Kurtosis

Measures tail heaviness. High kurtosis = heavy tails, more extreme outliers. Normal distribution: kurtosis = 3 (excess kurtosis = 0).

Python — Descriptive Statistics

import numpy as np
import pandas as pd
from scipy import stats

# All-in-one: pandas describe()
df['feature'].describe()
# → count, mean, std, min, 25%, 50%, 75%, max

# Manual descriptive stats
x = df['salary'].values
print(f"Mean:     {np.mean(x):.2f}")
print(f"Median:   {np.median(x):.2f}")
print(f"Std Dev:  {np.std(x, ddof=1):.2f}")   # ddof=1 → sample std
print(f"Variance: {np.var(x, ddof=1):.2f}")
print(f"Skewness: {stats.skew(x):.3f}")
print(f"Kurtosis: {stats.kurtosis(x):.3f}")   # excess kurtosis (Fisher)
print(f"IQR:      {np.percentile(x,75) - np.percentile(x,25):.2f}")

# Outlier detection: |z-score| > 3 or IQR rule
z_scores = np.abs(stats.zscore(x))
outliers_z = x[z_scores > 3]

Q1, Q3 = np.percentile(x, [25, 75])
IQR = Q3 - Q1
outliers_iqr = x[(x < Q1 - 1.5*IQR) | (x > Q3 + 1.5*IQR)]

//Probability Distributions

Distributions describe how data is generated. Knowing which distribution governs your data tells you which loss functions are appropriate, what assumptions a model is making, and how to interpret its outputs. Most ML algorithms implicitly assume a distribution — make that assumption explicit.

Continuous

Normal (Gaussian)

X ~ N(μ, σ²)   PDF: f(x) = 1/(σ√(2π)) · exp(−(x−μ)² / (2σ²))

The bell curve. Central Limit Theorem makes it ubiquitous. Linear regression residuals, activations in neural networks, additive noise. Mean = Median = Mode. Characterized entirely by μ and σ.

Discrete

Bernoulli & Binomial

Bernoulli: P(X=1) = p Binomial: P(X=k) = C(n,k)·pᵏ·(1−p)ⁿ⁻ᵏ

Single binary trial (Bernoulli) or n independent binary trials (Binomial). Logistic regression models the Bernoulli parameter. Basis for binary cross-entropy loss. For the Binomial, E[X] = np and Var[X] = np(1−p); the Bernoulli is the case n = 1.

Count Data

Poisson

P(X=k) = λᵏ · e^(−λ) / k! where λ = expected count

Models event counts in a fixed interval. Clicks per hour, calls per day, defects per unit. Mean = Variance = λ. When λ is large, approaches Normal. Use Poisson regression for count targets.

Bayesian

Beta Distribution

X ~ Beta(α, β) on [0,1] E[X] = α/(α+β)

Distribution over probabilities — the "probability of a probability." Natural conjugate prior for Bernoulli/Binomial. Used in Thompson sampling, A/B testing, Bayesian classifiers. α−1 prior successes, β−1 prior failures.
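Conjugacy means the posterior is again a Beta: starting from Beta(α, β) and observing s successes and f failures gives Beta(α + s, β + f). A minimal sketch with scipy, using a uniform prior and made-up A/B counts:

```python
from scipy import stats

# Uniform prior Beta(1, 1): no prior successes or failures
alpha_prior, beta_prior = 1, 1

# Hypothetical observations: 30 conversions out of 100 visitors
successes, failures = 30, 70

# Conjugate update: add the observed counts to the prior parameters
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures
posterior = stats.beta(alpha_post, beta_post)

post_mean = posterior.mean()                       # α / (α + β)
ci_low, ci_high = posterior.ppf([0.025, 0.975])    # 95% credible interval

print(round(post_mean, 4), round(ci_low, 3), round(ci_high, 3))
```

Thompson sampling uses the same posterior: draw one sample per variant with `posterior.rvs()` and serve the variant with the highest draw.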

//Hypothesis Testing

Hypothesis testing is how you decide if a pattern in data is real or just noise. In ML: used for feature selection, A/B testing model changes, and statistical model comparison. The framework: formulate a null hypothesis H₀ (no effect), collect evidence, compute how surprised you would be under H₀.

The Core Framework

H₀ (null hypothesis): No effect, no difference — the default assumption
H₁ (alternative): The effect you're trying to detect
p-value: P(data this extreme | H₀ is true). Low p → evidence against H₀
α (significance level): Threshold for "surprising enough" — typically 0.05. If p < α, reject H₀.
Type I error (α): Rejecting H₀ when it's true (false positive)
Type II error (β): Failing to reject H₀ when it's false (false negative)
Power (1−β): Probability of detecting a real effect when it exists

Tests Used in ML

t-test: Compare means of two groups (e.g., A/B test on model accuracy). Assumes normality — robust for n>30 by CLT.
Chi-squared (χ²): Independence of categorical features and target. Feature selection for text/categorical data.
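ANOVA / F-test: Compare means across 3+ groups. In sklearn, f_classif with SelectKBest scores continuous features against a class label; use f_regression for a continuous target.
Wilcoxon signed-rank: Non-parametric paired test. Often preferred over the t-test for CV score comparison because it does not assume normally distributed differences.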
Kolmogorov-Smirnov: Test if two samples follow the same distribution. Data drift detection.
McNemar's test: Compare two classifiers on the same test set — accounts for correlated errors.

Python — Hypothesis Tests

from scipy import stats
from sklearn.feature_selection import chi2, f_classif, SelectKBest

# t-test: do model A and B have different accuracies?
scores_a = [0.85, 0.87, 0.83, 0.86, 0.84]   # 5-fold CV
scores_b = [0.88, 0.89, 0.87, 0.90, 0.88]
t_stat, p_val = stats.ttest_rel(scores_a, scores_b)  # paired t-test
print(f"t={t_stat:.3f}, p={p_val:.4f}")  # p < 0.05 → significantly different

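# Non-parametric alternative (often preferred for CV scores)
# Note: with only 5 paired scores the exact two-sided p-value cannot fall below 0.0625; use more folds or repeated CV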
stat, p_wil = stats.wilcoxon(scores_a, scores_b)

# Chi-squared feature selection (classification, categorical features)
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X_counts, y)

# ANOVA F-test feature selection (classification, continuous features)
selector_f = SelectKBest(f_classif, k=10)
X_new_f = selector_f.fit_transform(X_scaled, y)

# Distribution test: check if train/test have same distribution (drift detection)
ks_stat, ks_p = stats.ks_2samp(X_train[:,0], X_test[:,0])
if ks_p < 0.05:
    print("Warning: distribution shift detected in feature 0")

//Bayesian Statistics & Correlation

Bayes' Theorem · Fundamental

Posterior = Likelihood × Prior / Evidence

P(H|D) = P(D|H) · P(H) / P(D)

Prior P(H): Your belief before seeing data
Likelihood P(D|H): How probable is this data given hypothesis H?
Posterior P(H|D): Updated belief after seeing data
MLE: Maximizes likelihood and ignores the prior. MAP adds the prior: with Gaussian noise, MLE gives OLS, while MAP with a Gaussian or Laplace prior on the weights gives Ridge or Lasso respectively.
Naive Bayes: Assumes feature independence given class. Despite being "naive," highly effective for text.
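A worked instance of Bayes' theorem, using a hypothetical diagnostic test. The prevalence, sensitivity, and false positive rate are invented numbers chosen to show how a low prior dominates the posterior.

```python
# Hypothetical screening test
prior = 0.01            # P(H): 1% of people have the condition
sensitivity = 0.95      # P(D|H): positive result given the condition
false_pos_rate = 0.05   # P(D|not H): positive result without the condition

# Evidence P(D): total probability of a positive result
evidence = sensitivity * prior + false_pos_rate * (1 - prior)

# Posterior P(H|D) = P(D|H) · P(H) / P(D)
posterior = sensitivity * prior / evidence

print(round(posterior, 3))
```

Even with a 95% sensitive test, a positive result implies only about a 16% chance of having the condition, because true positives are swamped by false positives from the much larger healthy group.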

Correlation & Covariance

Covariance: Cov(X,Y) = E[(X−μₓ)(Y−μᵧ)]. Scale-dependent — hard to interpret directly.
Pearson r: r = Cov(X,Y)/(σₓ·σᵧ). Range [−1,1]. Measures linear association. Sensitive to outliers.
Spearman ρ: Rank correlation. Robust to outliers and non-linear monotone relationships.
Point-Biserial: Correlation between a binary and continuous variable.
Multicollinearity: High correlation between features (|r| > 0.9). Makes coefficients unstable in linear models. Detect via VIF (Variance Inflation Factor).
Correlation ≠ Causation. Always.
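
The difference between Pearson and Spearman, and the scale dependence of covariance, can be checked on synthetic data. This is a minimal sketch; the exponential relationship and the factor of 100 are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

x = np.linspace(0, 5, 50)
y_mono = np.exp(x)                 # monotone but strongly non-linear

pearson_r, _ = stats.pearsonr(x, y_mono)
spearman_rho, _ = stats.spearmanr(x, y_mono)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")

# Covariance depends on units; correlation does not
cov_xy = np.cov(x, y_mono)[0, 1]
cov_scaled = np.cov(100 * x, y_mono)[0, 1]      # e.g. metres -> centimetres
r_scaled, _ = stats.pearsonr(100 * x, y_mono)
print(cov_xy, cov_scaled, r_scaled)
```

Spearman sees a perfect monotone relationship, while Pearson is pulled below 1 by the curvature; rescaling x multiplies the covariance but leaves r unchanged.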

//Central Limit Theorem & Confidence Intervals

Central Limit Theorem (CLT)

X̄ ~ N(μ, σ²/n) as n → ∞, regardless of the original distribution

The mean of a sufficiently large sample of independent, identically distributed draws with finite variance is approximately normally distributed, regardless of the underlying population distribution. This is why many ML methods (and t-tests) work even when individual data points aren't normally distributed. A common rule of thumb for "sufficient" is n ≥ 30, although heavily skewed or heavy-tailed data can need more.
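
The CLT is easy to check by simulation. The sketch below draws many samples from a skewed exponential distribution (an arbitrary choice, as are the seed, n = 50, and the number of trials) and compares the spread of the sample means with the σ/√n prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 50, 20_000

# Exponential(scale=1) is heavily skewed: mean 1, standard deviation 1
samples = rng.exponential(scale=1.0, size=(n_trials, n))
sample_means = samples.mean(axis=1)

print("mean of sample means:", sample_means.mean())     # close to mu = 1
print("std of sample means: ", sample_means.std())      # close to sigma / sqrt(n)
print("CLT prediction:      ", 1.0 / np.sqrt(n))
```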

Confidence Intervals

95% CI: x̄ ± 1.96 · (σ/√n). Interpretation: if you repeated this experiment many times, about 95% of the resulting CIs would contain the true μ.
Margin of error: Shrinks in proportion to 1/√n. Quadruple the sample size to halve the CI width
Standard Error: SE = σ/√n — std dev of the sampling distribution
Bootstrap CI: Model-free. Resample with replacement 1000× and take percentiles. No distributional assumptions needed.
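
The normal-approximation interval and the percentile bootstrap can be compared side by side. This is a sketch on a synthetic skewed sample; the lognormal data, the seed, and the use of the sample standard deviation in place of σ are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=0.5, size=200)   # hypothetical skewed sample

# Normal-approximation CI, using the sample std in place of sigma
se = data.std(ddof=1) / np.sqrt(len(data))
ci_normal = (data.mean() - 1.96 * se, data.mean() + 1.96 * se)

# Percentile bootstrap CI: resample with replacement, recompute the mean each time
boot_idx = rng.integers(0, len(data), size=(1000, len(data)))
boot_means = data[boot_idx].mean(axis=1)
ci_boot = tuple(np.percentile(boot_means, [2.5, 97.5]))

print("normal CI:   ", ci_normal)
print("bootstrap CI:", ci_boot)
```

For a mean with n = 200 the two intervals should nearly coincide; the bootstrap earns its keep for statistics without a simple standard-error formula, such as a median or an AUC.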

ML Applications

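CV uncertainty: Report mean ± 1.96·(std/√k) as a rough k-fold CI on model performance. For small k, use the t-quantile with k−1 degrees of freedom instead of 1.96, and treat the interval as optimistic because folds share training data
A/B test sample size: n = 2·(z_α/2 + z_β)² · p(1−p) / δ² per group, where p is the baseline rate and δ is the minimum detectable effect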
Bootstrap importance: Bootstrap permutation importance distributions for feature significance
Calibration: Well-calibrated model: 70% confidence predictions are correct 70% of the time
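
The A/B sample-size formula above can be turned into a small helper. This is a sketch under the usual assumptions (two-sided test, equal group sizes, baseline rate p used for both variances); the function name `ab_sample_size` and the example inputs are illustrative, not part of any library.

```python
import math
from scipy.stats import norm

def ab_sample_size(p, delta, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # z_{alpha/2}, about 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)            # z_beta, about 0.84 for 80% power
    n = 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / delta ** 2
    return math.ceil(n)

# Hypothetical scenario: 10% baseline conversion, detect a 2-point absolute lift
n_per_group = ab_sample_size(p=0.10, delta=0.02)
print(n_per_group)
```

Because δ enters squared, halving the minimum detectable effect roughly quadruples the required sample size.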

◈

Statistics for ML practitioners: Must-reads — "Think Stats" (Downey, free online), "Statistical Learning Theory" (Vapnik), "Introduction to Statistical Learning" (ISLR, free PDF at statlearning.com). For Bayesian ML: "Probabilistic Machine Learning" (Murphy, free online at probml.ai).

∑

18 // FOUNDATIONS

Mathematics in Machine Learning

// LINEAR ALGEBRA · CALCULUS · OPTIMIZATION · INFORMATION THEORY

Every ML algorithm is mathematics in disguise. Linear regression is matrix least squares. Neural networks are composed functions differentiated via the chain rule. PCA is eigendecomposition. Understanding the math tells you when an algorithm will fail, how to debug it, and how to adapt it for new problems.

//Linear Algebra

Linear algebra is the language of data. Features are vectors, datasets are matrices, transformations are matrix multiplications. Intuition about matrix operations is the single most important mathematical skill for ML practitioners.

Vectors & Matrices

Vector: An ordered list of numbers — one data point in n-dimensional space. Notation: x ∈ ℝⁿ
Dot product: x·y = Σxᵢyᵢ = ‖x‖‖y‖cos(θ). Measures similarity — the core operation of attention and KNN.
Norms: L2: ‖x‖₂ = √(Σxᵢ²) — Euclidean length. L1: ‖x‖₁ = Σ|xᵢ| — Manhattan. L∞: max|xᵢ|
Matrix multiply: (AB)ᵢⱼ = Σₖ Aᵢₖ Bₖⱼ. Only valid when inner dims match: (m×k)(k×n) → (m×n)
Transpose: (Aᵀ)ᵢⱼ = Aⱼᵢ. Flips matrix. (AB)ᵀ = BᵀAᵀ.
Inverse: A⁻¹A = I. Exists only for square, full-rank matrices. Used in OLS: β = (XᵀX)⁻¹Xᵀy

Eigenvalues & SVD

Eigenvalue equation: Av = λv. Eigenvector v is unchanged in direction by A; scaled by λ.
Covariance matrix: Σ = (1/n)XᵀX for mean-centered X. Symmetric and positive semi-definite → real, non-negative eigenvalues and orthogonal eigenvectors.
PCA: Eigenvectors of Σ are principal components (directions of max variance). Eigenvalues = variance along each component.
SVD: X = UΣVᵀ. U = left singular vectors, Σ = diagonal matrix of singular values (not the covariance matrix above), V = right singular vectors. Works on any matrix.
Low-rank approximation: Keep top k singular values — compresses X while preserving most variance. Foundation for collaborative filtering, LSA, and matrix factorization.

Python — Linear Algebra with NumPy

import numpy as np

# Vectors
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
dot     = np.dot(x, y)              # 32 — dot product
l2_norm = np.linalg.norm(x)       # √14 ≈ 3.742
cosine  = dot / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine similarity

# Matrices
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
C = A @ B                            # matrix multiply: [[19,22],[43,50]]
A_inv = np.linalg.inv(A)

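# OLS solution via linear algebra
# β = (XᵀX)⁻¹Xᵀy is the closed form; lstsq solves the same least-squares
# problem without forming the inverse, which is numerically more stable.
# X: (m, n) feature matrix, y_target: (m,) targets, both assumed defined
# elsewhere (y_target avoids reusing the 3-element vector y from above).
beta = np.linalg.lstsq(X, y_target, rcond=None)[0]

# Eigendecomposition (PCA)
cov_matrix = np.cov(X_scaled.T)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)   # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]                    # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
explained_var = eigenvalues / eigenvalues.sum()

# SVD: works on any matrix (not just square)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 10
X_reconstructed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # rank-k approximation of X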
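
The link between PCA and SVD stated above can be verified numerically: for mean-centered data, the covariance eigenvalues equal the squared singular values divided by n − 1 (the divisor `np.cov` uses by default). The sketch below uses synthetic data; `X_demo` and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic features
Xc = X_demo - X_demo.mean(axis=0)                              # centre each column
n_samples = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix (eigh returns ascending order)
evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
evals = evals[::-1]

# Route 2: SVD of the centred data matrix
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_variances = S**2 / (n_samples - 1)

print(np.allclose(evals, svd_variances))
```

The top right-singular vector `Vt[0]` also matches the leading eigenvector up to sign, which is why libraries commonly compute PCA through SVD rather than forming the covariance matrix.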

//Calculus — Gradients & Chain Rule

Calculus powers model training. The gradient is the multi-dimensional derivative — it points uphill in parameter space. Minimizing a loss function means following the negative gradient. Backpropagation is the chain rule applied recursively through a computational graph.

Gradient

∇J(θ) = [∂J/∂θ₁, ∂J/∂θ₂, ..., ∂J/∂θₙ]ᵀ

The gradient is a vector of partial derivatives — one per parameter. It points in the direction of steepest ascent. We subtract it to minimize the loss: θ ← θ − α·∇J(θ).

Chain Rule (Backpropagation)

dL/dw = (dL/dz) · (dz/dw)

Compose derivatives through a function chain. For a neural network: the gradient of the loss w.r.t. weights in layer l = error signal from layer l+1 × local gradient at layer l. Applied recursively to all layers.
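
A single sigmoid neuron with log-loss is small enough to apply the chain rule by hand and confirm it with a finite-difference check. This is a minimal sketch; the scalar values of `w`, `b`, `x`, and `y` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    y_hat = sigmoid(w * x + b)                      # z = w*x + b, y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

w, b, x, y = 0.7, -0.2, 1.5, 1.0                    # hypothetical values

# Chain rule: dL/dw = (dL/dy_hat) * (dy_hat/dz) * (dz/dw)
y_hat = sigmoid(w * x + b)
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)
dyhat_dz = y_hat * (1 - y_hat)
dz_dw = x
grad_chain = dL_dyhat * dyhat_dz * dz_dw            # simplifies to (y_hat - y) * x

# Central finite difference as an independent check
h = 1e-6
grad_numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
print(grad_chain, grad_numeric)
```

The same gradient check is a practical debugging tool for any hand-written backward pass.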

Key Partial Derivatives in ML

MSE Loss: ∂J/∂θ = (1/m)Xᵀ(Xθ − y) for J = (1/2m)‖Xθ − y‖². The gradient is linear in θ and J is convex, so any minimum is global (unique when XᵀX is invertible)
BCE Loss: ∂L/∂z = ŷ − y for logistic regression — gradient of log-loss w.r.t. linear output
ReLU: ∂/∂x = 1 if x>0, else 0 — vanishes for negative inputs (dying ReLU)
Sigmoid: dσ/dz = σ(z)(1−σ(z)) — self-referential, elegant, but saturates near 0 and 1
Softmax: ∂Sᵢ/∂zⱼ = Sᵢ(δᵢⱼ − Sⱼ). Used with cross-entropy in multi-class output, where the combined gradient w.r.t. the logits simplifies to S − y for one-hot y
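
The softmax derivative is a full Jacobian matrix with entries ∂Sᵢ/∂zⱼ = Sᵢ(δᵢⱼ − Sⱼ), which in matrix form is diag(S) − SSᵀ. The sketch below checks that expression against finite differences; the logits are arbitrary example values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])        # hypothetical logits
S = softmax(z)

# Analytic Jacobian: J[i, j] = dS_i/dz_j = S_i * (delta_ij - S_j)
J_analytic = np.diag(S) - np.outer(S, S)

# Finite-difference Jacobian, one column per perturbed logit
h = 1e-6
J_numeric = np.column_stack([
    (softmax(z + h * np.eye(3)[j]) - softmax(z - h * np.eye(3)[j])) / (2 * h)
    for j in range(3)
])
print(np.allclose(J_analytic, J_numeric, atol=1e-8))
```

Each column of the Jacobian sums to zero, reflecting that softmax outputs always sum to 1 no matter how a logit moves.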

Hessian & Second-Order Methods

Hessian H: Matrix of second partial derivatives. H_{ij} = ∂²J/∂θᵢ∂θⱼ
Positive definite H: At a point where ∇J = 0, all eigenvalues > 0 means a local minimum
Saddle point: Some eigenvalues positive, some negative — gradient = 0 but not a minimum. Common in deep networks.
Newton's method: θ ← θ − H⁻¹∇J. Quadratic convergence but O(n³) per step — impractical for large models.
L-BFGS: Approximate Hessian inverse. Used in LogisticRegression(solver='lbfgs').
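
On a quadratic loss the Hessian is constant, so a single Newton step lands on the minimum, while gradient descent needs many steps. The sketch below illustrates this on a hypothetical 2-parameter problem; the matrix `A`, vector `b`, start point, and learning rate are made up for the example.

```python
import numpy as np

# Hypothetical convex quadratic: J(theta) = 0.5 * theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])    # symmetric positive definite, equals the Hessian
b = np.array([1.0, -1.0])

def grad(theta):
    return A @ theta - b

theta0 = np.array([10.0, -7.0])
H = A                                      # Hessian of a quadratic is constant

# Newton step: solve H d = grad instead of forming the inverse explicitly
theta_newton = theta0 - np.linalg.solve(H, grad(theta0))

# Plain gradient descent from the same start, for comparison
theta_gd = theta0.copy()
for _ in range(20):
    theta_gd -= 0.1 * grad(theta_gd)

print(np.linalg.eigvalsh(H))              # all positive, so the critical point is a minimum
print(theta_newton, theta_gd)
```

Solving the linear system is the O(n³) cost mentioned above; quasi-Newton methods such as L-BFGS avoid it by building an approximation from recent gradients.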

//Gradient Descent Variants

Gradient descent is the core optimization algorithm for nearly every ML model. The variants differ in how much data they use per update and how they adapt the step size, so each trades gradient noise against compute per update differently.

Variant	Update Rule	Pros	Cons
Batch GD	θ ← θ − α·(1/m)Xᵀ(Xθ−y)	Stable, exact gradient, convex → global min	O(m) per step — slow for large datasets
Stochastic GD (SGD)	θ ← θ − α·∇J(θ; xᵢ, yᵢ) for single sample	Very fast updates, escapes saddle points	Noisy — never fully converges; needs LR schedule
Mini-Batch GD	θ ← θ − α·∇J(θ; batch of m_b samples)	GPU parallelism, stable, best of both worlds	Requires batch size tuning (typically 32–512)
Momentum	v ← βv + α∇J; θ ← θ − v	Faster convergence, damps oscillations	Extra hyperparameter β (typically 0.9)
Adam	Adaptive per-parameter learning rates (m, v estimates)	Robust default — works well across tasks	May not generalize as well as SGD with LR schedule for large models

Python — Gradient Descent from Scratch

import numpy as np

# Batch gradient descent for linear regression
def batch_gradient_descent(X, y, lr=0.01, n_iter=1000):
    m, n   = X.shape
    theta  = np.zeros(n)         # initialize weights
    losses = []

    for i in range(n_iter):
        y_pred = X @ theta          # forward pass
        error  = y_pred - y
        loss   = (1/(2*m)) * np.dot(error, error)  # cost J = (1/2m)·‖Xθ − y‖², i.e. half the MSE
        grad   = (1/m) * X.T @ error               # gradient ∇J
        theta  -= lr * grad         # parameter update
        losses.append(loss)

    return theta, losses

# Adam optimizer from scratch (illustrative)
def adam_step(grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1-b1) * grad          # 1st moment (mean)
    v = b2 * v + (1-b2) * grad**2       # 2nd moment (variance)
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    return update, m, v                  # subtract update from θ
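
The table also lists mini-batch GD and momentum, which the snippet above does not implement. The following sketch combines the two for linear regression using the update v ← βv + α∇J; θ ← θ − v from the table. The function name, hyperparameter defaults, and noiseless synthetic data are illustrative assumptions.

```python
import numpy as np

def minibatch_momentum_gd(X, y, lr=0.05, beta=0.9, batch_size=32, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    v = np.zeros(n)                                  # velocity
    for _ in range(n_epochs):
        order = rng.permutation(m)                   # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            grad = X[idx].T @ error / len(idx)       # gradient on the mini-batch only
            v = beta * v + lr * grad                 # v <- beta*v + alpha*grad
            theta = theta - v                        # theta <- theta - v
    return theta

rng = np.random.default_rng(1)
X_syn = rng.normal(size=(200, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y_syn = X_syn @ theta_true                           # noiseless synthetic targets
theta_hat = minibatch_momentum_gd(X_syn, y_syn)
print(theta_hat)
```

With noiseless targets the estimate should approach `theta_true`; on noisy data the iterates keep jittering around the optimum unless the learning rate is decayed.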

//Information Theory

Information theory quantifies uncertainty, surprise, and information content. It connects directly to loss functions: cross-entropy loss, entropy-based splitting in decision trees, and KL divergence in variational autoencoders and knowledge distillation. These concepts are not academic curiosities; they appear in the definition of most ML objectives.

Shannon Entropy

H(X) = −Σₓ p(x) log₂ p(x) [bits]

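Average surprise of outcomes. Pure distribution (p=0 or 1) → H=0. Maximally uncertain (p=0.5 binary) → H=1 bit. Entropy serves as an impurity measure: decision trees choose the split that maximizes Information Gain = H(parent) − Σₖ (nₖ/n)·H(childₖ), the parent entropy minus the size-weighted average entropy of the children.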
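
Information gain for a candidate split can be computed directly from class counts. The sketch below weights each child's entropy by its share of the samples; the helper names and the example counts are illustrative.

```python
import numpy as np

def entropy_bits(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()                # class proportions, dropping empty classes
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    n = sum(sum(c) for c in children)
    # Size-weighted average entropy of the child nodes
    weighted = sum(sum(c) / n * entropy_bits(c) for c in children)
    return entropy_bits(parent) - weighted

parent = [5, 5]                                        # 5 positives, 5 negatives
gain = information_gain(parent, [[4, 1], [1, 4]])      # imperfect split
perfect = information_gain(parent, [[5, 0], [0, 5]])   # pure children
print(f"gain={gain:.4f}, perfect={perfect:.4f}")
```

A split into pure children recovers the full 1 bit of parent entropy, while the imperfect split recovers only part of it.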

Cross-Entropy Loss

H(p, q) = −Σₓ p(x) log q(x)

The expected surprise when using model distribution q but true distribution is p. Minimizing cross-entropy = maximizing log-likelihood = making q as close to p as possible. The standard classification loss function. H(p,q) = H(p) + KL(p‖q).

KL Divergence

KL(P‖Q) = Σₓ P(x) log(P(x)/Q(x))

Measures how different Q is from P. Not symmetric: KL(P‖Q) ≠ KL(Q‖P)
Always ≥ 0 (Gibbs' inequality); = 0 iff P = Q
Used in: VAE regularization term (KL between posterior and prior), knowledge distillation, policy gradients (PPO, TRPO)
Forward KL (P‖Q): mean-seeking — Q spreads to cover all of P
Reverse KL (Q‖P): mode-seeking — Q collapses to a mode of P
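
Two of the properties stated above, the decomposition H(p,q) = H(p) + KL(p‖q) and the asymmetry of KL, can be confirmed numerically. This sketch uses natural logarithms throughout; the two distributions are arbitrary examples with no zero entries.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])     # hypothetical "true" distribution
Q = np.array([0.4, 0.4, 0.2])     # hypothetical model distribution

def entropy_nats(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Decomposition: H(P, Q) = H(P) + KL(P || Q)
print(cross_entropy(P, Q), entropy_nats(P) + kl(P, Q))

# Asymmetry: the two directions generally differ
print(kl(P, Q), kl(Q, P))
```

Since H(P) does not depend on the model, minimizing cross-entropy over Q is the same as minimizing KL(P‖Q).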

Mutual Information

I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)

How much knowing Y reduces uncertainty about X. Always ≥ 0.
Used in: feature selection (mutual_info_classif), causal discovery, representation learning
Information Gain (decision trees) = I(Feature; Class label) at each split
MINE estimator: Neural estimator of a lower bound on MI for high-dimensional data. Used in representation learning and closely related to the InfoNCE objective in contrastive learning
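
For discrete variables, mutual information can be computed exactly from the joint distribution. The sketch below evaluates it two ways, from the direct definition and from the entropy identity above; the 2×2 joint table is a made-up example.

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index X, columns index Y
joint = np.array([[0.30, 0.10],
                  [0.10, 0.50]])
px = joint.sum(axis=1)            # marginal of X
py = joint.sum(axis=0)            # marginal of Y

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Direct definition: sum of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
mi_direct = float(np.sum(joint * np.log2(joint / np.outer(px, py))))

# Entropy form: I(X;Y) = H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)
mi_entropy = H(px) - (H(joint.ravel()) - H(py))
print(mi_direct, mi_entropy)
```

If the joint table were the outer product of its marginals (independent variables), both expressions would give zero.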

Python — Information Theory & Distance Metrics

import numpy as np
from scipy.special import rel_entr
from scipy.spatial.distance import cdist
from sklearn.feature_selection import mutual_info_classif

# Shannon entropy
def entropy(p):
    p = p[p > 0]               # avoid log(0)
    return -np.sum(p * np.log2(p))

# KL divergence D_KL(P || Q), in nats (rel_entr uses the natural log)
def kl_divergence(P, Q):
    # rel_entr gives 0 where P == 0 and inf where Q == 0 but P > 0
    return np.sum(rel_entr(P, Q))

# Cross-entropy loss (binary)
def binary_cross_entropy(y_true, y_pred, eps=1e-9):
    y_pred = np.clip(y_pred, eps, 1-eps)   # numerical stability
    return -np.mean(y_true*np.log(y_pred) + (1-y_true)*np.log(1-y_pred))

# Distance metrics
X = np.array([[1,2,3],[4,5,6],[7,8,9]])
euclidean  = cdist(X, X, metric='euclidean')
manhattan  = cdist(X, X, metric='cityblock')
cosine_d   = cdist(X, X, metric='cosine')   # 1 - cosine_similarity

# Mutual information feature selection
mi_scores = mutual_info_classif(X_train, y_train, discrete_features=False)
top_k     = np.argsort(mi_scores)[:-11:-1]   # top 10 features

//Distance Metrics

Distance and similarity functions determine what "close" means for KNN, K-Means, SVMs, and embedding search. Choosing the right metric for your data type is as important as choosing the right algorithm — use the wrong metric and the model learns nothing meaningful.

Metric	Formula	Best Used For	Notes
Euclidean	√Σ(xᵢ−yᵢ)²	Low-dim continuous features; KNN, K-Means	Scale-sensitive, so standardize features first. Suffers in high dimensions.
Manhattan (L1)	Σ|xᵢ−yᵢ|	Grid-like spaces; sparse data; robust to outliers	Less affected by extreme values than L2. LASSO regularization uses L1 norm.
Cosine Similarity	x·y / (‖x‖‖y‖)	Text, document embeddings, high-dim sparse vectors	Ignores magnitude; only direction matters. Range [−1, 1]. 1 = identical direction.
Minkowski (Lₚ)	(Σ|xᵢ−yᵢ|ᵖ)^(1/p)	Generalization: p=1 is Manhattan, p=2 is Euclidean	p is a hyperparameter. In scikit-learn KNN, set it with KNeighborsClassifier(metric='minkowski', p=p).
Mahalanobis	√((x−y)ᵀ Σ⁻¹ (x−y))	Correlated features; anomaly detection; accounts for feature scale and correlation	Equivalent to Euclidean after whitening transform. Requires invertible covariance matrix.
Hamming	Fraction of positions that differ	Binary/categorical vectors; string comparison	Used for binary feature vectors, error detection, NLP tokenizer comparison.
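
Two notes in the table can be checked directly: Mahalanobis distance equals Euclidean distance after whitening, and Minkowski reduces to Manhattan and Euclidean at p=1 and p=2. The sketch below uses synthetic correlated data; the covariance values and seed are arbitrary, and whitening is done with a Cholesky factor, which is one of several valid choices.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis, minkowski

rng = np.random.default_rng(0)
# Correlated features on unequal scales
data = rng.multivariate_normal([0, 0], [[4.0, 1.8], [1.8, 1.0]], size=500)
u, v = data[0], data[1]

cov = np.cov(data, rowvar=False)
VI = np.linalg.inv(cov)                      # inverse covariance
d_mahal = mahalanobis(u, v, VI)

# Whitening: with cov = L L^T (Cholesky), map x -> L^{-1} x, then use plain Euclidean distance
L = np.linalg.cholesky(cov)
u_w, v_w = np.linalg.solve(L, u), np.linalg.solve(L, v)
d_white = np.linalg.norm(u_w - v_w)
print(np.isclose(d_mahal, d_white))

# Minkowski special cases
d_l1 = minkowski(u, v, p=1)
d_l2 = minkowski(u, v, p=2)
print(np.isclose(d_l1, np.abs(u - v).sum()), np.isclose(d_l2, np.linalg.norm(u - v)))
```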

//Regularization as Math

Regularization prevents overfitting by adding a penalty term to the loss function. Understanding the math behind it reveals why L1 induces sparsity and L2 shrinks but doesn't zero out — and connects to Bayesian priors.

L2 Regularization (Ridge / Weight Decay)

J(θ) = MSE(θ) + λ·‖θ‖₂²

Adds squared L2 norm of weights to the loss. Gradient update: θ ← θ(1−2αλ) − α·∇MSE. The factor (1−2αλ) shrinks weights by a fraction every step, hence "weight decay." Bayesian interpretation: Gaussian prior on θ. The penalty is smooth at θ=0 and its gradient 2λθ fades as θ shrinks, so weights get small but generally do not reach exactly zero.

L1 Regularization (Lasso)

J(θ) = MSE(θ) + λ·‖θ‖₁

Adds L1 norm. The penalty's (sub)gradient is λ·sign(θ), a constant magnitude regardless of how small θ is. This subtracts a fixed amount each step, driving small weights all the way to exactly zero (sparse solution). Bayesian interpretation: Laplace prior on θ, sharply peaked at zero with heavier tails than a Gaussian, encouraging sparsity. The geometry: the L1 ball has corners on the axes, and the loss contours tend to first touch it at a corner, where some coordinates are exactly zero.
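
The contrast between shrinking and zeroing shows up clearly in a one-dimensional simplification. The sketch below assumes the data-fit term is ½(θ − z)², where z stands for the unregularized estimate; this is a simplification of the MSE objectives above, chosen because both penalized problems then have closed-form minimizers. The function names and input values are illustrative.

```python
import numpy as np

def ridge_1d(z, lam):
    # argmin over theta of 0.5*(theta - z)^2 + lam*theta^2
    return z / (1.0 + 2.0 * lam)

def lasso_1d(z, lam):
    # argmin over theta of 0.5*(theta - z)^2 + lam*|theta|  (soft-thresholding)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([3.0, 0.8, -0.3, 0.05])     # hypothetical unregularized estimates
lam = 0.5
theta_ridge = ridge_1d(z, lam)           # every entry scaled down, none exactly zero
theta_lasso = lasso_1d(z, lam)           # entries with |z| <= lam become exactly zero
print("ridge:", theta_ridge)
print("lasso:", theta_lasso)
```

Ridge rescales every coefficient by the same factor, while Lasso subtracts a fixed amount and clips at zero, which is the sparsity mechanism described above.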

▸

Mathematics resources for ML: "Mathematics for Machine Learning" (Deisenroth, Faisal, Ong — free PDF at mml-book.com) · "Deep Learning" Chapter 2-4 (Goodfellow — deeplearningbook.org) · 3Blue1Brown's "Essence of Linear Algebra" and "Essence of Calculus" (YouTube — the best visual intuition available) · Gilbert Strang's MIT 18.06 Linear Algebra (free OCW lectures).

◈

Mathematical prerequisites by algorithm: Linear Regression — linear algebra + calculus + statistics. SVMs — convex optimization + kernel theory. Decision Trees — information theory. Neural Networks — calculus + linear algebra + probability. Clustering — geometry + statistics. Bayesian Methods — probability theory + conjugate priors. Understanding the math behind an algorithm tells you its failure modes before you run a single experiment.
