V1
DOC-ML-001
Field Handbook · Classic ML

Machine Learning Essentials

The foundational reference for classical machine learning — algorithms, evaluation metrics, and end-to-end model workflow. From linear regression to cross-validation, covered precisely.

Supervised Learning · Evaluation Metrics · Ensemble Methods · Clustering · Model Workflow · Hyperparameter Tuning
📈
01 // ALGORITHM

Linear Regression

// CONTINUOUS OUTPUT · SUPERVISED

Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a line (or hyperplane) that minimizes prediction error. It's the bedrock of supervised regression tasks and the conceptual foundation for many advanced methods.

Simple Linear Regression
Supervised

Predicts a continuous output by learning the best-fit line through training data. "Best fit" is defined as minimizing the sum of squared residuals (OLS). The result is a weight vector — one coefficient per feature plus a bias term.

Hypothesis
ŷ = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Cost Function (MSE)
J(β) = (1/2m) Σ (ŷᵢ − yᵢ)²  (the extra ½ only simplifies the gradient; the minimizer is the same as for plain MSE)
When to use
Continuous numeric output; linear relationship suspected
Key assumptions
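Linearity, independent errors, homoscedasticity, no severe multicollinearity, normally distributed residuals (needed mainly for inference)
Complexity
Train O(n·p²) to form the normal equations plus O(p³) to solve them; Predict O(p) per sample, where n = samples and p = features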
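As a rough illustration of the hypothesis and cost function above, the single-feature case has a closed-form OLS solution. The toy data below is made up for the example; this is a sketch of the idea, not how sklearn solves the problem internally.

```python
# Closed-form OLS for one feature: y_hat = b0 + b1 * x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # toy inputs (assumed)
ys = [3.1, 4.9, 7.2, 9.0, 10.8]   # toy targets (assumed)

m = len(xs)
x_mean = sum(xs) / m
y_mean = sum(ys) / m

# Slope = covariance(x, y) / variance(x); the intercept follows from the means
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
sxx = sum((x - x_mean) ** 2 for x in xs)
b1 = sxy / sxx
b0 = y_mean - b1 * x_mean

def cost(b0, b1):
    """J(beta) = (1/2m) * sum of squared residuals."""
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(f"b0={b0:.3f}, b1={b1:.3f}, J={cost(b0, b1):.4f}")
```

Nudging either coefficient away from the fitted values increases J, which is what "best fit" means here.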
Regularization Critical
  • Ridge (L2): Adds λ·Σβ² penalty — shrinks all coefficients, none to exactly zero. Use when all features may contribute.
  • Lasso (L1): Adds λ·Σ|β| penalty — drives some to zero. Built-in feature selection. Use when sparsity is expected.
  • ElasticNet: Combines L1 + L2. Best of both worlds for correlated features.
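To make the difference between the two penalties concrete, here is a minimal sketch for a single feature with no intercept. The ½ factors in the objectives are an assumption chosen so the closed forms stay tidy; libraries scale the penalty differently, so λ here is not directly comparable to sklearn's alpha.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_1d(xs, ys, lam):
    # Minimizes 0.5 * sum((y - b*x)^2) + 0.5 * lam * b^2
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def lasso_1d(xs, ys, lam):
    # Minimizes 0.5 * sum((y - b*x)^2) + lam * |b|
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam) / sxx

xs = [1.0, -1.0, 2.0, -2.0]   # toy feature (assumed)
ys = [0.3, -0.1, 0.5, -0.7]   # weak signal (assumed)

for lam in (0.0, 1.0, 5.0):
    print(lam, round(ridge_1d(xs, ys, lam), 4), round(lasso_1d(xs, ys, lam), 4))
```

Ridge shrinks the coefficient but keeps it non-zero for any finite λ; Lasso sets it to exactly zero once λ exceeds |Σxy|.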
Key Metrics Regression
  • MSE: Mean Squared Error — penalizes large errors heavily
  • RMSE: √MSE — same units as target variable
  • MAE: Mean Absolute Error — more robust to outliers
  • R²: Proportion of variance explained (at most 1; higher = better; can go negative on held-out data when the model does worse than predicting the mean)
  • Adj. R²: R² penalized for extra features
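The metrics listed above can be computed directly from their definitions. The sketch below uses plain Python and invented numbers; sklearn.metrics provides equivalent functions.

```python
import math

def regression_metrics(y_true, y_pred):
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)          # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mae": sum(abs(r) for r in residuals) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
good = regression_metrics(y_true, [2.0, 5.0, 8.0, 9.0])
bad = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])   # predictions reversed
print(good)
print(bad)
```

The reversed predictions give a negative R², since they fit worse than always predicting the mean of y_true.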
Python — sklearn
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Fit
model = LinearRegression()
model.fit(X_train, y_train)

# Predict & evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f} | R²: {r2:.3f}")

# With regularization
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
🎯
02 // ALGORITHM

Logistic Regression

// CLASSIFICATION · PROBABILISTIC OUTPUT

Despite the name, logistic regression is a classification algorithm. It passes the linear combination of inputs through a sigmoid function to output a probability between 0 and 1. The decision boundary is a hyperplane, making it a linear classifier.

Sigmoid Function
σ(z) = 1 / (1 + e⁻ᶻ) where z = β₀ + β₁x₁ + … + βₙxₙ
Loss (Binary Cross-Entropy)
L = −(1/m) Σ [ yᵢ log(ŷᵢ) + (1−yᵢ) log(1−ŷᵢ) ]
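A minimal sketch of the two formulas above in plain Python. The coefficients and data points are hypothetical, and the sigmoid is not hardened against overflow for very negative z.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

beta0, beta = -1.0, [2.0, -0.5]            # hypothetical fitted coefficients
X = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.0]]   # toy feature rows
y = [1, 0, 1]

# z = beta0 + beta . x for each row, then squash to a probability
probs = [sigmoid(beta0 + sum(b * x for b, x in zip(beta, row))) for row in X]
loss = binary_cross_entropy(y, probs)
print([round(p, 3) for p in probs], round(loss, 4))
```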
Binary Classification

Two classes: output probability ≥ 0.5 → Class 1, else Class 0. Threshold is tunable — lower it for higher recall (catching more positives), raise it for higher precision.

Multi-class Extensions
  • One-vs-Rest (OvR): Train k binary classifiers
  • Softmax Regression: Generalization using softmax activation — outputs sum to 1 across all classes
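  • In sklearn: LogisticRegression(multi_class='multinomial'); recent releases deprecate this parameter and fit the multinomial (softmax) model by default for multi-class targets, so check the docs for your version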
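The softmax generalization can be sketched in a few lines. Subtracting the maximum score before exponentiating is a standard stability trick; the score values are arbitrary.

```python
import math

def softmax(scores):
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class_scores = [2.0, 1.0, -0.5]           # one linear score per class (toy values)
probs = softmax(class_scores)
print([round(p, 3) for p in probs], round(sum(probs), 6))

# With two classes and scores [z, 0], softmax reduces to the sigmoid
z = 1.3
two_class = softmax([z, 0.0])
```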
Assumptions: Logistic regression assumes little or no multicollinearity among features, a linear relationship between the features and the log-odds, independent observations, and a large enough sample (a common rule of thumb is ~10 observations of the rarer class per feature). Violations tend to bias coefficients and miscalibrate probabilities rather than break the model outright.
🌿
03 // ALGORITHM

Decision Trees

// NON-LINEAR · INTERPRETABLE · RECURSIVE SPLITTING

Decision trees recursively partition the feature space into regions. At each internal node, the best feature and threshold are chosen to maximally reduce impurity (for classification) or variance (for regression). The result is a flowchart-like model — highly interpretable but prone to overfitting.

Gini Impurity

Measures probability of misclassifying a randomly chosen element. G = 1 − Σ pᵢ². Ranges 0 (pure) to 0.5 (maximally impure for binary). Default in sklearn.

Entropy / Info Gain

H = −Σ pᵢ log₂(pᵢ). Information Gain = parent entropy − weighted child entropy. Splits that create pure child nodes maximize gain.
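Both impurity measures follow directly from the class proportions in a node. A small sketch with made-up labels:

```python
import math
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
print(gini(parent), entropy(parent))                 # maximally impure binary node
print(information_gain(parent, [0, 0], [1, 1]))      # split into two pure children
print(information_gain(parent, [0, 1], [0, 1]))      # split that separates nothing
```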

Pruning

Reduce tree depth to prevent overfitting. Pre-pruning: max_depth, min_samples_split, min_samples_leaf. Post-pruning: cost-complexity pruning (ccp_alpha in sklearn).

Overfitting risk: An unconstrained tree will memorize training data — achieving 100% training accuracy while generalizing poorly. Always set max_depth or min_samples_leaf. Use cross-validation to find optimal depth. Trees are the base learner for Random Forests and Gradient Boosting.
CART Algorithm (Classification & Regression Trees)
Classification / Regression

At each split, evaluates all features and all thresholds to find the one that minimizes the weighted impurity of the two child nodes. Produces a binary tree (each split has exactly 2 branches).

Training Complexity
O(n · p · log n) approximately
Key Hyperparameters
max_depth, min_samples_split, min_samples_leaf, max_features
Strengths
No scaling needed, handles mixed feature types (sklearn's trees still need categoricals encoded as numbers), interpretable, captures non-linearity
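The split search described above can be sketched for a single numeric feature. This is a simplified illustration with hypothetical data: a full CART implementation repeats the search over every feature and then recurses into each child node.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                      # no threshold separates equal values
        threshold = (lo + hi) / 2.0       # midpoint between neighbouring values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

ages = [22, 25, 31, 38, 45, 52]           # hypothetical feature
bought = [0, 0, 0, 1, 1, 1]               # hypothetical labels
threshold, impurity = best_split(ages, bought)
print(threshold, impurity)
```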
🌳
04 // ALGORITHM

Random Forests

// ENSEMBLE · BAGGING · HIGH VARIANCE REDUCTION

Random Forests are ensemble methods that aggregate many decision trees trained on random subsets of data and features. By averaging diverse trees, they dramatically reduce the variance (overfitting) of individual trees while maintaining low bias. One of the most reliable off-the-shelf algorithms in classical ML.

How It Works
  • Bootstrap sampling: Each tree is trained on n samples drawn with replacement, which covers ≈63% of the unique training points
  • Feature randomness: At each split, only a random subset of features (√p for classification) is considered
  • Aggregation: Classification → majority vote; Regression → mean prediction
  • Out-of-bag (OOB): Samples not in bootstrap = free validation set
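The ≈63% figure and the out-of-bag set can be checked with a quick simulation. The sample size and seed below are arbitrary choices for the sketch.

```python
import random

rng = random.Random(42)        # fixed seed so the run is repeatable
n = 10_000

# One bootstrap sample: n draws with replacement from indices 0..n-1
bootstrap = [rng.randrange(n) for _ in range(n)]
in_bag = set(bootstrap)

# Out-of-bag rows were never drawn, so this tree has not seen them
oob = [i for i in range(n) if i not in in_bag]

unique_fraction = len(in_bag) / n
oob_fraction = len(oob) / n
print(f"unique in-bag: {unique_fraction:.3f}, out-of-bag: {oob_fraction:.3f}")
```

The in-bag fraction approaches 1 − 1/e ≈ 0.632 as n grows, which leaves roughly 37% of rows per tree for the OOB estimate.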
Feature Importance

Measures how much each feature reduces impurity across all trees. Provides a ranked list of predictive power. Useful for feature selection, but can be biased toward high-cardinality or continuous features. Use SHAP values for more reliable attribution in production.

Python — Random Forest
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

rf = RandomForestClassifier(
    n_estimators=200,      # more trees → more stable (diminishing returns ~200–500)
    max_depth=None,        # None = grow fully, then average out variance
    max_features='sqrt',   # √p features per split (classification default)
    oob_score=True,        # free validation estimate
    n_jobs=-1,             # parallelize across all cores
    random_state=42
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}")

# Feature importances
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances.sort_values(ascending=False).head(10)
🔍
05 // ALGORITHM

K-Nearest Neighbors

// INSTANCE-BASED · NON-PARAMETRIC · LAZY LEARNING

KNN makes predictions by finding the k most similar training examples to a query point and aggregating their labels. There is no training phase — the model is the data itself. Simple, interpretable, and surprisingly powerful, but slow at prediction time for large datasets.

K-Nearest Neighbors
Classification / Regression

For a new point, compute distance to all training points, find k nearest, then: classification = majority class vote; regression = mean of k neighbors' values. Distance metric and k are the critical choices.

Distance Metrics
Euclidean (default), Manhattan, Minkowski, Cosine (for text)
Choosing k
Small k → complex boundary (overfit); Large k → smooth (underfit). Use odd k for binary to avoid ties. Tune via CV.
Complexity
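Train O(1); Predict O(n·p) per query with brute force. KD-Tree or Ball-Tree indexes cut this to roughly O(p·log n) in low dimensions, at the cost of building the index first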
Feature scaling is mandatory. KNN is distance-based. A feature with range 0–1000 will dominate one with range 0–1. Always apply StandardScaler or MinMaxScaler before KNN. Also susceptible to the curse of dimensionality — performance degrades in high dimensions.
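To see why scaling matters, the sketch below runs the same 3-NN vote on raw and standardized features. The data is invented so that income (large range) carries no signal and a 0 to 1 score determines the class; the helper names are illustrative only.

```python
import math
from collections import Counter

def standardize(rows):
    """Return z-scored rows plus the per-column means and stds used."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
    return scaled, means, stds

def knn_predict(train_X, train_y, query, k=3):
    # Euclidean distance to every training point, then a majority vote
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Columns: [income, score]. The class depends on score only.
train_X = [[30000, 0.10], [50000, 0.20], [80000, 0.15],
           [31000, 0.90], [52000, 0.80], [79000, 0.85]]
train_y = ["A", "A", "A", "B", "B", "B"]
query = [79200, 0.12]

raw_pred = knn_predict(train_X, train_y, query)

scaled_X, means, stds = standardize(train_X)
scaled_query = [(v - m) / s for v, m, s in zip(query, means, stds)]   # reuse train stats
scaled_pred = knn_predict(scaled_X, train_y, scaled_query)

print("raw:", raw_pred, "| scaled:", scaled_pred)
```

On raw features the income column dominates the distance and the vote goes to the wrong class; after standardizing, the low score pulls the query toward class A.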
⚔️
06 // ALGORITHM

SVM Basics

// MAX MARGIN CLASSIFIER · KERNEL TRICK

Support Vector Machines find the maximum-margin hyperplane — the decision boundary that maximizes the distance to the nearest data points (support vectors) from each class. SVMs are powerful for high-dimensional data and work well even when dimensions exceed samples.

Hard vs Soft Margin
  • Hard margin: Requires perfect separation — no points inside the margin. Only works for linearly separable data.
  • Soft margin (C param): Allows some misclassifications. High C = narrow margin, fewer training errors (risk of overfitting). Low C = wide margin, more training errors tolerated (stronger regularization, usually more robust).
The Kernel Trick
  • Maps data to higher-dimensional space implicitly — without computing the transformation explicitly
  • Linear: Use for high-dimensional/text data
  • RBF (Gaussian): Default; works for most non-linear problems. Tune C and γ
  • Polynomial: For polynomial relationships
Kernel | Formula | Use When
Linear | K(x,z) = xᵀz | High-dim, linearly separable (text/NLP)
RBF / Gaussian | K(x,z) = exp(−γ‖x−z‖²) | General non-linear; most common default
Polynomial | K(x,z) = (xᵀz + c)ᵈ | Known polynomial structure in features
Sigmoid | K(x,z) = tanh(αxᵀz + c) | Neural-net-like; rarely used
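The kernel formulas above are short enough to write out, and the polynomial case shows what "implicitly" means: the kernel value equals a dot product in an expanded feature space that is never built. The explicit map below is the standard expansion for 2-D inputs with c = 1, d = 2; the input vectors are arbitrary.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    return dot(x, z)

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def polynomial_kernel(x, z, c=1.0, d=2):
    return (dot(x, z) + c) ** d

def explicit_quadratic_map(x):
    """Feature map phi for 2-D inputs such that phi(x).phi(z) = (x.z + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0]

x, z = [1.0, 2.0], [3.0, -1.0]
implicit = polynomial_kernel(x, z)                                    # 2-D arithmetic only
explicit = dot(explicit_quadratic_map(x), explicit_quadratic_map(z))  # 6-D dot product
print(implicit, explicit, rbf_kernel(x, z))
```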
🔮
07 // ALGORITHM

Clustering — K-Means

// UNSUPERVISED · PARTITIONAL · ITERATIVE

K-Means partitions data into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids as the mean of assigned points. It's an unsupervised algorithm — no labels needed. Convergence is guaranteed but may find a local minimum.

🎲
Step 01
Initialize
k centroids (k-means++ default)
📍
Step 02
Assign
Each point → nearest centroid
📐
Step 03
Update
Centroids = cluster means
🔄
Step 04
Repeat
Until centroids stop moving
Step 05
Converged
Output k cluster labels
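The assign/update loop above can be sketched in plain Python. To keep the example deterministic it takes caller-supplied starting centroids instead of implementing k-means++; the points are toy data.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    labels = []
    for _ in range(max_iter):
        # Assign: index of the nearest centroid for every point
        labels = [min(range(len(centroids)),
                      key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update: each centroid becomes the mean of its assigned points
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_centroids.append(old)          # keep an empty cluster's centroid
                continue
            new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
        if new_centroids == centroids:             # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
labels, centroids = kmeans(points, centroids=[[0.0, 0.0], [10.0, 10.0]])
print(labels, centroids)
```

A different starting pair can converge to a different partition, which is why implementations rerun the algorithm from several initializations and keep the lowest-inertia result.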
Choosing K Key Decision
  • Elbow method: Plot inertia (within-cluster sum of squares) vs k. The "elbow" is where adding more clusters yields diminishing returns.
  • Silhouette score: Measures how similar a point is to its own cluster vs neighboring clusters. Range −1 to 1; higher is better.
  • Gap statistic: Compares inertia to a null reference distribution.
Limitations
  • Assumes spherical, equally-sized clusters (fails on elongated or ring shapes)
  • Sensitive to outliers: because centroids are means, a single extreme point can drag a centroid away from its cluster or end up as a cluster of its own
  • Must specify k in advance
  • Results vary by initialization — run multiple times (n_init=10)
  • Alternatives: DBSCAN (density, arbitrary shapes), Gaussian Mixture Models (probabilistic)
08 // METRICS

Accuracy

// OVERALL CORRECTNESS · BASELINE METRIC

Accuracy measures the fraction of predictions the model got right. It's the most intuitive metric but can be deeply misleading on imbalanced datasets. A model that always predicts "Not Fraud" on a dataset with 99% non-fraud achieves 99% accuracy while being completely useless.

Accuracy
Number of correct predictions divided by total predictions. Works well only when classes are balanced and all error types have equal cost.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy Paradox: On imbalanced data (e.g., 95:5 class split), a model predicting only the majority class achieves 95% accuracy. Always report precision, recall, and F1 alongside accuracy. For imbalanced tasks, accuracy alone is a lie.
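The paradox is easy to reproduce with a 95:5 split and a model that never predicts the positive class. The labels are synthetic.

```python
y_true = [1] * 5 + [0] * 95      # 5 positives, 95 negatives
y_pred = [0] * 100               # always predict the majority class

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy looks strong while recall shows that not a single positive was found.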
⚖️
09 // METRICS

Precision / Recall

// THE FUNDAMENTAL TRADEOFF

Precision and Recall measure complementary aspects of a classifier's performance and usually trade off against each other. Adjusting the decision threshold moves you along the precision-recall curve: raising the threshold tends to increase precision and lower recall, and lowering it does the reverse.

Precision
Of all items the model predicted positive, what fraction actually were positive? High precision = few false alarms.
Precision = TP / (TP + FP)
Recall
Of all items that actually were positive, what fraction did the model catch? High recall = few misses.
Recall = TP / (TP + FN)
Optimize for Precision
  • Spam detection — false positive (blocking real email) is costly
  • Recommendation systems — irrelevant suggestions hurt UX
  • Content moderation — wrongly banning users is severe
vs
Optimize for Recall
  • Cancer screening — missing a true case is catastrophic
  • Fraud detection — missing a fraud is more costly than a false alarm
  • Security alerts — missing a real threat is unacceptable
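A small threshold sweep shows the tradeoff in numbers. The scores and labels are invented, and returning precision 1.0 when nothing is predicted positive is a convention assumed for this sketch.

```python
def precision_recall(y_true, scores, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0   # assumed convention for no positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]   # model probabilities (toy)

for threshold in (0.3, 0.5, 0.85):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```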
🎵
10 // METRICS

F1 Score

// HARMONIC MEAN · BALANCED METRIC

F1 is the harmonic mean of Precision and Recall. It punishes extreme imbalances — if either precision or recall is near zero, F1 will be low regardless of the other. The go-to single metric for imbalanced classification problems.

F1 Score
Harmonic mean penalizes cases where one metric is very high and the other very low. A model with P=1.0, R=0.01 gets F1=0.02 — appropriately poor.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F-beta Score Generalized

Fβ = (1+β²) × (P×R) / (β²×P + R). When β < 1, weights precision more. When β > 1, weights recall more. F2 (β=2) used in retrieval where missing items is costly.
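The formula translates directly into code. The precision and recall values below are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0                       # avoid 0/0 when both are zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Extreme imbalance between the two metrics is punished
print(round(f_beta(1.0, 0.01), 4))

# Same classifier (high precision, lower recall) under different betas
p, r = 0.9, 0.5
f_half, f_one, f_two = f_beta(p, r, 0.5), f_beta(p, r, 1.0), f_beta(p, r, 2.0)
print(round(f_half, 3), round(f_one, 3), round(f_two, 3))
```

Because recall is the weaker metric in this example, the recall-weighted F2 is the lowest of the three and the precision-weighted F0.5 is the highest.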

Macro vs Weighted F1
  • Macro F1: Unweighted average across classes — treats all classes equally
  • Weighted F1: Average weighted by class frequency — useful when class imbalance matters
  • Micro F1: Computed from global TP/FP/FN counts; equals accuracy whenever every sample gets exactly one label (binary or multi-class)
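The three averages differ only in how per-class counts are combined. A sketch on a made-up three-class problem, assuming every class of interest appears in y_true:

```python
from collections import Counter

def per_class_counts(y_true, y_pred, cls):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def averaged_f1(y_true, y_pred):
    classes = sorted(set(y_true))
    support = Counter(y_true)
    counts = {c: per_class_counts(y_true, y_pred, c) for c in classes}
    f1s = {c: f1_from_counts(*counts[c]) for c in classes}
    macro = sum(f1s.values()) / len(classes)
    weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)
    micro = f1_from_counts(*(sum(v) for v in zip(*counts.values())))
    return macro, weighted, micro

y_true = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
y_pred = ["a"] * 5 + ["b"] + ["b"] * 2 + ["a"] + ["a"]

macro, weighted, micro = averaged_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(macro, 3), round(weighted, 3), round(micro, 3), accuracy)
```

The rare class "c" is never predicted correctly, which drags macro F1 well below the weighted and micro averages.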
🔲
11 // METRICS

Confusion Matrix

// FULL ERROR BREAKDOWN · BINARY CLASSIFICATION

The confusion matrix gives a complete breakdown of prediction outcomes — it shows not just how many were wrong, but how they were wrong. From these four numbers, all other classification metrics can be derived.

                | Predicted Positive          | Predicted Negative
Actual Positive | TP: True Positive           | FN: False Negative (Type II)
Actual Negative | FP: False Positive (Type I) | TN: True Negative
Error Type Reference
  • TP: Correctly predicted positive — model got it right
  • TN: Correctly predicted negative — model got it right
  • FP (Type I): Predicted positive, actually negative — false alarm
  • FN (Type II): Predicted negative, actually positive — missed case
Derived Metrics
  • Sensitivity / Recall / TPR: TP / (TP+FN)
  • Specificity / TNR: TN / (TN+FP)
  • Fall-out / FPR: FP / (FP+TN)
  • Balanced Accuracy: (Sensitivity + Specificity) / 2
  • MCC: Matthews Correlation Coefficient = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)); uses all four cells, so it stays informative on imbalanced binary tasks
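All of the derived metrics come from the same four counts. The counts below are illustrative, and returning 0 for MCC when the denominator is zero is a common convention assumed here.

```python
import math

def derived_metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0   # assumed convention
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fpr,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": mcc,
    }

model = derived_metrics(tp=40, fn=10, fp=5, tn=945)      # a reasonable classifier
majority = derived_metrics(tp=0, fn=50, fp=0, tn=950)    # always predicts negative
print(model)
print(majority)
```

The always-negative baseline scores 95% accuracy on these counts, yet its balanced accuracy is 0.5 and its MCC is 0.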
📊
12 // METRICS

ROC-AUC

// THRESHOLD-INDEPENDENT · RANKING QUALITY

The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate across all decision thresholds. AUC (Area Under the Curve) summarizes performance into a single number — representing the probability that the model ranks a random positive example higher than a random negative.

AUC Interpretation
AUC = probability that the model assigns higher score to a randomly chosen positive than a randomly chosen negative. Threshold-independent — compares model quality regardless of operating point.
AUC = 1.0 → Perfect classifier
AUC = 0.9 → Excellent
AUC = 0.7 → Acceptable
AUC = 0.5 → Random guessing
ROC vs PR Curve: ROC-AUC can look optimistic on highly imbalanced datasets because the False Positive Rate divides by the (large) number of negatives, so even many false positives barely move it. For imbalanced tasks, prefer the Precision-Recall AUC: precision and recall both ignore true negatives, which gives a truer picture of rare-event detection quality.
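The ranking interpretation of AUC can be computed literally by comparing every positive score with every negative score. This O(P·N) sketch is for illustration on toy data; library implementations use a faster sort-based method.

```python
def roc_auc(y_true, scores):
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
auc = roc_auc(y_true, scores)
print(auc)
```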
🏗️
13 // WORKFLOW

Training Pipeline

// END-TO-END · DATA TO MODEL

A production ML training pipeline is more than fitting a model — it's a reproducible, versioned sequence from raw data to deployable artifact. Every step must be trackable and independently testable.

📦
Phase 1
Data Ingestion
Load, validate schema
🧹
Phase 2
Preprocessing
Missing vals, outliers
⚙️
Phase 3
Feature Engineering
Encode, scale, derive
✂️
Phase 4
Train/Val/Test Split
80 / 10 / 10 typical
🎯
Phase 5
Fit Model
Train on training set
📊
Phase 6
Evaluate
Metrics on holdout
Python — sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Numeric preprocessing sub-pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical preprocessing sub-pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, cat_features),
])

# Full pipeline: preprocessing + model
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42))
])

clf.fit(X_train, y_train)  # preprocessor is fit on the training data only, so no leakage
score = clf.score(X_test, y_test)
No data leakage: Always fit preprocessing (scalers, imputers, encoders) on training data only, then transform validation and test sets. An sklearn Pipeline handles this for you as long as the whole pipeline is what you pass to fit or to cross-validation, because each step is then fit only on the training portion. Fitting on the full dataset before splitting is one of the most common and damaging mistakes in ML.
🔬
14 // WORKFLOW

Validation

// BIAS-VARIANCE · OVERFITTING · GENERALIZATION

Validation is the practice of estimating how well a model will generalize to unseen data. The central challenge is the bias-variance tradeoff — a model complex enough to learn the training data well may not generalize, and one too simple misses the signal entirely.

Overfitting (High Variance)
  • Low training error, high validation error — memorizing rather than learning
  • Symptoms: Training accuracy ≫ validation accuracy
  • Fixes: More training data, regularization, simpler model, dropout, early stopping, data augmentation
Underfitting (High Bias)
  • High training error and high validation error — model too simple to capture patterns
  • Symptoms: Both training and validation accuracy are poor
  • Fixes: More complex model, more features, reduce regularization, train longer

Train / Validation / Test Split

Three-Way Split Strategy
  • Training set (60–80%): Model learns from this. Fit all parameters here.
  • Validation set (10–20%): Tune hyperparameters. Select architecture. Compare models. Do NOT use for final evaluation.
  • Test set (10–20%): Touched once, at the very end. The unbiased estimate of production performance. Peeking contaminates it.
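A three-way split can be sketched as a single shuffle of row indices followed by two cuts. The 80/10/10 fractions and the seed are example values; for imbalanced classification a stratified split is usually preferable, which this sketch does not do.

```python
import random

def three_way_split(n_samples, val_frac=0.1, test_frac=0.1, seed=42):
    """Return disjoint (train, val, test) index lists covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)    # shuffle once, reproducibly
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

Set the test indices aside immediately; all tuning decisions should use only the train and validation rows.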
🎛️
15 // WORKFLOW

Hyperparameter Tuning

// SEARCH STRATEGIES · OPTIMIZATION

Hyperparameters are the knobs set before training — they control the learning process itself (e.g., learning rate, tree depth, regularization strength). Unlike model parameters, they are not learned from data and must be searched or tuned externally.

Grid Search

Exhaustive search over all combinations of a pre-defined hyperparameter grid. Guaranteed to find the best combination in the grid, but computationally expensive — cost grows multiplicatively with parameters.

Random Search

Randomly samples combinations from the hyperparameter space. At the same budget it often matches or beats grid search, because usually only a few hyperparameters matter and random search tries more distinct values of each of them. Use when budget-constrained.

Bayesian Optimization

Builds a probabilistic model of the objective function and uses it to select the most promising next configuration. Usually needs fewer trials than grid or random search to reach a comparable score, which matters most when each trial is expensive. Libraries: Optuna, Hyperopt, BayesianOptimization.

Python — Tuning Strategies
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from scipy.stats import randint, uniform
import optuna

# Base estimator shared by the grid and random searches
rf = RandomForestClassifier(random_state=42)

# --- Grid Search ---
param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, None],
    'min_samples_leaf': [1, 5, 10],
}
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)

# --- Random Search (more efficient for large spaces) ---
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 30),
    'max_features': uniform(0.1, 0.9),   # uniform(loc, scale) samples from [0.1, 1.0]
}
rand_search = RandomizedSearchCV(rf, param_dist, n_iter=50, cv=5,
                                 scoring='f1', n_jobs=-1, random_state=42)
rand_search.fit(X_train, y_train)

# --- Bayesian Optimization with Optuna ---
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 30),
        'max_features': trial.suggest_float('max_features', 0.1, 1.0),
    }
    model = RandomForestClassifier(**params, random_state=42)
    return cross_val_score(model, X_train, y_train, cv=5, scoring='f1').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
🔁
16 // WORKFLOW

Cross-Validation

// ROBUST ESTIMATION · EVERY SAMPLE AS VALIDATION

Cross-validation provides a more reliable estimate of model performance than a single train/val split by rotating which portion of data is used for validation. It uses every sample for both training and validation, giving a lower-variance estimate of generalization error.

1️⃣
Fold 1
Val: Fold 1
Train: 2,3,4,5
2️⃣
Fold 2
Val: Fold 2
Train: 1,3,4,5
3️⃣
Fold 3
Val: Fold 3
Train: 1,2,4,5
📐
Result
Average Scores
Mean ± Std across folds
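The fold rotation pictured above can be sketched as an index generator. This version does not shuffle or stratify, so it assumes the rows are already in random order; the per-fold scores at the end are placeholder numbers, not real results.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def mean_std(scores):
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, var ** 0.5

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("val:", val_idx, "train:", train_idx)

fold_scores = [0.86, 0.88, 0.84, 0.90, 0.87]   # placeholder per-fold scores
mean, std = mean_std(fold_scores)
print(f"{mean:.3f} ± {std:.3f}")
```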
CV Variants
  • k-Fold CV: Split into k equal folds, rotate validation fold. k=5 or k=10 are standard.
  • Stratified k-Fold: Preserves class proportions in each fold. Use for classification, especially imbalanced.
  • Leave-One-Out (LOO): k = n. Low bias, very high variance + expensive. Use only for tiny datasets.
  • Time-Series CV: Always train on past, validate on future. Never shuffle time series data.
  • Repeated k-Fold: Run k-fold multiple times with different shuffles — more reliable estimate.
Interpreting CV Results
  • Mean score: Expected performance on unseen data
  • Standard deviation: Stability of the model — high std = unstable; consider more data or simpler model
  • Rule of thumb: Report mean ± std. "F1 = 0.87 ± 0.03 (5-fold CV)" is a complete, honest result.
  • Final model: After CV-based selection, retrain on ALL training data, evaluate once on test set.
Python — Cross-Validation Patterns
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, cross_validate,
    TimeSeriesSplit, RandomizedSearchCV
)
import numpy as np

# Standard 5-fold stratified CV
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=skf, scoring='f1', n_jobs=-1)
print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Multiple metrics in one pass
cv_results = cross_validate(
    clf, X_train, y_train, cv=skf,
    scoring=['f1', 'roc_auc', 'precision', 'recall'],
    return_train_score=True   # detect overfitting: train >> val?
)

# Time-series: always train past, validate future
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(clf, X, y, cv=tscv, scoring='roc_auc')

# Nested CV for unbiased hyperparameter tuning + evaluation
# Outer loop: estimate generalization error
# Inner loop: hyperparameter search
# param_dist keys must match clf's parameter names
# (prefix them with 'classifier__' if clf is a Pipeline)
inner_cv = StratifiedKFold(n_splits=3)
outer_cv = StratifiedKFold(n_splits=5)
search = RandomizedSearchCV(clf, param_dist, cv=inner_cv, n_iter=20)
nested = cross_val_score(search, X_train, y_train, cv=outer_cv, scoring='f1')
print(f"Nested CV F1: {nested.mean():.3f} ± {nested.std():.3f}")
Algorithm comparison checklist: When comparing two models with CV scores, consider the variance. If Model A achieves F1 = 0.85 ± 0.08 and Model B achieves F1 = 0.84 ± 0.02, Model B may be the better production choice despite a lower mean, because its score is more stable across folds and a 0.01 gap is well within Model A's fold-to-fold spread. Use paired statistical tests (e.g., Wilcoxon signed-rank) on the per-fold scores when differences are small.
Key Resources: scikit-learn User Guide · Hastie et al. "Elements of Statistical Learning" · ISLR (James et al., free PDF) · Bishop "Pattern Recognition and ML" · fast.ai Practical Deep Learning (for applied ML/DL) · Kaggle — practice on real datasets with real metric leaderboards.