Machine Learning
The foundational reference for classical machine learning — algorithms, evaluation metrics, and end-to-end model workflow. From linear regression to cross-validation, covered precisely.
Linear Regression
Linear regression models the relationship between a dependent variable y and one or more independent variables X by fitting a line (or hyperplane) that minimizes prediction error. It's the bedrock of supervised regression tasks and the conceptual foundation for many advanced methods.
Linear regression predicts a continuous output by learning the best-fit line through the training data. "Best fit" means minimizing the sum of squared residuals (ordinary least squares, OLS). The result is a weight vector: one coefficient per feature, plus a bias term.
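As a minimal sketch of the least-squares fit, the single-feature case has a closed form: slope = covariance(x, y) / variance(x). The function name `fit_simple_ols` and the toy data are illustrative assumptions, not part of any library.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # toy data lying exactly on y = 2x + 1
slope, intercept = fit_simple_ols(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

With several features the same idea is usually solved with linear algebra (normal equations or a least-squares solver) rather than by hand.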
- Ridge (L2): Adds λ·Σβ² penalty — shrinks all coefficients, none to exactly zero. Use when all features may contribute.
- Lasso (L1): Adds λ·Σ|β| penalty; drives some coefficients to exactly zero. Built-in feature selection. Use when sparsity is expected.
- ElasticNet: Combines the L1 and L2 penalties. Keeps Lasso's sparsity while handling groups of correlated features more stably.
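The difference between shrinking and zeroing can be seen in the one-coefficient case. The sketch below assumes a single centered feature with no intercept, with the penalties written as λ·β² and λ·|β| added to the sum of squared residuals. `sxy` and `sxx` stand for Σxy and Σx², and the function names are illustrative.

```python
def ridge_slope(sxy, sxx, lam):
    # Minimizes sum((y - b*x)^2) + lam * b^2: shrinks, but never reaches zero
    # unless sxy itself is zero.
    return sxy / (sxx + lam)


def lasso_slope(sxy, sxx, lam):
    # Minimizes sum((y - b*x)^2) + lam * |b|: soft-thresholding, which
    # returns exactly zero once lam / 2 >= |sxy|.
    magnitude = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * magnitude / sxx


sxy, sxx = 10.0, 5.0                      # unpenalized slope is 10 / 5 = 2.0
print(ridge_slope(sxy, sxx, lam=5.0))     # 1.0
print(lasso_slope(sxy, sxx, lam=4.0))     # 1.6
print(lasso_slope(sxy, sxx, lam=30.0))    # 0.0
```

Library implementations often scale the loss or penalty differently (for example by 1/(2n)), so the λ values here are not directly comparable to a library's regularization parameter.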
- MSE: Mean Squared Error — penalizes large errors heavily
- RMSE: √MSE — same units as target variable
- MAE: Mean Absolute Error — more robust to outliers
- R²: Proportion of variance explained (at most 1; higher = better). Can be negative on held-out data when the model is worse than predicting the mean
- Adj. R²: R² penalized for extra features
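The metrics above can be computed directly from their definitions. This is a plain-Python sketch; the function name `regression_metrics` and the sample values are assumptions made for illustration.

```python
import math


def regression_metrics(y_true, y_pred, n_features):
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    # R² compares the residual sum of squares to that of a mean-only model.
    mean_y = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R² penalizes each extra feature (requires n > n_features + 1).
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


metrics = regression_metrics(
    y_true=[3.0, 5.0, 7.0, 9.0],
    y_pred=[2.0, 5.0, 8.0, 9.0],
    n_features=1,
)
print(metrics)  # mse 0.5, mae 0.5, r2 0.9, adj_r2 0.85 (up to float rounding)
```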
Logistic Regression
Despite the name, logistic regression is a classification algorithm. It passes the linear combination of inputs through a sigmoid function to output a probability between 0 and 1. The decision boundary is a hyperplane, making it a linear classifier.
Two classes: output probability ≥ 0.5 → Class 1, else Class 0. Threshold is tunable — lower it for higher recall (catching more positives), raise it for higher precision.
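A small sketch of the sigmoid and the tunable threshold. The weights and bias are hypothetical values chosen for illustration, not a trained model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    # Linear combination of inputs passed through the sigmoid.
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict_class(prob, threshold=0.5):
    return 1 if prob >= threshold else 0


weights, bias = [1.5, -0.5], -1.0              # hypothetical coefficients
p = predict_proba(weights, bias, [0.5, 1.0])   # z = -0.75, p is about 0.32

print(predict_class(p))                 # 0 at the default threshold
print(predict_class(p, threshold=0.3))  # 1: a lower threshold catches more positives
```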
- One-vs-Rest (OvR): Train k binary classifiers
- Softmax Regression: Generalization using softmax activation; outputs sum to 1 across all classes (multi_class='multinomial' in sklearn; newer releases deprecate this parameter and use the multinomial formulation by default)
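For the softmax case, a minimal sketch of the activation itself. Subtracting the maximum score before exponentiating is a common numerical-stability step and does not change the result; the scores are made-up values.

```python
import math


def softmax(scores):
    # Shift by the max so exp() never overflows; the output is unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])   # one raw score per class
predicted_class = probs.index(max(probs))
print(predicted_class)  # 0
```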
Decision Trees
Decision trees recursively partition the feature space into regions. At each internal node, the best feature and threshold are chosen to maximally reduce impurity (for classification) or variance (for regression). The result is a flowchart-like model — highly interpretable but prone to overfitting.
Gini impurity: Measures the probability of misclassifying a randomly chosen element if it were labeled according to the node's class distribution. G = 1 − Σ pᵢ². Ranges from 0 (pure) to 0.5 (maximally impure for binary). Default in sklearn.
Entropy: H = −Σ pᵢ log₂(pᵢ). Information Gain = parent entropy − weighted child entropy. Splits that create pure child nodes maximize gain.
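Both impurity formulas and information gain can be written in a few lines. This is a sketch with illustrative function names, using class labels as plain Python values.

```python
import math
from collections import Counter


def class_proportions(labels):
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    # Only classes that actually occur are counted, so log2(0) never arises.
    return -sum(p * math.log2(p) for p in class_proportions(labels))


def information_gain(parent, left, right):
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child


parent = [0, 0, 1, 1]
print(gini(parent))                              # 0.5
print(entropy(parent))                           # 1.0
print(information_gain(parent, [0, 0], [1, 1]))  # 1.0
```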
Pruning: Limit tree complexity to prevent overfitting. Pre-pruning: max_depth, min_samples_split, min_samples_leaf. Post-pruning: cost-complexity pruning (ccp_alpha in sklearn).
Set max_depth or min_samples_leaf, and use cross-validation to find the optimal depth. Trees are the base learner for Random Forests and Gradient Boosting.
Split selection: At each split, the algorithm evaluates all features and all thresholds to find the one that minimizes the weighted impurity of the two child nodes. This produces a binary tree (each split has exactly 2 branches).
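The split search can be sketched for a single numeric feature: try each midpoint between consecutive distinct values and keep the threshold with the lowest weighted Gini impurity. The function names and toy data are assumptions; a full tree builder would repeat this over every feature and recurse into the children.

```python
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    n = len(values)
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold, score = best_split(values, labels)
print(threshold, score)  # 6.5 0.0
```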
Random Forests
Random Forests are ensemble methods that aggregate many decision trees trained on random subsets of data and features. By averaging diverse trees, they dramatically reduce the variance (overfitting) of individual trees while maintaining low bias. One of the most reliable off-the-shelf algorithms in classical ML.
- Bootstrap sampling: Each tree is trained on a same-size random sample drawn with replacement, which contains ≈63% of the unique training points
- Feature randomness: At each split, only a random subset of features (√p for classification) is considered
- Aggregation: Classification → majority vote; Regression → mean prediction
- Out-of-bag (OOB): Samples not in bootstrap = free validation set
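The bootstrap and out-of-bag ideas can be checked empirically. The sketch below draws one bootstrap sample of indices with a fixed seed and measures how many unique points it covers; the helper names are illustrative.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    # Draw n indices with replacement from range(n).
    return [rng.randrange(n) for _ in range(n)]


def majority_vote(predictions):
    # Aggregation for classification: the most common predicted class wins.
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
n = 1000
in_bag = bootstrap_indices(n, rng)
unique_fraction = len(set(in_bag)) / n          # close to 1 - 1/e, about 0.63
oob = sorted(set(range(n)) - set(in_bag))       # indices this tree never saw

print(round(unique_fraction, 2), len(oob))
print(majority_vote([1, 0, 1]))  # 1
```

Each tree gets its own bootstrap sample, so every training point is out-of-bag for roughly a third of the trees and can be scored by those trees only.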
Feature importance: Measures how much each feature reduces impurity across all trees. Provides a ranked list of predictive power. Useful for feature selection, but can be biased toward high-cardinality or continuous features. Use SHAP values for more reliable attribution in production.
K-Nearest Neighbors
KNN makes predictions by finding the k most similar training examples to a query point and aggregating their labels. There is no training phase — the model is the data itself. Simple, interpretable, and surprisingly powerful, but slow at prediction time for large datasets.
For a new point, compute distance to all training points, find k nearest, then: classification = majority class vote; regression = mean of k neighbors' values. Distance metric and k are the critical choices.
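The procedure above can be sketched as a brute-force classifier using Euclidean distance. The function name `knn_predict` and the toy points are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    # Distance from the query to every training point, nearest first.
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    # Classification: majority vote among the k nearest neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


train_x = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_x, train_y, (2, 2)))  # a
print(knn_predict(train_x, train_y, (7, 8)))  # b
```

For regression, the vote would be replaced by the mean of the k neighbors' target values.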
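Scale features with StandardScaler or MinMaxScaler before KNN, because distances are otherwise dominated by features with larger ranges. KNN is also susceptible to the curse of dimensionality: performance degrades in high dimensions.
SVM Basics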
Support Vector Machines find the maximum-margin hyperplane: the decision boundary that maximizes the distance to the nearest data points (support vectors) from each class. SVMs are powerful for high-dimensional data and work well even when the number of features exceeds the number of samples.
- Hard margin: Requires perfect separation — no points inside the margin. Only works for linearly separable data.
- Soft margin (C param): Allows some misclassifications. High C = small margin, fewer errors (risk overfit). Low C = large margin, more errors (more robust).
- Kernel trick: Maps data to a higher-dimensional space implicitly, without computing the transformation explicitly
- Linear: Use for high-dimensional/text data
- RBF (Gaussian): Default; works for most non-linear problems. Tune C and γ
- Polynomial: For polynomial relationships
| Kernel | Formula | Use When |
|---|---|---|
| Linear | K(x,z) = xᵀz | High-dim, linearly separable (text/NLP) |
| RBF / Gaussian | K(x,z) = exp(−γ‖x−z‖²) | General non-linear; most common default |
| Polynomial | K(x,z) = (xᵀz + c)ᵈ | Known polynomial structure in features |
| Sigmoid | K(x,z) = tanh(αxᵀz + c) | Neural-net-like; rarely used |
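The kernel formulas in the table translate directly to code. The sketch also checks the "implicit mapping" claim for one case: the degree-2 polynomial kernel with c = 0 on 2-D inputs equals a dot product in an explicit 3-D feature space. Function names and default parameter values are illustrative assumptions.

```python
import math


def dot(x, z):
    return sum(a * b for a, b in zip(x, z))


def linear_kernel(x, z):
    return dot(x, z)


def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def polynomial_kernel(x, z, c=1.0, degree=2):
    return (dot(x, z) + c) ** degree


def sigmoid_kernel(x, z, alpha=0.1, c=0.0):
    return math.tanh(alpha * dot(x, z) + c)


def phi(v):
    # Explicit feature map whose dot product equals (x.z)^2 for 2-D inputs.
    x1, x2 = v
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))      # 11.0
print(polynomial_kernel(x, z))  # 144.0

implicit = polynomial_kernel(x, z, c=0.0, degree=2)  # never builds phi
explicit = dot(phi(x), phi(z))                       # builds phi, then dots
print(abs(implicit - explicit) < 1e-9)               # True
```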
Clustering — K-Means
K-Means partitions data into k clusters by iteratively assigning points to the nearest centroid and recomputing centroids as the mean of assigned points. It's an unsupervised algorithm — no labels needed. Convergence is guaranteed but may find a local minimum.
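A minimal sketch of the assign-then-update loop. The initial centroids are passed in explicitly to keep the example deterministic, which is an assumption of this sketch; real implementations typically pick them randomly or with k-means++.

```python
import math


def kmeans(points, centroids, max_iter=100):
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                dims = len(old)
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    # Inertia: within-cluster sum of squared distances.
    inertia = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, inertia


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, inertia = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids)  # approximately [(1.33, 1.33), (8.33, 8.33)]
```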
- Elbow method: Plot inertia (within-cluster sum of squares) vs k. The "elbow" is where adding more clusters yields diminishing returns.
- Silhouette score: Measures how similar a point is to its own cluster vs neighboring clusters. Range −1 to 1; higher is better.
- Gap statistic: Compares inertia to a null reference distribution.
- Assumes spherical, equally-sized clusters (fails on elongated or ring shapes)
- Sensitive to outliers: centroids are means, so a single extreme point can drag a centroid away or end up as its own cluster
- Must specify k in advance
- Results vary by initialization: run multiple times (n_init=10)
- Alternatives: DBSCAN (density, arbitrary shapes), Gaussian Mixture Models (probabilistic)
Accuracy
Accuracy measures the fraction of predictions the model got right. It's the most intuitive metric but can be deeply misleading on imbalanced datasets. A model that always predicts "Not Fraud" on a dataset with 99% non-fraud achieves 99% accuracy while being completely useless.
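The fraud example can be reproduced in a few lines. The dataset is synthetic: 1,000 transactions, of which 10 (1%) are fraud, and a "model" that always predicts the negative class.

```python
y_true = [1] * 10 + [0] * 990   # 1 = fraud, 0 = not fraud (synthetic data)
y_pred = [0] * 1000             # always predicts "Not Fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / sum(y_true)

print(accuracy)  # 0.99
print(recall)    # 0.0: not a single fraud case is detected
```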
Precision / Recall
Precision and Recall measure complementary aspects of a classifier's performance and trade off against each other. Adjusting the decision threshold moves you along the precision-recall curve: raising it typically improves precision at the cost of recall, and lowering it does the reverse.
Prioritize precision, TP / (TP+FP), when false positives are costly:
- Spam detection: a false positive (blocking real email) is costly
- Recommendation systems: irrelevant suggestions hurt UX
- Content moderation: wrongly banning users is severe
Prioritize recall, TP / (TP+FN), when false negatives are costly:
- Cancer screening: missing a true case is catastrophic
- Fraud detection: missing a fraud is more costly than a false alarm
- Security alerts: missing a real threat is unacceptable
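To make the threshold tradeoff concrete, the sketch below computes precision, TP / (TP+FP), and recall, TP / (TP+FN), at three thresholds. The labels and scores are made-up values.

```python
def precision_recall(y_true, scores, threshold):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]

print(precision_recall(y_true, scores, 0.8))   # (1.0, 0.5)   strict: precise, misses cases
print(precision_recall(y_true, scores, 0.5))   # (0.75, 0.75)
print(precision_recall(y_true, scores, 0.25))  # recall 1.0, precision drops to 4/6
```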
F1 Score
F1 is the harmonic mean of Precision and Recall. It punishes extreme imbalances — if either precision or recall is near zero, F1 will be low regardless of the other. The go-to single metric for imbalanced classification problems.
Fβ score: Fβ = (1+β²) × (P×R) / (β²×P + R). β < 1 weights precision more; β > 1 weights recall more. F2 (β=2) is used in retrieval, where missing items is costly.
- Macro F1: Unweighted average across classes — treats all classes equally
- Weighted F1: Average weighted by class frequency — useful when class imbalance matters
- Micro F1: Computed from global TP/FP/FN pooled over all classes; equals accuracy in single-label multiclass tasks
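A sketch of Fβ and the three averaging schemes on a small three-class example. The labels are made up, and the helper names are illustrative. In this single-label example, micro F1 comes out equal to plain accuracy.

```python
def f_beta(precision, recall, beta=1.0):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def class_counts(y_true, y_pred, label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    # Equivalent form of the harmonic mean: 2TP / (2TP + FP + FN).
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = ["a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c", "c"]
labels = sorted(set(y_true))

counts = {label: class_counts(y_true, y_pred, label) for label in labels}
per_class_f1 = {label: f1_from_counts(*counts[label]) for label in labels}

macro_f1 = sum(per_class_f1.values()) / len(labels)
weighted_f1 = sum(per_class_f1[l] * y_true.count(l) for l in labels) / len(y_true)
micro_f1 = f1_from_counts(
    sum(c[0] for c in counts.values()),
    sum(c[1] for c in counts.values()),
    sum(c[2] for c in counts.values()),
)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1, accuracy)  # 0.75 0.75
# With precision 1.0 and recall 0.5, F2 < F1 < F0.5 because recall is the weak side.
print(f_beta(1.0, 0.5, beta=2.0) < f_beta(1.0, 0.5) < f_beta(1.0, 0.5, beta=0.5))  # True
```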
Confusion Matrix
The confusion matrix gives a complete breakdown of prediction outcomes — it shows not just how many were wrong, but how they were wrong. From these four numbers, all other classification metrics can be derived.
- TP: Correctly predicted positive — model got it right
- TN: Correctly predicted negative — model got it right
- FP (Type I): Predicted positive, actually negative — false alarm
- FN (Type II): Predicted negative, actually positive — missed case
- Sensitivity / Recall / TPR: TP / (TP+FN)
- Specificity / TNR: TN / (TN+FP)
- Fall-out / FPR: FP / (FP+TN)
- Balanced Accuracy: (Sensitivity + Specificity) / 2
- MCC: Matthews Correlation Coefficient, (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Uses all four cells, ranges −1 to 1, and stays informative on imbalanced binary tasks
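The four counts and the derived metrics listed above, as a plain-Python sketch. The labels are made-up values and the function names are illustrative; MCC uses its standard formula.

```python
import math


def confusion_counts(y_true, y_pred):
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1  # Type I error: false alarm
        else:
            fn += 1  # Type II error: missed case
    return tp, tn, fp, fn


def derived_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fpr": fp / (fp + tn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 5 1 1
print(derived_metrics(tp, tn, fp, fn))
```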
ROC-AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate across all decision thresholds. AUC (Area Under the Curve) summarizes performance into a single number — representing the probability that the model ranks a random positive example higher than a random negative.
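The ranking interpretation gives a direct, if slow, way to compute AUC: count the fraction of (positive, negative) pairs in which the positive gets the higher score, with ties counting as half. This is a sketch for small inputs with made-up scores; libraries use faster sorting-based methods.

```python
def roc_auc(y_true, scores):
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
print(roc_auc(y_true, scores))  # 0.9375: 15 of 16 pairs are ranked correctly
```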
Training Pipeline
A production ML training pipeline is more than fitting a model — it's a reproducible, versioned sequence from raw data to deployable artifact. Every step must be trackable and independently testable.
Validation
Validation is the practice of estimating how well a model will generalize to unseen data. The central challenge is the bias-variance tradeoff — a model complex enough to learn the training data well may not generalize, and one too simple misses the signal entirely.
- Overfitting (high variance): Low training error, high validation error; the model memorizes rather than learns
- Symptoms: Training accuracy ≫ validation accuracy
- Fixes: More training data, regularization, simpler model, dropout, early stopping, data augmentation
- Underfitting (high bias): High training error and high validation error; the model is too simple to capture patterns
- Symptoms: Both training and validation accuracy are poor
- Fixes: More complex model, more features, reduce regularization, train longer
Train / Validation / Test Split
- Training set (60–80%): Model learns from this. Fit all parameters here.
- Validation set (10–20%): Tune hyperparameters. Select architecture. Compare models. Do NOT use for final evaluation.
- Test set (10–20%): Touched once, at the very end. The unbiased estimate of production performance. Peeking contaminates it.
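A minimal sketch of a shuffled three-way split with a fixed seed for reproducibility. The 60/20/20 proportions and the function name are assumptions within the ranges given above. Time-ordered data should be split chronologically instead of shuffled.

```python
import random


def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```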
Hyperparameter Tuning
Hyperparameters are the knobs set before training — they control the learning process itself (e.g., learning rate, tree depth, regularization strength). Unlike model parameters, they are not learned from data and must be searched or tuned externally.
Grid search: Exhaustive search over all combinations of a pre-defined hyperparameter grid. Guaranteed to find the best combination in the grid, but computationally expensive: cost grows multiplicatively with the number of hyperparameters.
Random search: Randomly samples combinations from the hyperparameter space. Often matches or outperforms grid search at the same budget because usually only a few hyperparameters matter, and random search tries more distinct values of those. Use when budget-constrained.
Bayesian optimization: Builds a probabilistic model of the objective function and uses it to select the most promising next configuration. Typically more sample-efficient than grid or random search. Libraries: Optuna, Hyperopt, BayesianOptimization.
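Grid and random search can be sketched against a toy objective. `validation_score` is a hypothetical stand-in for "train a model and return its validation score", and the search ranges are arbitrary choices for illustration. Bayesian optimization is not shown because it needs a surrogate model or an external library.

```python
import itertools
import random


def validation_score(learning_rate, depth):
    # Hypothetical objective with its optimum at learning_rate=0.1, depth=4.
    return -(learning_rate - 0.1) ** 2 - 0.01 * (depth - 4) ** 2


def grid_search(grid):
    names = list(grid)
    best_params, best_score = None, float("-inf")
    # Every combination is evaluated: 3 x 3 = 9 runs for the grid below.
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def random_search(n_trials, rng):
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "depth": rng.randint(2, 8),                 # inclusive on both ends
        }
        score = validation_score(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 6]}
grid_best, grid_score = grid_search(grid)
random_best, random_score = random_search(9, random.Random(0))  # same budget of 9 runs
print(grid_best)  # {'learning_rate': 0.1, 'depth': 4}
```

The grid only ever tries three learning rates, while nine random trials try nine different ones, which is the intuition behind random search's efficiency when one hyperparameter dominates.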
Cross-Validation
Cross-validation provides a more reliable estimate of model performance than a single train/val split by rotating which portion of data is used for validation. It uses every sample for both training and validation, giving a lower-variance estimate of generalization error.
- k-Fold CV: Split into k equal folds, rotate validation fold. k=5 or k=10 are standard.
- Stratified k-Fold: Preserves class proportions in each fold. Use for classification, especially imbalanced.
- Leave-One-Out (LOO): k = n. Nearly unbiased, but requires n model fits and the estimate can have high variance. Use only for tiny datasets.
- Time-Series CV: Always train on past, validate on future. Never shuffle time series data.
- Repeated k-Fold: Run k-fold multiple times with different shuffles — more reliable estimate.
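Two of the schemes above, sketched as index generators: plain k-fold (no shuffling, for brevity) and an expanding-window time-series split. The function names and the equal-block layout of the time-series variant are assumptions of this sketch, not a library API.

```python
def k_fold_indices(n_samples, k):
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def time_series_splits(n_samples, n_splits):
    # Train on everything before the validation block; never on the future.
    block = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, block * i))
        val_idx = list(range(block * i, min(block * (i + 1), n_samples)))
        yield train_idx, val_idx


folds = list(k_fold_indices(10, 3))
print([len(val) for _, val in folds])  # [4, 3, 3]

ts_folds = list(time_series_splits(10, 4))
print(ts_folds[0])  # ([0, 1], [2, 3])
```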
- Mean score: Expected performance on unseen data
- Standard deviation: Stability of the model — high std = unstable; consider more data or simpler model
- Rule of thumb: Report mean ± std. "F1 = 0.87 ± 0.03 (5-fold CV)" is a complete, honest result.
- Final model: After CV-based selection, retrain on ALL training data, evaluate once on test set.
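Reporting mean ± std takes only the standard library. The per-fold scores below are hypothetical, and the sample standard deviation (`statistics.stdev`) is one reasonable choice; some tools report the population standard deviation instead.

```python
import statistics

fold_scores = [0.84, 0.90, 0.87, 0.88, 0.86]  # hypothetical per-fold F1 scores

mean = statistics.mean(fold_scores)
std = statistics.stdev(fold_scores)  # sample standard deviation (n - 1)
report = f"F1 = {mean:.2f} ± {std:.2f} ({len(fold_scores)}-fold CV)"
print(report)  # F1 = 0.87 ± 0.02 (5-fold CV)
```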