Scikit-Learn Handbook: Reference Guide

📖

Introduction

What is scikit-learn and when to use it

Scikit-learn is an open-source machine learning library built on NumPy, SciPy, and Matplotlib. It provides simple and efficient tools for data mining and data analysis — covering classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Consistent API

Estimators share one interface: every estimator implements fit(), predictors add predict(), and transformers add transform(), which makes it easy to swap algorithms.

Battle-Tested

Developed in the open since 2007. Used at Spotify, Booking.com, J.P. Morgan, and in many other production ML pipelines.

Interoperable

Works with pandas DataFrames and NumPy arrays, and integrates with Optuna, MLflow, and ONNX for production.

ℹ️

Scope: scikit-learn is best for tabular data and classic ML. For deep learning or LLMs, prefer PyTorch / Hugging Face. For massive distributed workloads, consider Spark MLlib or Ray.

📦

Installation

pip, conda, and virtual environments

Via pip (recommended)

bash

# Create a virtual environment first (best practice)
python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

pip install scikit-learn
pip install scikit-learn numpy pandas matplotlib seaborn  # full stack

Via conda

bash

conda create -n sklearn-env python=3.11
conda activate sklearn-env
conda install -c conda-forge scikit-learn

Verify Installation

python

import sklearn
print(sklearn.__version__)  # e.g. 1.5.0

import numpy as np
import pandas as pd

⚡

Quick Start

Your first model in under 20 lines

The classic 5-step pattern — load data, split, fit, predict, evaluate — applies to virtually every scikit-learn workflow.

python

from sklearn.datasets       import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble        import RandomForestClassifier
from sklearn.metrics         import accuracy_score, classification_report

# 1. Load built-in dataset
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Instantiate and train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 4. Predict
y_pred = clf.predict(X_test)

# 5. Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))

✅

Pro tip: Always set random_state for reproducibility. Use stratify=y in train_test_split for imbalanced datasets to maintain class proportions in both splits.
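To make the effect of stratify=y concrete, here is a minimal sketch on synthetic data. The dataset, its roughly 90/10 class weights, and the variable names are illustrative assumptions, not part of the Quick Start example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1 (illustrative only)
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of the minority class in the full data and in each subset
frac_all = y.mean()
frac_train = y_train.mean()
frac_test = y_test.mean()
print(f"all={frac_all:.3f}  train={frac_train:.3f}  test={frac_test:.3f}")
```

With stratification the three fractions should agree closely; without it, a small test set can end up with noticeably more or fewer minority samples than the full data.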

🔄

ML Workflow

End-to-end machine learning pipeline stages

🗄️ Step 1: Data Load → 🔧 Step 2: Preprocess → ✂️ Step 3: Split → 🏋️ Step 4: Train → 📊 Step 5: Evaluate → 🚀 Step 6: Deploy

🗄️

Built-in Datasets

sklearn.datasets — toy, real-world, and generators

Function	Task	Samples	Features
load_iris()	Classification	150	4 numeric
load_digits()	Classification	1797	64 (8×8 images)
load_wine()	Classification	178	13 numeric
load_breast_cancer()	Binary Classification	569	30 numeric
load_diabetes()	Regression	442	10 numeric
load_boston() (removed in 1.2)	Regression	506	13 numeric
fetch_california_housing()	Regression	20640	8 numeric
fetch_20newsgroups()	Text Classification	18846	Text
make_classification()	Synthetic clf	Configurable	Configurable
make_regression()	Synthetic reg	Configurable	Configurable
make_blobs()	Clustering	Configurable	Configurable

python

from sklearn.datasets import load_breast_cancer, make_classification

# Load a real dataset as a DataFrame
data = load_breast_cancer(as_frame=True)
df   = data.frame
print(df.head())

# Generate synthetic data
X, y = make_classification(
    n_samples=1000, n_features=20,
    n_informative=10, n_redundant=5,
    class_sep=1.5, random_state=42
)

🔧

Preprocessing

sklearn.preprocessing — scaling, encoding, imputation

Scaling Numerical Features

python

from sklearn.preprocessing import (
    StandardScaler,    # mean=0, std=1
    MinMaxScaler,      # scale to [0, 1]
    RobustScaler,      # robust to outliers (uses IQR)
    Normalizer         # normalize each sample to unit norm
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only!
X_test_scaled  = scaler.transform(X_test)        # transform test separately

⚠️

Data Leakage Warning: Never call fit_transform() on your test set. Always fit() on training data only, then transform() on test data. Use Pipelines to enforce this automatically.
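One way to let a Pipeline enforce this is sketched below. It assumes the built-in breast cancer dataset and a logistic regression model purely for illustration; the point is only that the scaler lives inside the estimator passed to cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler is a pipeline step, so each CV fold re-fits it
# on that fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full X before calling cross_val_score would instead let every fold's held-out rows influence the scaler's mean and standard deviation.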

Encoding Categorical Features

python

from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

# One-hot for nominal categories (no ordering)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = ohe.fit_transform(X_cat)

# Label encoding for target variable y
le = LabelEncoder()
y_encoded = le.fit_transform(y_strings)

# Ordinal encoding for ordered categories (e.g. low/medium/high)
# `categories` takes one list per column, so this expects a single-column input
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ord = oe.fit_transform(X_cat)

Handling Missing Values

python

from sklearn.impute import SimpleImputer, KNNImputer

# Strategy: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# KNN imputation — better for correlated features
knn_imp = KNNImputer(n_neighbors=5)
X_knn = knn_imp.fit_transform(X)
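A small worked example may help here. The 3×2 array is invented for illustration; the sketch shows the per-column medians that SimpleImputer learns and where they are written.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [7.0, np.nan],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_missing)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_filled)             # NaNs replaced by the median of their column
```

Column 0 has observed values 1 and 7 (median 4), and column 1 has 2 and 4 (median 3), so those are the values filled in.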

🔗

Pipelines

Chain preprocessing + model into a single estimator

Pipelines guard against preprocessing leakage, simplify cross-validation, and make models deployable as single objects. Prefer Pipelines in production code.

python

from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.impute           import SimpleImputer
from sklearn.ensemble         import GradientBoostingClassifier

num_cols = ['age', 'salary', 'tenure']
cat_cols = ['department', 'region']

# Numerical pipeline
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler()),
])

# Categorical pipeline
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
])

# Full pipeline
full_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model',      GradientBoostingClassifier(n_estimators=200)),
])

full_pipe.fit(X_train, y_train)
y_pred = full_pipe.predict(X_test)

✅

Save & Load Pipelines: Use joblib.dump(full_pipe, 'model.pkl') and joblib.load('model.pkl') to persist your entire pipeline including fitted transformers.
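A round trip can be sketched as follows. The iris data, the small pipeline, and the temporary directory are assumptions made so the example runs on its own; substitute your own fitted pipeline and path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline carries the fitted scaler and model
same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print(same_predictions)
```

Because joblib files are pickle-based, only load files from sources you trust, and load them with the same scikit-learn version that wrote them where possible.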

✂️

Train / Test Split

sklearn.model_selection

python

from sklearn.model_selection import train_test_split

# Standard split — 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y           # preserves class distribution
)

# Three-way split: train / val / test
X_temp,  X_test,  y_temp,  y_test  = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)
# test_size here is 15% of the remaining 85%, i.e. about 12.75% of the full data
X_train, X_val,   y_train, y_val   = train_test_split(
    X_temp, y_temp, test_size=0.15, random_state=42, stratify=y_temp)

🎯

Supervised Learning

Classification and Regression algorithms

Linear Models

LinearRegression — OLS regression
LogisticRegression — classification
Ridge / Lasso — L2/L1 regularized
ElasticNet — combined L1+L2
SGDClassifier — scalable online learning

Tree-Based

DecisionTreeClassifier
DecisionTreeRegressor
RandomForestClassifier
GradientBoostingClassifier
HistGradientBoostingClassifier ⚡

Support Vector Machines

SVC — classification (kernel trick)
SVR — regression
LinearSVC — faster for large datasets
NuSVC / NuSVR

Neighbors & Naive Bayes

KNeighborsClassifier
KNeighborsRegressor
GaussianNB
MultinomialNB — text classification
BernoulliNB

Estimator Cheat Pattern

python

# Every supervised estimator follows this pattern:
model = SomeEstimator(**params)
model.fit(X_train, y_train)

y_pred  = model.predict(X_test)            # hard labels (or regression values)
y_prob  = model.predict_proba(X_test)      # class probabilities (classifiers that support it)
y_score = model.decision_function(X_test)  # raw scores (e.g. SVMs, linear classifiers)
score   = model.score(X_test, y_test)      # default metric (R² for regressors, accuracy for classifiers)

# Inspect hyperparameters and fitted attributes
print(model.get_params())          # hyperparameters
print(model.feature_importances_)  # tree-based models only
print(model.coef_)                 # linear models only
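The pattern above uses a placeholder estimator. A runnable sketch of the same idea, assuming the iris dataset and three arbitrarily chosen classifiers, shows how little changes when the algorithm is swapped and how to check for the optional methods.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=42),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)   # accuracy for classifiers
    # Optional methods differ by estimator, so check before calling them
    has_proba = hasattr(model, "predict_proba")
    has_decision = hasattr(model, "decision_function")
    print(f"{name}: acc={results[name]:.3f} "
          f"proba={has_proba} decision_function={has_decision}")
```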

🔍

Unsupervised Learning

Clustering and Dimensionality Reduction

Clustering

KMeans — partition-based
DBSCAN — density-based, finds outliers
AgglomerativeClustering
GaussianMixture — soft clustering
MeanShift

Dimensionality Reduction

PCA — linear, variance-maximizing
TruncatedSVD — sparse matrices
TSNE — 2D/3D visualization
UMAP — via umap-learn
NMF — non-negative matrix factorization

python

from sklearn.cluster         import KMeans, DBSCAN
from sklearn.decomposition   import PCA

# K-Means clustering
km = KMeans(n_clusters=3, n_init='auto', random_state=42)
labels = km.fit_predict(X_scaled)
print(km.inertia_)         # within-cluster sum of squares
print(km.cluster_centers_)

# DBSCAN — no need to specify k
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)
# -1 = noise/outlier

# PCA dimensionality reduction
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance per component

🌲

Ensemble Methods

Bagging, Boosting, Stacking, Voting

python

from sklearn.ensemble import (
    RandomForestClassifier,            # Bagging
    GradientBoostingClassifier,        # Boosting
    HistGradientBoostingClassifier,    # Fast boosting (like LightGBM)
    AdaBoostClassifier,                # Adaptive Boosting
    VotingClassifier,                  # Majority vote
    StackingClassifier,                # Meta-learner stacking
    BaggingClassifier,                 # Generic bagging
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm          import SVC

# ── Voting Classifier (soft vote: averages predicted probabilities) ──
voters = VotingClassifier(estimators=[
    ('rf',  RandomForestClassifier(n_estimators=100)),
    ('gb',  GradientBoostingClassifier()),
    ('svc', SVC(probability=True)),
], voting='soft')

# ── Stacking Classifier ──
stack = StackingClassifier(
    estimators=[
        ('rf',  RandomForestClassifier()),
        ('gb',  GradientBoostingClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

# ── HistGradientBoosting — handles NaN natively ──
hgb = HistGradientBoostingClassifier(
    max_iter=200, learning_rate=0.05,
    max_depth=6, l2_regularization=0.1
)
hgb.fit(X_train, y_train)

📊

Metrics

sklearn.metrics — evaluation for every task type

Classification Metrics

python

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, RocCurveDisplay
)

print(classification_report(y_test, y_pred, target_names=['neg', 'pos']))
print("ROC-AUC:", roc_auc_score(y_test, y_prob[:, 1]))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
RocCurveDisplay.from_predictions(y_test, y_prob[:, 1])
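To see how these numbers relate, here is a tiny hand-checkable sketch. The five labels are invented for illustration and the variable names are not taken from the snippets above.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

y_true = [0, 0, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] = [[1, 1], [1, 2]]
precision = precision_score(y_true, y_hat)   # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_hat)         # TP / (TP + FN) = 2 / 3
f1 = f1_score(y_true, y_hat)                 # harmonic mean of the two = 2 / 3
print(cm, precision, recall, f1)
```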

Regression Metrics

python

from sklearn.metrics import (
    mean_absolute_error,             # MAE
    mean_squared_error,              # MSE
    root_mean_squared_error,         # RMSE (scikit-learn >= 1.4)
    r2_score,                        # R²
    mean_absolute_percentage_error,  # MAPE
    explained_variance_score
)

mae  = mean_absolute_error(y_test, y_pred)
# mean_squared_error(..., squared=False) was deprecated in 1.4 and later removed
rmse = root_mean_squared_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R²={r2:.4f}")
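The arithmetic behind these metrics can be verified on three made-up points. This sketch takes the square root of the MSE by hand, which is one way to get RMSE that does not depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])           # errors: -1, 0, +2

mae = mean_absolute_error(y_true, y_hat)    # (1 + 0 + 2) / 3 = 1.0
mse = mean_squared_error(y_true, y_hat)     # (1 + 0 + 4) / 3 ≈ 1.667
rmse = np.sqrt(mse)                         # ≈ 1.291
r2 = r2_score(y_true, y_hat)                # 1 - SS_res / SS_tot = 1 - 5 / 8 = 0.375
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R²={r2:.4f}")
```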

Clustering Metrics

python

from sklearn.metrics import (
    silhouette_score,        # no ground truth needed
    adjusted_rand_score,     # requires ground truth
    normalized_mutual_info_score
)

sil = silhouette_score(X_scaled, labels)  # [-1, 1], higher is better
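For the ground-truth metrics, a short sketch with invented labels shows that they compare partitions rather than label ids: renaming the clusters does not change the score, while a trivial single-cluster answer scores zero on the adjusted Rand index.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 1, 1, 2, 2]
same_groups_renamed = [2, 2, 0, 0, 1, 1]   # identical partition, different label ids
one_big_cluster = [0, 0, 0, 0, 0, 0]

ari_perfect = adjusted_rand_score(truth, same_groups_renamed)
nmi_perfect = normalized_mutual_info_score(truth, same_groups_renamed)
ari_trivial = adjusted_rand_score(truth, one_big_cluster)
print(ari_perfect, nmi_perfect, ari_trivial)
```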

🔁

Cross-Validation

sklearn.model_selection — robust evaluation strategies

python

from sklearn.model_selection import (
    cross_val_score,         # simple k-fold CV
    cross_validate,          # multiple metrics
    StratifiedKFold,         # preserves class balance
    RepeatedStratifiedKFold, # repeat for stability
    LeaveOneOut              # LOO for small datasets
)

# Simple 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Multiple metrics in one pass
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'roc_auc'])

# Stratified for imbalanced classes
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
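The snippets above assume model, X, and y are already defined. A self-contained sketch, using the breast cancer dataset and a scaled logistic regression as stand-ins, shows what cross_validate returns.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

# One array per metric with one entry per fold, keyed as "test_<metric>",
# plus timing arrays "fit_time" and "score_time"
for key in sorted(results):
    print(key, results[key].round(3))
```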

🎛️

Hyperparameter Tuning

Grid search, random search, and Bayesian optimization

python

from sklearn.ensemble        import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# ── GridSearchCV — exhaustive search ──
param_grid = {
    'n_estimators':  [100, 300, 500],
    'max_depth':     [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}
gs = GridSearchCV(
    RandomForestClassifier(), param_grid,
    cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)

# ── RandomizedSearchCV — faster for large spaces ──
from scipy.stats import randint, uniform

param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth':    randint(3, 15),
    'max_features': uniform(0.3, 0.7),   # uniform(loc, scale): samples from [0.3, 1.0]
}
rs = RandomizedSearchCV(
    RandomForestClassifier(), param_dist,
    n_iter=50, cv=5, scoring='roc_auc',
    n_jobs=-1, random_state=42
)
rs.fit(X_train, y_train)
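The searches above tune a bare estimator. When the estimator is a Pipeline, parameters are addressed as step name, two underscores, then parameter name. The following sketch assumes the breast cancer dataset and hypothetical step names "scale" and "model".

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

# "<step name>__<parameter>" routes each value to the right pipeline step
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}

gs = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
gs.fit(X_train, y_train)

print(gs.best_params_, round(gs.best_score_, 4))
# score() uses the refit best pipeline and the roc_auc scorer
test_score = gs.score(X_test, y_test)
print(round(test_score, 4))
```

Searching over the whole pipeline keeps the scaler inside each CV fold, so the tuning scores are not inflated by preprocessing fitted on held-out data.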

🏷️

Use Case: Churn Prediction

End-to-end binary classification with imbalanced classes

Customer churn prediction is a canonical binary classification task — often with class imbalance (churned customers are rare). Here's a production-style pipeline:

python

import pandas as pd
from sklearn.pipeline        import Pipeline
from sklearn.compose         import ColumnTransformer
from sklearn.preprocessing   import StandardScaler, OneHotEncoder
from sklearn.ensemble         import HistGradientBoostingClassifier
from sklearn.model_selection  import StratifiedKFold, cross_validate
from sklearn.metrics          import classification_report, roc_auc_score
import joblib

# Assume df is a Pandas DataFrame with a 'churned' column
feature_cols = [c for c in df.columns if c != 'churned']
num_cols = df[feature_cols].select_dtypes('number').columns.tolist()
cat_cols = df[feature_cols].select_dtypes('object').columns.tolist()

X = df[feature_cols]
y = df['churned']

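# HistGradientBoostingClassifier does not accept sparse input,
# so ask the encoder for a dense array
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols),
])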

pipe = Pipeline([
    ('prep',  preprocessor),
    ('model', HistGradientBoostingClassifier(
        max_iter=300, class_weight='balanced',  # handles imbalance
        learning_rate=0.05, max_depth=5
    )),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X, y, cv=cv,
    scoring=['roc_auc', 'f1_weighted'], n_jobs=-1)

print("ROC-AUC:", results['test_roc_auc'].mean().round(4))

# Fit final model and persist
pipe.fit(X, y)
joblib.dump(pipe, 'churn_model.pkl')

📈

Use Case: House Price Prediction

Regression with feature engineering and Ridge

python

from sklearn.datasets       import fetch_california_housing
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler, PolynomialFeatures
from sklearn.linear_model    import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics         import mean_absolute_error, r2_score

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly',  PolynomialFeatures(degree=2, include_bias=False)),
    ('ridge', Ridge(alpha=10.0)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R²  : {r2_score(y_test, y_pred):.4f}")

🔵

Use Case: Customer Segmentation

K-Means clustering + PCA visualization

python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster       import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics       import silhouette_score

X_scaled = StandardScaler().fit_transform(X_customers)

# Elbow method to choose k
inertias = []
sil_scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Fit best k and visualize with PCA
best_k = 4
km_final = KMeans(n_clusters=best_k, n_init='auto', random_state=42)
labels = km_final.fit_predict(X_scaled)

X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.7)
plt.title('Customer Segments (PCA 2D)')
plt.show()
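The loop above collects silhouette scores but hard-codes best_k. One possible way to pick k from those scores is sketched below. It uses synthetic blobs as a stand-in for X_customers (an assumption, since the customer data is not shown), and silhouette is only one of several reasonable selection criteria.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for customer features: four well-separated groups
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_demo, _ = make_blobs(
    n_samples=400, centers=centers, cluster_std=1.0, random_state=42
)

k_values = list(range(2, 8))
sil_by_k = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    sil_by_k.append(silhouette_score(X_demo, km.fit_predict(X_demo)))

# Choose the k with the highest mean silhouette
best_k = k_values[int(np.argmax(sil_by_k))]
print(best_k)
```

On real customer data the silhouette curve is usually flatter than on this toy example, so it is worth reading it alongside the inertia elbow and business constraints rather than trusting the argmax alone.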

💬

Use Case: Text Classification

TF-IDF + Naive Bayes for spam / sentiment

python

from sklearn.datasets       import fetch_20newsgroups
from sklearn.pipeline        import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes     import MultinomialNB
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import classification_report

categories = ['sci.space', 'comp.graphics', 'talk.politics.guns', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=categories)
test  = fetch_20newsgroups(subset='test',  categories=categories)

# TF-IDF + Multinomial Naive Bayes pipeline
text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=30000, ngram_range=(1, 2),
        stop_words='english', sublinear_tf=True
    )),
    ('clf', MultinomialNB(alpha=0.1)),
])

text_pipe.fit(train.data, train.target)
y_pred = text_pipe.predict(test.data)

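# fetch_20newsgroups orders labels alphabetically, so map ids through
# train.target_names rather than the hand-written categories list
print(classification_report(test.target, y_pred, target_names=train.target_names))

# Predict on new text
sample = ["The rocket launched successfully from Kennedy Space Center"]
print(train.target_names[text_pipe.predict(sample)[0]])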

📋

API Reference

Core modules, classes, and key parameters

Universal Estimator API

Method	Description	Returns
fit(X, y)	Train the model on data	self
predict(X)	Predict target for X	array
predict_proba(X)	Class probability estimates	array [n, classes]
transform(X)	Apply transformation (transformers)	array
fit_transform(X)	Fit then transform in one step	array
score(X, y)	Default evaluation metric	float
get_params()	Get hyperparameter dict	dict
set_params(**p)	Set hyperparameters	self
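The last two rows of the table can be illustrated with a short sketch. LogisticRegression is used only as a convenient example estimator, and clone from sklearn.base is shown alongside because it builds on get_params().

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1.0)
params = lr.get_params()            # dict of constructor hyperparameters
returned = lr.set_params(C=0.5)     # updates in place and returns the same object

fresh = clone(lr)                   # unfitted copy with identical hyperparameters
print(params["C"], lr.C, returned is lr, fresh.get_params()["C"])
```

Because set_params() returns the estimator itself, calls can be chained, which is also how the search classes in sklearn.model_selection apply each candidate configuration.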

Key Module Index

Module	Purpose
`sklearn.datasets`	Built-in datasets and data generators
`sklearn.preprocessing`	Scaling, encoding, normalizing
`sklearn.impute`	Missing value strategies
`sklearn.pipeline`	Pipeline and FeatureUnion
`sklearn.compose`	ColumnTransformer, TransformedTargetRegressor
`sklearn.model_selection`	Splits, CV, grid search
`sklearn.linear_model`	Linear/logistic regression, Ridge, Lasso
`sklearn.tree`	Decision trees
`sklearn.ensemble`	RF, GBM, AdaBoost, Voting, Stacking
`sklearn.svm`	SVC, SVR, LinearSVC
`sklearn.neighbors`	KNN classifier and regressor
`sklearn.naive_bayes`	Gaussian, Multinomial, Bernoulli NB
`sklearn.cluster`	KMeans, DBSCAN, AgglomerativeClustering
`sklearn.decomposition`	PCA, NMF, TruncatedSVD
`sklearn.metrics`	All evaluation metrics
`sklearn.feature_selection`	SelectKBest, RFE, RFECV
`sklearn.feature_extraction.text`	CountVectorizer, TfidfVectorizer
`sklearn.inspection`	permutation_importance, partial_dependence

🗒️

Cheat Sheet

Quick-reference snippets for daily use

Feature Importance

import pandas as pd
feat_imp = pd.Series(
  rf.feature_importances_,
  index=feature_names
).sort_values(ascending=False)
feat_imp.head(10).plot(kind='bar')

Save / Load Model

import joblib
# Save
joblib.dump(model, 'model.pkl')
# Load
model = joblib.load('model.pkl')

Permutation Importance

from sklearn.inspection import permutation_importance
r = permutation_importance(
  model, X_test, y_test,
  n_repeats=10, random_state=42
)

SHAP Integration

import shap
explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X_test)
shap.summary_plot(sv, X_test)

📚

Further Reading: Official docs at scikit-learn.org/stable — includes the Algorithm Cheat Sheet, User Guide, and API reference. For production deployment, explore sklearn2pmml, sklearn-onnx, and BentoML integrations.
