Scikit-Learn Handbook: Reference Guide
Machine Learning in Python

Scikit-Learn
Handbook

A comprehensive reference guide for data scientists and ML engineers — covering installation, core APIs, algorithms, evaluation, pipelines, and real-world use cases.

Supervised Learning Unsupervised Learning Model Evaluation Pipelines Preprocessing
📖

Introduction

What is scikit-learn and when to use it

Scikit-learn is an open-source machine learning library built on NumPy, SciPy, and Matplotlib. It provides simple and efficient tools for data mining and data analysis — covering classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.

Consistent API

Every estimator follows the same fit() / predict() / transform() pattern, making it trivial to swap algorithms.

Battle-Tested

Implements well-studied algorithms and has been developed in the open since 2007. Used at Spotify, Booking.com, J.P. Morgan, and thousands of production ML pipelines worldwide.

Interoperable

Composes seamlessly with Pandas DataFrames, NumPy arrays, and integrates with Optuna, MLflow, and ONNX for production.
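To make the consistent-API point concrete, here is a minimal sketch that trains three unrelated classifiers through the same fit() / score() calls. The choice of models, the iris dataset, and the `candidates` and `scores` names are illustrative assumptions, not part of the handbook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Three different algorithms, one identical calling convention
# (the model choices are illustrative assumptions)
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                 # same call for every estimator
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name:>7}: {scores[name]:.3f}")
```

Swapping an algorithm only means changing one entry in the dictionary; the training and scoring loop stays untouched.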

ℹ️
Scope: scikit-learn is best for tabular data and classic ML. For deep learning or LLMs, prefer PyTorch / Hugging Face. For massive distributed workloads, consider Spark MLlib or Ray.
📦

Installation

pip, conda, and virtual environments

Via pip (recommended)

bash
# Create a virtual environment first (best practice)
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install scikit-learn
pip install scikit-learn numpy pandas matplotlib seaborn   # full stack

Via conda

bash
conda create -n sklearn-env python=3.11
conda activate sklearn-env
conda install -c conda-forge scikit-learn

Verify Installation

python
import sklearn
print(sklearn.__version__)  # e.g. 1.5.0

import numpy as np
import pandas as pd

Quick Start

Your first model in under 20 lines

The classic 5-step pattern — load data, split, fit, predict, evaluate — applies to virtually every scikit-learn workflow.

python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 1. Load built-in dataset
X, y = load_iris(return_X_y=True)

# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Instantiate and train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# 4. Predict
y_pred = clf.predict(X_test)

# 5. Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
Pro tip: Always set random_state for reproducibility. Use stratify=y in train_test_split for imbalanced datasets to maintain class proportions in both splits.
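The effect of stratify=y can be checked directly. The sketch below uses a synthetic, roughly 90/10 imbalanced dataset (an assumption made for illustration) and compares the minority-class share in the full data and in each split. The helper `minority_share` is hypothetical, not a scikit-learn function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, roughly 90/10 imbalanced data (illustrative assumption)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


def minority_share(labels):
    """Fraction of samples that belong to class 1 (hypothetical helper)."""
    return float(np.mean(labels == 1))


print(f"full : {minority_share(y):.3f}")
print(f"train: {minority_share(y_train):.3f}")
print(f"test : {minority_share(y_test):.3f}")
```

With stratification the three shares should agree up to rounding. Without it, a small test set can end up with noticeably fewer (or more) minority samples than the training set.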
🔄

ML Workflow

End-to-end machine learning pipeline stages
🗄️
Step 1: Data Load
Step 2: Preprocess
Step 3: Split
Step 4: Train
Step 5: Evaluate
Step 6: Deploy
🗄️

Built-in Datasets

sklearn.datasets — toy, real-world, and generators
Function | Task | Samples | Features
load_iris() | Classification | 150 | 4 numeric
load_digits() | Classification | 1797 | 64 (8×8 images)
load_wine() | Classification | 178 | 13 numeric
load_breast_cancer() | Binary classification | 569 | 30 numeric
load_diabetes() | Regression | 442 | 10 numeric
load_boston() (deprecated in 1.0, removed in 1.2) | Regression | 506 | 13 numeric
fetch_california_housing() | Regression | 20640 | 8 numeric
fetch_20newsgroups() | Text classification | 18846 | Text
make_classification() | Synthetic classification | Configurable | Configurable
make_regression() | Synthetic regression | Configurable | Configurable
make_blobs() | Clustering | Configurable | Configurable
python
from sklearn.datasets import load_breast_cancer, make_classification

# Load a real dataset as a DataFrame
data = load_breast_cancer(as_frame=True)
df = data.frame
print(df.head())

# Generate synthetic data
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, class_sep=1.5, random_state=42
)
🔧

Preprocessing

sklearn.preprocessing — scaling, encoding, imputation

Scaling Numerical Features

python
from sklearn.preprocessing import (
    StandardScaler,  # mean=0, std=1
    MinMaxScaler,    # scale to [0, 1]
    RobustScaler,    # robust to outliers (uses IQR)
    Normalizer,      # normalize each sample to unit norm
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only!
X_test_scaled = scaler.transform(X_test)        # transform test separately
⚠️
Data Leakage Warning: Never call fit_transform() on your test set. Always fit() on training data only, then transform() on test data. Use Pipelines to enforce this automatically.
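As a sketch of how a Pipeline enforces this rule during cross-validation: when the scaler is a pipeline step, each CV iteration refits it on the training folds only and merely applies it to the held-out fold. The dataset and the logistic-regression model are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so cross_val_score refits it on the
# training folds of every split; the held-out fold is only transformed.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)

print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Scaling the full X once before calling cross_val_score would let statistics from the held-out folds influence training, which is the leakage the warning describes.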

Encoding Categorical Features

python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

# One-hot for nominal categories (no ordering)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = ohe.fit_transform(X_cat)

# Label encoding for target variable y
le = LabelEncoder()
y_encoded = le.fit_transform(y_strings)

# Ordinal encoding for ordered categories (e.g. low/medium/high)
# `categories` takes one list per column, so X_cat must be a single column here
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ord = oe.fit_transform(X_cat)

Handling Missing Values

python
from sklearn.impute import SimpleImputer, KNNImputer

# Strategy: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)

# KNN imputation: better for correlated features
knn_imp = KNNImputer(n_neighbors=5)
X_knn = knn_imp.fit_transform(X)
🔗

Pipelines

Chain preprocessing + model into a single estimator

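Pipelines help prevent data leakage, simplify cross-validation, and make models deployable as single objects. Because every transformer is fitted only on the data passed to fit(), test data never influences preprocessing. Prefer Pipelines in production code.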

python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

num_cols = ['age', 'salary', 'tenure']
cat_cols = ['department', 'region']

# Numerical pipeline
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical pipeline
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols),
])

# Full pipeline
full_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200)),
])

full_pipe.fit(X_train, y_train)
y_pred = full_pipe.predict(X_test)
Save & Load Pipelines: Use joblib.dump(full_pipe, 'model.pkl') and joblib.load('model.pkl') to persist your entire pipeline including fitted transformers.
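A minimal round-trip sketch of the save and load step follows. It fits a small pipeline, writes it to a temporary directory, reloads it, and checks that the restored object predicts identically. The pipeline contents and the temporary path are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

# Persist and restore inside a throwaway directory (illustrative location)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The fitted scaler and the model travel together in one file
same = np.array_equal(pipe.predict(X), restored.predict(X))
print("Predictions identical after reload:", same)
```

joblib files are pickle-based, so load them only from trusted sources and with the same scikit-learn version that wrote them.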
✂️

Train / Test Split

sklearn.model_selection
python
from sklearn.model_selection import train_test_split

# Standard split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # preserves class distribution
)

# Three-way split: train / val / test
# First hold out 15% for test, then 15% of the remaining 85%
# (about 12.75% of all samples) for validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15, random_state=42)
🎯

Supervised Learning

Classification and Regression algorithms
Linear Models
  • LinearRegression — OLS regression
  • LogisticRegression — classification
  • Ridge / Lasso — L2/L1 regularized
  • ElasticNet — combined L1+L2
  • SGDClassifier — scalable online learning
Tree-Based
  • DecisionTreeClassifier
  • DecisionTreeRegressor
  • RandomForestClassifier
  • GradientBoostingClassifier
  • HistGradientBoostingClassifier
Support Vector Machines
  • SVC — classification (kernel trick)
  • SVR — regression
  • LinearSVC — faster for large datasets
  • NuSVC / NuSVR
Neighbors & Naive Bayes
  • KNeighborsClassifier
  • KNeighborsRegressor
  • GaussianNB
  • MultinomialNB — text classification
  • BernoulliNB

Estimator Cheat Pattern

python
# Every supervised estimator follows this pattern (SomeEstimator is a placeholder):
model = SomeEstimator(**params)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)             # hard labels (or values, for regressors)
y_prob = model.predict_proba(X_test)       # class probabilities (classifiers that support it)
y_score = model.decision_function(X_test)  # raw scores (e.g. SVM)
score = model.score(X_test, y_test)        # default metric (R² or accuracy)

# Access model parameters
print(model.get_params())
print(model.feature_importances_)  # tree-based models only
print(model.coef_)                 # linear models only
🔍

Unsupervised Learning

Clustering and Dimensionality Reduction
Clustering
  • KMeans — partition-based
  • DBSCAN — density-based, finds outliers
  • AgglomerativeClustering
  • GaussianMixture — soft clustering
  • MeanShift
Dimensionality Reduction
  • PCA — linear, variance-maximizing
  • TruncatedSVD — sparse matrices
  • TSNE — 2D/3D visualization
  • UMAP — via umap-learn
  • NMF — non-negative matrix factorization
python
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# K-Means clustering
km = KMeans(n_clusters=3, n_init='auto', random_state=42)
labels = km.fit_predict(X_scaled)
print(km.inertia_)          # within-cluster sum of squares
print(km.cluster_centers_)

# DBSCAN: no need to specify k
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)  # -1 = noise/outlier

# PCA dimensionality reduction
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # variance per component
🌲

Ensemble Methods

Bagging, Boosting, Stacking, Voting
python
from sklearn.ensemble import (
    RandomForestClassifier,          # Bagging
    GradientBoostingClassifier,      # Boosting
    HistGradientBoostingClassifier,  # Fast boosting (like LightGBM)
    AdaBoostClassifier,              # Adaptive Boosting
    VotingClassifier,                # Majority vote
    StackingClassifier,              # Meta-learner stacking
    BaggingClassifier,               # Generic bagging
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# ── Voting Classifier (soft vote: averages predicted probabilities) ──
voters = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(n_estimators=100)),
    ('gb', GradientBoostingClassifier()),
    ('svc', SVC(probability=True)),  # probability=True is required for soft voting
], voting='soft')

# ── Stacking Classifier ──
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier()),
        ('gb', GradientBoostingClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5
)

# ── HistGradientBoosting: handles NaN natively ──
hgb = HistGradientBoostingClassifier(
    max_iter=200,
    learning_rate=0.05,
    max_depth=6,
    l2_regularization=0.1
)
hgb.fit(X_train, y_train)
📊

Metrics

sklearn.metrics — evaluation for every task type

Classification Metrics

python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
    confusion_matrix, classification_report,
    ConfusionMatrixDisplay, RocCurveDisplay
)

print(classification_report(y_test, y_pred, target_names=['neg', 'pos']))
print("ROC-AUC:", roc_auc_score(y_test, y_prob[:, 1]))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
RocCurveDisplay.from_predictions(y_test, y_prob[:, 1])

Regression Metrics

python
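from sklearn.metrics import (
    mean_absolute_error,             # MAE
    mean_squared_error,              # MSE
    root_mean_squared_error,         # RMSE (scikit-learn >= 1.4)
    r2_score,                        # R²
    mean_absolute_percentage_error,  # MAPE
    explained_variance_score
)

mae = mean_absolute_error(y_test, y_pred)
# mean_squared_error(..., squared=False) is deprecated since 1.4;
# on older versions use that form instead of root_mean_squared_error
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R²={r2:.4f}")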

Clustering Metrics

python
from sklearn.metrics import (
    silhouette_score,     # no ground truth needed
    adjusted_rand_score,  # requires ground truth
    normalized_mutual_info_score
)

sil = silhouette_score(X_scaled, labels)  # [-1, 1], higher is better
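The difference between the two kinds of clustering metric can be sketched on synthetic blobs, where the true grouping is known. The blob centers, `cluster_std`, and the variable names are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three well-separated synthetic blobs with known membership (illustrative)
X, y_true = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)          # internal: uses only X and the labels
ari = adjusted_rand_score(y_true, labels)  # external: compares against ground truth

print(f"silhouette={sil:.3f}  ARI={ari:.3f}")
```

The adjusted Rand index ignores how cluster ids are numbered, so it needs no label matching. On real, unlabeled customer data only the silhouette-style metrics are available.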
🔁

Cross-Validation

sklearn.model_selection — robust evaluation strategies
python
from sklearn.model_selection import (
    cross_val_score,          # simple k-fold CV
    cross_validate,           # multiple metrics
    StratifiedKFold,          # preserves class balance
    RepeatedStratifiedKFold,  # repeat for stability
    LeaveOneOut               # LOO for small datasets
)

# Simple 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Multiple metrics in one pass
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'roc_auc'])

# Stratified for imbalanced classes
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
🎛️

Hyperparameter Tuning

Grid search, random search, and Bayesian optimization
python
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# ── GridSearchCV: exhaustive search ──
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
}
gs = GridSearchCV(
    RandomForestClassifier(), param_grid,
    cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)

# ── RandomizedSearchCV: faster for large spaces ──
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 15),
    'max_features': uniform(0.3, 0.7),  # uniform(loc, scale): samples from [0.3, 1.0]
}
rs = RandomizedSearchCV(
    RandomForestClassifier(), param_dist,
    n_iter=50, cv=5, scoring='roc_auc', n_jobs=-1, random_state=42
)
rs.fit(X_train, y_train)
🏷️

Use Case: Churn Prediction

End-to-end binary classification with imbalanced classes

Customer churn prediction is a canonical binary classification task — often with class imbalance (churned customers are rare). Here's a production-style pipeline:

python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
import joblib

# Assume df is a Pandas DataFrame with a 'churned' column
feature_cols = [c for c in df.columns if c != 'churned']
num_cols = df[feature_cols].select_dtypes('number').columns.tolist()
cat_cols = df[feature_cols].select_dtypes('object').columns.tolist()

X = df[feature_cols]
y = df['churned']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    # Dense output: HistGradientBoostingClassifier does not accept sparse matrices
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols),
])

pipe = Pipeline([
    ('prep', preprocessor),
    ('model', HistGradientBoostingClassifier(
        max_iter=300,
        class_weight='balanced',  # handles imbalance
        learning_rate=0.05,
        max_depth=5
    )),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X, y, cv=cv,
                         scoring=['roc_auc', 'f1_weighted'], n_jobs=-1)
print("ROC-AUC:", results['test_roc_auc'].mean().round(4))

# Fit final model and persist
pipe.fit(X, y)
joblib.dump(pipe, 'churn_model.pkl')
📈

Use Case: House Price Prediction

Regression with feature engineering and Ridge
python
from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('ridge', Ridge(alpha=10.0)),
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R² : {r2_score(y_test, y_pred):.4f}")
🔵

Use Case: Customer Segmentation

K-Means clustering + PCA visualization
python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# X_customers: numeric customer feature matrix, assumed to be loaded already
X_scaled = StandardScaler().fit_transform(X_customers)

# Elbow method (inertia) and silhouette scores to choose k
inertias = []
sil_scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))

# Fit best k and visualize with PCA
best_k = 4  # example value; pick it by inspecting inertias and sil_scores
km_final = KMeans(n_clusters=best_k, n_init='auto', random_state=42)
labels = km_final.fit_predict(X_scaled)

X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.7)
plt.title('Customer Segments (PCA 2D)')
plt.show()
💬

Use Case: Text Classification

TF-IDF + Naive Bayes for spam / sentiment
python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

categories = ['sci.space', 'comp.graphics', 'talk.politics.guns', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# TF-IDF + Multinomial Naive Bayes pipeline
text_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=30000,
        ngram_range=(1, 2),
        stop_words='english',
        sublinear_tf=True
    )),
    ('clf', MultinomialNB(alpha=0.1)),
])
text_pipe.fit(train.data, train.target)

y_pred = text_pipe.predict(test.data)
# Integer labels follow train.target_names (sorted), not the order of `categories`
print(classification_report(test.target, y_pred, target_names=train.target_names))

# Predict on new text
sample = ["The rocket launched successfully from Kennedy Space Center"]
print(train.target_names[text_pipe.predict(sample)[0]])
📋

API Reference

Core modules, classes, and key parameters

Universal Estimator API

Method | Description | Returns
fit(X, y) | Train the model on data | self
predict(X) | Predict target for X | array
predict_proba(X) | Class probability estimates | array [n, classes]
transform(X) | Apply transformation (transformers) | array
fit_transform(X) | Fit then transform in one step | array
score(X, y) | Default evaluation metric | float
get_params() | Get hyperparameter dict | dict
set_params(**p) | Set hyperparameters | self
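As a short sketch of get_params() and set_params() from the table above: inside a Pipeline, nested hyperparameters are addressed as step name, two underscores, then parameter name. The step names "scale" and "model" are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# Nested parameters use the <step name>__<parameter name> convention
default_C = pipe.get_params()["model__C"]
print("default C:", default_C)

# set_params returns the estimator itself, so the call can be chained
pipe.set_params(model__C=0.1, scale__with_mean=False)
print("updated C:", pipe.get_params()["model__C"])
```

The same double-underscore keys are what GridSearchCV and RandomizedSearchCV expect in a parameter grid when the estimator is a Pipeline.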

Key Module Index

Module | Purpose
sklearn.datasets | Built-in datasets and data generators
sklearn.preprocessing | Scaling, encoding, normalizing
sklearn.impute | Missing value strategies
sklearn.pipeline | Pipeline and FeatureUnion
sklearn.compose | ColumnTransformer, TransformedTargetRegressor
sklearn.model_selection | Splits, CV, grid search
sklearn.linear_model | Linear/logistic regression, Ridge, Lasso
sklearn.tree | Decision trees
sklearn.ensemble | RF, GBM, AdaBoost, Voting, Stacking
sklearn.svm | SVC, SVR, LinearSVC
sklearn.neighbors | KNN classifier and regressor
sklearn.naive_bayes | Gaussian, Multinomial, Bernoulli NB
sklearn.cluster | KMeans, DBSCAN, AgglomerativeClustering
sklearn.decomposition | PCA, NMF, TruncatedSVD
sklearn.metrics | All evaluation metrics
sklearn.feature_selection | SelectKBest, RFE, RFECV
sklearn.feature_extraction.text | CountVectorizer, TfidfVectorizer
sklearn.inspection | permutation_importance, partial_dependence
🗒️

Cheat Sheet

Quick-reference snippets for daily use
Feature Importance
import pandas as pd

feat_imp = pd.Series(
    rf.feature_importances_,
    index=feature_names
).sort_values(ascending=False)
feat_imp.head(10).plot(kind='bar')
Save / Load Model
import joblib

# Save
joblib.dump(model, 'model.pkl')

# Load
model = joblib.load('model.pkl')
Permutation Importance
from sklearn.inspection import permutation_importance

r = permutation_importance(
    model, X_test, y_test,
    n_repeats=10, random_state=42
)
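The snippet above stops at the result object `r`. A possible way to read it is sketched below, ranking features by mean importance. The dataset, the random-forest model, and the choice to show five features are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42, stratify=data.target
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

r = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

# importances_mean / importances_std hold one value per feature;
# r.importances holds the raw (n_features, n_repeats) score drops
top = np.argsort(r.importances_mean)[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} "
          f"{r.importances_mean[i]:.4f} ± {r.importances_std[i]:.4f}")
```

Computing the importances on held-out data, as here, measures what the model relies on for generalization rather than what it memorized during training.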
SHAP Integration
import shap

explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X_test)
shap.summary_plot(sv, X_test)
📚
Further Reading: Official docs at scikit-learn.org/stable — includes the Algorithm Cheat Sheet, User Guide, and API reference. For production deployment, explore sklearn2pmml, sklearn-onnx, and BentoML integrations.