Machine Learning in Python
Scikit-Learn
Handbook
A comprehensive reference guide for data scientists and ML engineers — covering installation, core APIs, algorithms, evaluation, pipelines, and real-world use cases.
Supervised Learning
Unsupervised Learning
Model Evaluation
Pipelines
Preprocessing
Scikit-learn is an open-source machine learning library built on NumPy, SciPy, and Matplotlib. It provides simple and efficient tools for data mining and data analysis — covering classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
Consistent API
Estimators share a small set of methods: fit() to learn from data, predict() on models, and transform() on preprocessors. Swapping one algorithm for another is usually a one-line change.
Battle-Tested
Backed by decades of research. Used at Spotify, Booking.com, J.P. Morgan, and thousands of production ML pipelines worldwide.
Interoperable
Works directly with Pandas DataFrames and NumPy arrays, and integrates with Optuna, MLflow, and ONNX for production.
ℹ️
Scope: scikit-learn is best for tabular data and classic ML. For deep learning or LLMs, prefer PyTorch / Hugging Face. For massive distributed workloads, consider Spark MLlib or Ray.
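As a rough illustration of the consistent API described above, the sketch below trains three unrelated classifiers with identical calls. The choice of models, the Iris dataset, and the `candidates` dictionary are assumptions made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Three unrelated algorithms, driven by exactly the same calls.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.3f}")
```

Only the constructor line differs between models; the training and scoring code stays the same.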
Via pip (recommended)
# Create a virtual environment first (best practice)
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install scikit-learn
pip install scikit-learn numpy pandas matplotlib seaborn # full stack
Via conda
conda create -n sklearn-env python=3.11
conda activate sklearn-env
conda install -c conda-forge scikit-learn
Verify Installation
import sklearn
print(sklearn.__version__) # e.g. 1.5.0
import numpy as np
import pandas as pd
The classic 5-step pattern — load data, split, fit, predict, evaluate — applies to virtually every scikit-learn workflow.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
# 1. Load built-in dataset
X, y = load_iris(return_X_y=True)
# 2. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# 3. Instantiate and train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
# 4. Predict
y_pred = clf.predict(X_test)
# 5. Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
✅
Pro tip: Always set random_state for reproducibility. Use stratify=y in train_test_split for imbalanced datasets to maintain class proportions in both splits.
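To see what stratify=y changes, the following sketch splits a small, made-up label vector (90 negatives, 10 positives) with and without stratification. The data is hypothetical and exists only to make the class counts easy to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("stratified positives:  ", int(y_test_strat.sum()), "of", len(y_test_strat))
print("unstratified positives:", int(y_test_plain.sum()), "of", len(y_test_plain))
```

The stratified test set keeps the original 10% positive rate (2 of 20 samples). The unstratified count depends on the random draw and can drift away from it.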
| Function | Task | Samples | Features |
| load_iris() | Classification | 150 | 4 numeric |
| load_digits() | Classification | 1797 | 64 (8×8 images) |
| load_wine() | Classification | 178 | 13 numeric |
| load_breast_cancer() | Binary Classification | 569 | 30 numeric |
| load_diabetes() | Regression | 442 | 10 numeric |
| load_boston() (deprecated in 1.0, removed in 1.2) | Regression | 506 | 13 numeric |
| fetch_california_housing() | Regression | 20640 | 8 numeric |
| fetch_20newsgroups() | Text Classification | 18846 | Text |
| make_classification() | Synthetic clf | Configurable | Configurable |
| make_regression() | Synthetic reg | Configurable | Configurable |
| make_blobs() | Clustering | Configurable | Configurable |
from sklearn.datasets import load_breast_cancer, make_classification
# Load a real dataset as a DataFrame
data = load_breast_cancer(as_frame=True)
df = data.frame
print(df.head())
# Generate synthetic data
X, y = make_classification(
n_samples=1000, n_features=20,
n_informative=10, n_redundant=5,
class_sep=1.5, random_state=42
)
Scaling Numerical Features
from sklearn.preprocessing import (
StandardScaler, # mean=0, std=1
MinMaxScaler, # scale to [0, 1]
RobustScaler, # robust to outliers (uses IQR)
Normalizer # normalize each sample to unit norm
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit on train only!
X_test_scaled = scaler.transform(X_test) # transform test separately
⚠️
Data Leakage Warning: Never call fit_transform() on your test set. Always fit() on training data only, then transform() on test data. Use Pipelines to enforce this automatically.
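One way to follow the warning above without manual bookkeeping is to put the scaler inside a pipeline and let cross-validation refit it per fold. This is a minimal sketch; the Wine dataset and LogisticRegression are arbitrary choices for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The scaler lives inside the pipeline, so every CV fold refits it
# on that fold's training rows only; held-out rows are just transformed.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

Scaling X once up front and then cross-validating would let statistics from the held-out rows influence the scaler, which is the leakage the warning describes.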
Encoding Categorical Features
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
# One-hot for nominal categories (no ordering)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
X_encoded = ohe.fit_transform(X_cat)
# Label encoding for target variable y
le = LabelEncoder()
y_encoded = le.fit_transform(y_strings)
# Ordinal encoding for ordered categories (e.g. low/med/high)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
X_ord = oe.fit_transform(X_cat)
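The handle_unknown='ignore' option matters once new data contains a category that was absent during fit. The sketch below uses a made-up single colour column to show the behaviour; the arrays are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column.
X_train_cat = np.array([["red"], ["green"], ["blue"]])
X_new_cat = np.array([["green"], ["purple"]])  # "purple" was never seen in fit

ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(X_train_cat)

# transform() returns a sparse matrix by default; convert it for printing.
encoded = ohe.transform(X_new_cat).toarray()
print(ohe.categories_[0])  # learned categories, in sorted order
print(encoded)
```

The unseen category is encoded as an all-zero row instead of raising an error, which is usually what a deployed model needs.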
Handling Missing Values
from sklearn.impute import SimpleImputer, KNNImputer
# Strategy: 'mean', 'median', 'most_frequent', 'constant'
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X)
# KNN imputation — better for correlated features
knn_imp = KNNImputer(n_neighbors=5)
X_knn = knn_imp.fit_transform(X)
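A tiny worked example can make the median strategy concrete. The array below is invented for illustration and includes one extreme value to show why the median is often preferred to the mean.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_missing = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [100.0, 40.0],  # 100.0 is an outlier in the first column
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_missing)

print(imputer.statistics_)  # per-column medians learned during fit
print(X_filled)
```

The learned medians are 2.0 and 30.0, so the outlier does not distort the fill value the way a mean of about 34.3 would in the first column.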
Pipelines prevent preprocessing leakage by fitting every transformer on the training data only. They also simplify cross-validation and make a model deployable as a single object. Prefer Pipelines in production code.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
num_cols = ['age', 'salary', 'tenure']
cat_cols = ['department', 'region']
# Numerical pipeline
num_pipe = Pipeline([
('impute', SimpleImputer(strategy='median')),
('scale', StandardScaler()),
])
# Categorical pipeline
cat_pipe = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(handle_unknown='ignore')),
])
# Combine with ColumnTransformer
preprocessor = ColumnTransformer([
('num', num_pipe, num_cols),
('cat', cat_pipe, cat_cols),
])
# Full pipeline
full_pipe = Pipeline([
('preprocess', preprocessor),
('model', GradientBoostingClassifier(n_estimators=200)),
])
full_pipe.fit(X_train, y_train)
y_pred = full_pipe.predict(X_test)
✅
Save & Load Pipelines: Use joblib.dump(full_pipe, 'model.pkl') and joblib.load('model.pkl') to persist your entire pipeline including fitted transformers.
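The save and load round trip can be sketched as follows. The small Iris pipeline and the temporary directory are assumptions for this example; the tip above uses 'model.pkl' in the working directory instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

same = np.array_equal(pipe.predict(X), restored.predict(X))
print("predictions identical after reload:", same)
```

Pickle-based files can execute code when loaded, so only load files from a trusted source, and load them with the same scikit-learn version that wrote them where possible.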
from sklearn.model_selection import train_test_split
# Standard split — 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=42,
stratify=y # preserves class distribution
)
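# Three-way split: train / val / test (roughly 72% / 13% / 15% of the data)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)
# test_size is a fraction of X_temp here, so validation is 0.15 * 0.85 ≈ 12.75% of the full dataset
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15, random_state=42)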
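If the validation and test sets should each be an exact share of the original data, one option is to pass integer sizes rather than fractions. The sketch below assumes a 70/15/15 target and synthetic data of 1,000 rows; both are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1,000 rows with balanced binary labels.
X = np.arange(1000).reshape(-1, 1)
y = np.tile([0, 1], 500)

# Integer sizes are absolute row counts, so the second split is not
# affected by the rows already removed by the first one.
n_test = int(round(0.15 * len(X)))
n_val = int(round(0.15 * len(X)))

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=n_test, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=n_val, random_state=42, stratify=y_temp
)
print(len(X_train), len(X_val), len(X_test))
```

With 1,000 rows this yields 700 training, 150 validation, and 150 test samples.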
Linear Models
LinearRegression — OLS regression
LogisticRegression — classification
Ridge / Lasso — L2/L1 regularized
ElasticNet — combined L1+L2
SGDClassifier — scalable online learning
Tree-Based
DecisionTreeClassifier
DecisionTreeRegressor
RandomForestClassifier
GradientBoostingClassifier
HistGradientBoostingClassifier ⚡
Support Vector Machines
SVC — classification (kernel trick)
SVR — regression
LinearSVC — faster for large datasets
NuSVC / NuSVR
Neighbors & Naive Bayes
KNeighborsClassifier
KNeighborsRegressor
GaussianNB
MultinomialNB — text classification
BernoulliNB
Estimator Cheat Pattern
# Most supervised estimators follow this pattern:
model = SomeEstimator(**params)
model.fit(X_train, y_train)
y_pred = model.predict(X_test) # hard labels (or values for regressors)
y_prob = model.predict_proba(X_test) # class probabilities (only classifiers that support it)
y_score = model.decision_function(X_test) # raw scores (e.g. SVMs and linear classifiers)
score = model.score(X_test, y_test) # default metric (R² for regressors, accuracy for classifiers)
# Access model parameters
print(model.get_params())
print(model.feature_importances_) # tree-based models
print(model.coef_) # linear models
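Not every estimator exposes every method in the pattern above. A quick, hedged way to check is hasattr; the four estimators below are just a sample chosen for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import LinearSVC

estimators = [
    LogisticRegression(),
    RandomForestClassifier(),
    LinearSVC(),
    LinearRegression(),
]

# Record which optional prediction methods each estimator provides.
support = {}
for est in estimators:
    name = type(est).__name__
    support[name] = {
        "predict_proba": hasattr(est, "predict_proba"),
        "decision_function": hasattr(est, "decision_function"),
    }
    print(name, support[name])
```

Checking before calling avoids an AttributeError when code has to work with arbitrary estimators, for example when looping over several candidate models.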
Clustering
KMeans — partition-based
DBSCAN — density-based, finds outliers
AgglomerativeClustering
GaussianMixture — soft clustering
MeanShift
Dimensionality Reduction
PCA — linear, variance-maximizing
TruncatedSVD — sparse matrices
TSNE — 2D/3D visualization
UMAP — via umap-learn
NMF — non-negative matrix factorization
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
# K-Means clustering
km = KMeans(n_clusters=3, n_init='auto', random_state=42)
labels = km.fit_predict(X_scaled)
print(km.inertia_) # within-cluster sum of squares
print(km.cluster_centers_)
# DBSCAN — no need to specify k
db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X_scaled)
# -1 = noise/outlier
# PCA dimensionality reduction
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_) # variance per component
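Instead of fixing the number of components, PCA also accepts a float between 0 and 1 as n_components and keeps just enough components to explain that share of the variance. The sketch below assumes a 95% target and the digits dataset, both chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) means: keep enough components to reach this variance share.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(f"kept {pca.n_components_} of {X.shape[1]} components")
print(f"variance explained: {cumulative[-1]:.3f}")
```

The cumulative sum of explained_variance_ratio_ is also a convenient basis for a scree plot when choosing the threshold by eye.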
from sklearn.ensemble import (
RandomForestClassifier, # Bagging
GradientBoostingClassifier, # Boosting
HistGradientBoostingClassifier, # Fast boosting (like LightGBM)
AdaBoostClassifier, # Adaptive Boosting
VotingClassifier, # Majority vote
StackingClassifier, # Meta-learner stacking
BaggingClassifier, # Generic bagging
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# ── Voting Classifier (soft vote: averages predicted probabilities) ──
voters = VotingClassifier(estimators=[
('rf', RandomForestClassifier(n_estimators=100)),
('gb', GradientBoostingClassifier()),
('svc', SVC(probability=True)),
], voting='soft')
# ── Stacking Classifier ──
stack = StackingClassifier(
estimators=[
('rf', RandomForestClassifier()),
('gb', GradientBoostingClassifier()),
],
final_estimator=LogisticRegression(),
cv=5
)
# ── HistGradientBoosting — handles NaN natively ──
hgb = HistGradientBoostingClassifier(
max_iter=200, learning_rate=0.05,
max_depth=6, l2_regularization=0.1
)
hgb.fit(X_train, y_train)
Classification Metrics
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, average_precision_score,
confusion_matrix, classification_report,
ConfusionMatrixDisplay, RocCurveDisplay
)
print(classification_report(y_test, y_pred, target_names=['neg', 'pos']))
print("ROC-AUC:", roc_auc_score(y_test, y_prob[:, 1]))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
RocCurveDisplay.from_predictions(y_test, y_prob[:, 1])
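To connect the reported metrics to the confusion matrix, the sketch below recomputes precision and recall by hand on ten made-up binary predictions and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions for a binary task.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here both values come out at 0.80, matching precision_score(y_true, y_hat) and recall_score(y_true, y_hat).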
Regression Metrics
from sklearn.metrics import (
mean_absolute_error, # MAE
mean_squared_error, # MSE
r2_score, # R²
mean_absolute_percentage_error, # MAPE
explained_variance_score
)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5 # squared=False was deprecated in 1.4 and later removed
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.3f} RMSE={rmse:.3f} R²={r2:.4f}")
Clustering Metrics
from sklearn.metrics import (
silhouette_score, # no ground truth needed
adjusted_rand_score, # requires ground truth
normalized_mutual_info_score
)
sil = silhouette_score(X_scaled, labels) # [-1, 1], higher is better
from sklearn.model_selection import (
cross_val_score, # simple k-fold CV
cross_validate, # multiple metrics
StratifiedKFold, # preserves class balance
RepeatedStratifiedKFold, # repeat for stability
LeaveOneOut # LOO for small datasets
)
# Simple 5-fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
# Multiple metrics in one pass
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'roc_auc'])
# Stratified for imbalanced classes
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
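cross_validate returns a dictionary rather than a single array, which is easy to overlook. The sketch below shows its keys for a two-metric run; the dataset and model are arbitrary choices for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])

# One array per key, with one entry per fold.
for key in sorted(results):
    print(f"{key}: {results[key].mean():.3f}")
```

Each requested metric appears under a test_ prefix, next to fit_time and score_time. Passing return_train_score=True adds matching train_ entries.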
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# ── GridSearchCV — exhaustive search ──
param_grid = {
'n_estimators': [100, 300, 500],
'max_depth': [3, 5, 7, None],
'min_samples_split': [2, 5, 10],
}
gs = GridSearchCV(
RandomForestClassifier(), param_grid,
cv=5, scoring='roc_auc', n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
# ── RandomizedSearchCV — faster for large spaces ──
from scipy.stats import randint, uniform
param_dist = {
'n_estimators': randint(50, 500),
'max_depth': randint(3, 15),
'max_features': uniform(0.3, 0.7),
}
rs = RandomizedSearchCV(
RandomForestClassifier(), param_dist,
n_iter=50, cv=5, scoring='roc_auc',
n_jobs=-1, random_state=42
)
rs.fit(X_train, y_train)
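The SciPy distributions used in param_dist have conventions that are easy to misread. The sketch below samples from the same two distributions to show the ranges they actually cover.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale]: here 0.3 to 1.0, not 0.3 to 0.7.
max_features_dist = uniform(0.3, 0.7)
# randint(low, high) excludes high: here the integers 3 through 14.
max_depth_dist = randint(3, 15)

feature_draws = max_features_dist.rvs(size=1000, random_state=0)
depth_draws = max_depth_dist.rvs(size=1000, random_state=0)

print(f"max_features range: {feature_draws.min():.3f} to {feature_draws.max():.3f}")
print(f"max_depth range:    {depth_draws.min()} to {depth_draws.max()}")
```

Sampling a distribution before handing it to RandomizedSearchCV is a cheap sanity check that the search space is the one intended.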
Customer churn prediction is a canonical binary classification task — often with class imbalance (churned customers are rare). Here's a production-style pipeline:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import classification_report, roc_auc_score
import joblib
# Assume df is a Pandas DataFrame with a 'churned' column
feature_cols = [c for c in df.columns if c != 'churned']
num_cols = df[feature_cols].select_dtypes('number').columns.tolist()
cat_cols = df[feature_cols].select_dtypes('object').columns.tolist()
X = df[feature_cols]
y = df['churned']
preprocessor = ColumnTransformer([
('num', StandardScaler(), num_cols),
('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_cols), # dense output: HistGradientBoosting does not accept sparse input
])
pipe = Pipeline([
('prep', preprocessor),
('model', HistGradientBoostingClassifier(
max_iter=300, class_weight='balanced', # handles imbalance
learning_rate=0.05, max_depth=5
)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X, y, cv=cv,
scoring=['roc_auc', 'f1_weighted'], n_jobs=-1)
print("ROC-AUC:", results['test_roc_auc'].mean().round(4))
# Fit final model and persist
pipe.fit(X, y)
joblib.dump(pipe, 'churn_model.pkl')
from sklearn.datasets import fetch_california_housing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
data = fetch_california_housing(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(degree=2, include_bias=False)),
('ridge', Ridge(alpha=10.0)),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(f"MAE : {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R² : {r2_score(y_test, y_pred):.4f}")
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
X_scaled = StandardScaler().fit_transform(X_customers)
# Elbow method to choose k
inertias = []
sil_scores = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sil_scores.append(silhouette_score(X_scaled, labels))
# Fit best k and visualize with PCA
best_k = 4 # example value; choose k from the elbow in inertias and the peak in sil_scores
km_final = KMeans(n_clusters=best_k, n_init='auto', random_state=42)
labels = km_final.fit_predict(X_scaled)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', alpha=0.7)
plt.title('Customer Segments (PCA 2D)')
plt.show()
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
categories = ['sci.space', 'comp.graphics', 'talk.politics.guns', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
# TF-IDF + Multinomial Naive Bayes pipeline
text_pipe = Pipeline([
('tfidf', TfidfVectorizer(
max_features=30000, ngram_range=(1, 2),
stop_words='english', sublinear_tf=True
)),
('clf', MultinomialNB(alpha=0.1)),
])
text_pipe.fit(train.data, train.target)
y_pred = text_pipe.predict(test.data)
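# fetch_20newsgroups sorts target names alphabetically, so map label indices through train.target_names
print(classification_report(test.target, y_pred, target_names=train.target_names))
# Predict on new text
sample = ["The rocket launched successfully from Kennedy Space Center"]
print(train.target_names[text_pipe.predict(sample)[0]])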
Universal Estimator API
| Method | Description | Returns |
| fit(X, y) | Train the model on data | self |
| predict(X) | Predict target for X | array |
| predict_proba(X) | Class probability estimates | array [n, classes] |
| transform(X) | Apply transformation (transformers) | array |
| fit_transform(X) | Fit then transform in one step | array |
| score(X, y) | Default evaluation metric | float |
| get_params() | Get hyperparameter dict | dict |
| set_params(**p) | Set hyperparameters | self |
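The same method contract applies to custom components. The sketch below defines a hypothetical ClipOutliers transformer (not part of scikit-learn) to show how BaseEstimator and TransformerMixin supply get_params() and fit_transform() once fit() and transform() are written.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        # Store constructor arguments unchanged so get_params() can find them.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # A trailing underscore marks attributes learned from data.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self  # returning self allows call chaining

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.low_, self.high_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
clipper = ClipOutliers(lower=0.0, upper=0.75)
X_clipped = clipper.fit_transform(X)  # inherited from TransformerMixin
print(clipper.get_params())           # inherited from BaseEstimator
print(X_clipped.ravel())
```

Because it follows the table above, a class like this can be used as a step inside a Pipeline or ColumnTransformer like any built-in transformer.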
Key Module Index
| Module | Purpose |
| sklearn.datasets | Built-in datasets and data generators |
| sklearn.preprocessing | Scaling, encoding, normalizing |
| sklearn.impute | Missing value strategies |
| sklearn.pipeline | Pipeline and FeatureUnion |
| sklearn.compose | ColumnTransformer, TransformedTargetRegressor |
| sklearn.model_selection | Splits, CV, grid search |
| sklearn.linear_model | Linear/logistic regression, Ridge, Lasso |
| sklearn.tree | Decision trees |
| sklearn.ensemble | RF, GBM, AdaBoost, Voting, Stacking |
| sklearn.svm | SVC, SVR, LinearSVC |
| sklearn.neighbors | KNN classifier and regressor |
| sklearn.naive_bayes | Gaussian, Multinomial, Bernoulli NB |
| sklearn.cluster | KMeans, DBSCAN, AgglomerativeClustering |
| sklearn.decomposition | PCA, NMF, TruncatedSVD |
| sklearn.metrics | All evaluation metrics |
| sklearn.feature_selection | SelectKBest, RFE, RFECV |
| sklearn.feature_extraction.text | CountVectorizer, TfidfVectorizer |
| sklearn.inspection | permutation_importance, partial_dependence |
Feature Importance
import pandas as pd
feat_imp = pd.Series(
rf.feature_importances_,
index=feature_names
).sort_values(ascending=False)
feat_imp.head(10).plot(kind='bar')
Save / Load Model
import joblib
# Save
joblib.dump(model, 'model.pkl')
# Load
model = joblib.load('model.pkl')
Permutation Importance
from sklearn.inspection import permutation_importance
r = permutation_importance(
model, X_test, y_test,
n_repeats=10, random_state=42
)
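The object returned by permutation_importance holds per-feature means and standard deviations. The sketch below shows one way to rank features from it; the random forest and breast cancer dataset are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

r = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=42)

# Rank features by the mean drop in score when the feature is shuffled.
top = r.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:<25} "
          f"{r.importances_mean[i]:.4f} ± {r.importances_std[i]:.4f}")
```

Unlike feature_importances_, this is computed on held-out data and works for any fitted estimator, not just tree-based ones.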
SHAP Integration
import shap
explainer = shap.TreeExplainer(rf)
sv = explainer.shap_values(X_test)
shap.summary_plot(sv, X_test)
📚
Further Reading: Official docs at scikit-learn.org/stable — includes the Algorithm Cheat Sheet, User Guide, and API reference. For production deployment, explore sklearn2pmml, sklearn-onnx, and BentoML integrations.