Data Analysis in Python

Pandas
Handbook

A comprehensive reference for data scientists and analysts — covering DataFrames, Series, I/O, cleaning, transformation, aggregation, time series, and production patterns with pandas v2.x.


Introduction

What pandas is, when to use it, and its core philosophy

Pandas is the foundational library for data analysis and manipulation in Python. Built on top of NumPy, it provides two primary data structures — DataFrame (2D table) and Series (1D labeled array) — along with a vast toolkit for reading, cleaning, transforming, and aggregating data.

Labeled Data

Rows and columns have labels (indices), making data alignment automatic and joins intuitive — unlike raw NumPy arrays.

Expressive API

From method chaining to SQL-like queries, pandas offers multiple styles. Operations that take dozens of lines in other languages are one-liners.

Ecosystem Hub

Every major Python data and ML library — scikit-learn, XGBoost, Plotly, seaborn — accepts DataFrames as first-class input.
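The automatic index alignment described under Labeled Data can be sketched in a few lines — adding two Series whose indices differ in order and membership (the labels here are made up for illustration):

```python
import pandas as pd

q1 = pd.Series({'Eng': 10, 'Mktg': 5})
q2 = pd.Series({'Mktg': 7, 'Eng': 3, 'Finance': 2})

# '+' aligns on labels, not positions: Eng = 10 + 3, Mktg = 5 + 7
total = q1 + q2
print(total['Eng'], total['Mktg'])          # 13.0 12.0

# Labels missing on one side become NaN unless you supply a fill value
print(pd.isna(total['Finance']))            # True
print(q1.add(q2, fill_value=0)['Finance'])  # 2.0
```

With raw NumPy arrays the same addition would silently pair elements by position; here the labels decide what gets added to what.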

🆕 Pandas v2.x: Introduces Copy-on-Write (CoW) semantics for better memory behavior, improved Arrow-backed dtypes, and nullable integer/boolean types. Set pd.options.mode.copy_on_write = True to opt in early.
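A minimal sketch of what CoW changes, assuming pandas ≥ 2.0 (the column name is illustrative):

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # opt in (default behavior from pandas 3.0)

df = pd.DataFrame({'a': [1, 2, 3]})
col = df['a']      # no eager copy — df and col share memory for now
col.iloc[0] = 99   # the first write copies just this column

print(col.iloc[0])     # 99
print(df.loc[0, 'a'])  # 1 — the parent DataFrame is untouched
```

Without CoW, whether `col` was a view or a copy of `df` was hard to predict; with it, any derived object behaves like a copy the moment you write to it.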

Installation

pip, conda, and optional dependencies
bash
# Minimal install
pip install pandas

# Full data science stack
pip install pandas numpy matplotlib seaborn openpyxl xlrd pyarrow fastparquet

# All optional dependencies
pip install "pandas[all]"

# Conda
conda install -c conda-forge pandas

Optional Dependencies by Task

Task | Package(s)
Excel read/write | openpyxl, xlrd, xlwt
Parquet | pyarrow or fastparquet
SQL databases | sqlalchemy, pymysql, psycopg2
Arrow-backed dtypes | pyarrow ≥ 7.0
HTML parsing | lxml, html5lib, beautifulsoup4
HDF5 | tables (PyTables)
Compression | zlib, bz2, lz4 (stdlib or pip)
python
import pandas as pd
import numpy as np

print(pd.__version__)   # e.g. 2.2.1
pd.show_versions()      # full dependency report

Quick Start

Create, load, explore — your first DataFrame in minutes
python
import pandas as pd
import numpy as np

# ── Create from a dictionary ──
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'age': [29, 34, 27, 41],
    'salary': [82000, 95000, 71000, 110000],
    'dept': ['Eng', 'Mktg', 'Eng', 'Finance'],
})
print(df)
#     name  age  salary     dept
# 0  Alice   29   82000      Eng
# 1    Bob   34   95000     Mktg
# 2  Carol   27   71000      Eng
# 3   Dave   41  110000  Finance

# ── Quick exploration ──
df.head(3)      # first 3 rows
df.tail(2)      # last 2 rows
df.shape        # (4, 4)
df.dtypes       # column types
df.describe()   # summary statistics
df.info()       # memory, nulls, dtypes

Sample DataFrame Output

    name  age  salary     dept
0  Alice   29   82000      Eng
1    Bob   34   95000     Mktg
2  Carol   27   71000      Eng
3   Dave   41  110000  Finance

dtypes: object(2), int64(2) · memory: 352 bytes

Data Structures

DataFrame, Series, Index — and how they relate
DataFrame

A 2D labeled table — the core data structure. Think of it as a dict of Series sharing the same index. Columns can have different dtypes.

  • df.columns — column labels
  • df.index — row labels (RangeIndex by default)
  • df.values — underlying NumPy array
  • df.shape — (rows, cols) tuple
Series

A 1D labeled array — a single column (or row) of a DataFrame. Has both values and an index.

  • s.values — underlying array
  • s.index — labels
  • s.name — column name when extracted from df
  • s.dtype — data type
python
# ── Series ──
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
s['b']       # → 20
s[s > 15]    # → Series(b=20, c=30)

# ── DataFrame creation patterns ──
# From list of dicts
df = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'y': 4}])

# From NumPy array
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

# With custom index
df = pd.DataFrame(data, index=pd.date_range('2024-01-01', periods=5))

# Column extraction returns a Series
s_age = df['age']                 # Series
df_sub = df[['name', 'salary']]   # DataFrame (double brackets)

Read & Write

CSV, Excel, JSON, Parquet, SQL, HTML, Clipboard

Reading Data

python
# ── CSV ──
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv',
    sep=';',                 # delimiter
    header=0,                # row to use as header
    index_col='id',          # set as index
    usecols=['a', 'b'],      # load only these columns
    dtype={'id': 'Int64'},   # explicit types
    parse_dates=['date'],    # parse date columns
    nrows=10000,             # read first N rows
    skiprows=2,              # skip header rows
    na_values=['N/A', '-'],  # custom null markers
)

# ── Excel ──
df = pd.read_excel('report.xlsx', sheet_name='Sales', engine='openpyxl')
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)   # dict of DFs

# ── Parquet (preferred for large data) ──
df = pd.read_parquet('data.parquet', columns=['col1', 'col2'])

# ── JSON ──
df = pd.read_json('data.json', orient='records')

# ── SQL ──
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://user:pw@host/db')
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)

# ── URL / remote ──
df = pd.read_csv('https://example.com/data.csv')

Writing Data

python
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False, sheet_name='Results')
df.to_parquet('output.parquet', compression='snappy')
df.to_json('output.json', orient='records', indent=2)
df.to_sql('table_name', engine, if_exists='replace', index=False)

# Multiple sheets in one Excel file
with pd.ExcelWriter('multi.xlsx', engine='openpyxl') as writer:
    df_sales.to_excel(writer, sheet_name='Sales', index=False)
    df_costs.to_excel(writer, sheet_name='Costs', index=False)
    df_kpi.to_excel(writer, sheet_name='KPIs', index=False)
✅ Use Parquet for large datasets. It is 3–10× faster to read than CSV, stores data types, supports column pruning, and is the default format for most modern data platforms (Spark, DuckDB, Snowflake).

Inspect & Profile

Understanding your DataFrame before doing anything
python
# ── Basic inspection ──
df.shape                        # (rows, cols)
df.dtypes                       # dtype per column
df.info(memory_usage='deep')    # dtypes + non-null counts + memory
df.describe()                   # count, mean, std, min, quartiles, max
df.describe(include='all')      # includes object/categorical columns
df.describe(include='object')   # string columns only

# ── Missing value audit ──
df.isnull().sum()          # null count per column
df.isnull().mean() * 100   # null percentage per column
df.notnull().all(axis=1)   # True for rows with no nulls

# ── Unique values ──
df['dept'].unique()                       # array of unique values
df['dept'].nunique()                      # count of unique values
df['dept'].value_counts()                 # frequency table
df['dept'].value_counts(normalize=True)   # proportions

# ── Column stats ──
df.corr(numeric_only=True)   # correlation matrix
df['salary'].skew()          # skewness
df['salary'].quantile([.25, .5, .75])

# ── Memory ──
df.memory_usage(deep=True)                   # bytes per column
df.memory_usage(deep=True).sum() / 1024**2   # total MB

Selection & Indexing

loc, iloc, at, iat — the four indexers
.loc[] — Label-based

Select by row/column labels. Inclusive on both ends when slicing.

.iloc[] — Position-based

Select by integer position. Exclusive end, like standard Python slicing.

python
# ── Column selection ──
df['salary']             # single column → Series
df[['name', 'salary']]   # multiple columns → DataFrame

# ── .loc[row_label, col_label] ──
df.loc[0]                 # row 0 (by label)
df.loc[0:2]               # rows 0,1,2 inclusive
df.loc[0, 'name']         # scalar value
df.loc[:, 'age':'dept']   # all rows, col slice by label
df.loc[df['age'] > 30]    # boolean mask as row selector
df.loc[df['age'] > 30, ['name', 'salary']]   # mask + cols

# ── .iloc[row_int, col_int] ──
df.iloc[0]              # first row
df.iloc[0:3]            # rows 0,1,2 (exclusive end)
df.iloc[0, 2]           # row 0, col 2 (scalar)
df.iloc[-5:]            # last 5 rows
df.iloc[:, [0, 2, 3]]   # cols by position list

# ── Fast scalar access ──
df.at[0, 'name']   # label-based scalar (faster than loc)
df.iat[0, 0]       # position-based scalar (fastest)
⚠️ Avoid chained indexing: df['col'][0] = value triggers a chained-assignment warning in pandas v2 and may write to a temporary copy instead of df (under Copy-on-Write it never reaches the original). Always use df.loc[0, 'col'] = value to write values back reliably.
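To make the difference concrete, a small sketch with CoW enabled as recommended above:

```python
import pandas as pd

pd.options.mode.copy_on_write = True
df = pd.DataFrame({'col': [1, 2, 3]})

# Chained form — two separate indexing calls; the write lands on an
# intermediate object and never reaches df under Copy-on-Write:
# df['col'][0] = 99

# Single .loc call — one indexing operation that writes directly into df:
df.loc[0, 'col'] = 99
print(df.loc[0, 'col'])   # 99
```

The rule of thumb: any time a `[` appears twice on the left of an `=`, rewrite it as one `.loc[...]` call.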

Filtering

Boolean masks, query(), isin(), between()
python
# ── Boolean masks ──
df[df['age'] > 30]
df[(df['age'] > 30) & (df['dept'] == 'Eng')]
df[(df['salary'] >= 80000) | (df['dept'] == 'Finance')]
df[~(df['dept'] == 'Mktg')]   # NOT

# ── .query() — readable SQL-like strings ──
df.query("age > 30 and dept == 'Eng'")
df.query("salary.between(70000, 100000)")   # requires pandas >= 2.0
threshold = 80000
df.query("salary > @threshold")   # @ references Python vars

# ── .isin() — membership test ──
df[df['dept'].isin(['Eng', 'Finance'])]
df[~df['dept'].isin(['Mktg'])]   # exclude

# ── .between() ──
df[df['salary'].between(70000, 100000)]   # inclusive by default

# ── .str contains / startswith ──
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('li', case=False, na=False)]

# ── Null filtering ──
df[df['salary'].notna()]   # exclude NaN rows
df[df['salary'].isna()]    # only NaN rows

Missing Data

Detect, drop, fill, and interpolate NaN values
python
# ── Detect ──
df.isnull().sum()   # nulls per column
(df.isnull().sum() / len(df) * 100).round(2)   # percentage

# ── Drop ──
df.dropna()                    # drop rows with any NaN
df.dropna(how='all')           # drop rows where ALL values are NaN
df.dropna(subset=['salary'])   # drop rows where 'salary' is NaN
df.dropna(axis=1, thresh=3)    # drop cols with fewer than 3 non-null values

# ── Fill ──
df.fillna(0)   # fill all NaN with 0
df.fillna({'age': df['age'].median(), 'dept': 'Unknown'})
df['salary'].fillna(df['salary'].mean())
df.ffill()     # forward fill (propagate last valid value)
df.bfill()     # backward fill

# ── Interpolate (numeric columns) ──
df['price'].interpolate(method='linear')
df['price'].interpolate(method='time')   # requires DatetimeIndex

Data Types

Casting, category dtype, nullable types, and Arrow backend
python
# ── Casting ──
df['age'] = df['age'].astype('int32')   # reduce memory
df['salary'] = df['salary'].astype('float32')
df['flag'] = df['flag'].astype('bool')

# ── Category dtype (for low-cardinality strings) ──
df['dept'] = df['dept'].astype('category')   # huge memory savings
df['dept'].cat.categories                    # view categories
df['dept'].cat.add_categories('HR')

# ── Nullable integer / boolean (pandas v1+) ──
df['age'] = df['age'].astype('Int64')   # capital I — supports NaN
df['active'] = df['active'].astype('boolean')

# ── Arrow-backed dtypes (pandas v2+, requires pyarrow) ──
df = df.convert_dtypes(dtype_backend='pyarrow')   # fast + null-safe

# ── Bulk numeric conversion ──
df = df.apply(pd.to_numeric, errors='ignore')   # leaves non-numeric cols as-is;
                                                # deprecated in 2.2 — prefer
                                                # errors='coerce' per column

# ── Type-safe numeric conversion ──
df['score'] = pd.to_numeric(df['score'], errors='coerce')   # bad values → NaN
df['created'] = pd.to_datetime(df['created'], errors='coerce')

Apply & Map

apply, map, applymap, assign, pipe — transformation patterns
python
# ── Series.map() — element-wise on Series ──
df['dept_code'] = df['dept'].map({'Eng': 1, 'Mktg': 2, 'Finance': 3})
df['name_upper'] = df['name'].map(str.upper)
df['tax'] = df['salary'].map(lambda x: x * 0.3)

# ── DataFrame.apply() — apply function along axis ──
df.apply(lambda col: col.max() - col.min(), axis=0)        # column-wise
df.apply(lambda row: row['salary'] / row['age'], axis=1)   # row-wise

# ── DataFrame.map() (v2.1+, replaces applymap) — element-wise ──
df_nums.map(lambda x: round(x, 2))

# ── .assign() — add or overwrite columns, chainable ──
df = (df
    .assign(
        salary_k = df['salary'] / 1000,
        tax = lambda d: d['salary'] * 0.3,
        net_pay = lambda d: d['salary'] - d['tax'],
        seniority = lambda d: pd.cut(d['age'], bins=[0, 30, 40, 100],
                                     labels=['junior', 'mid', 'senior']),
    )
)

# ── .pipe() — chain custom functions ──
def add_bonus(df, rate=0.1):
    return df.assign(bonus=df['salary'] * rate)

df = df.pipe(add_bonus, rate=0.15).pipe(some_other_fn)
⚡ Performance: Prefer vectorized operations (df['col'] * 2) over .apply() whenever possible. .apply() uses a Python loop internally and can be 10–100× slower on large DataFrames.
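A rough timing sketch of that gap (exact numbers vary by machine; the column name is illustrative):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.rand(1_000_000)})

t0 = time.perf_counter()
vectorized = df['col'] * 2                    # one C-level NumPy operation
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
applied = df['col'].apply(lambda x: x * 2)    # one Python call per element
t_app = time.perf_counter() - t0

print(f"vectorized: {t_vec:.4f}s  apply: {t_app:.4f}s")
assert vectorized.equals(applied)             # same result, very different cost
```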

String Operations

The .str accessor — vectorized string methods
python
s = df['name']

# ── Case ──
s.str.lower()
s.str.upper()
s.str.title()

# ── Trim & pad ──
s.str.strip()                       # strip whitespace
s.str.lstrip('$')
s.str.rstrip()
s.str.pad(width=10, fillchar='0')   # zero-pad
s.str.zfill(5)

# ── Search & replace ──
s.str.contains('ali', case=False, na=False)   # → bool Series
s.str.startswith('A')
s.str.endswith('e')
s.str.replace('Eng', 'Engineering', regex=False)
s.str.replace(r'\d+', 'NUM', regex=True)      # regex

# ── Split & extract ──
s.str.split(',', expand=True)   # returns DataFrame
s.str.split(',').str[0]         # first part
s.str.extract(r'(\d{4})')       # capture group → column
s.str.extractall(r'(\w+)')      # all matches

# ── Length & count ──
s.str.len()
s.str.count('a')    # count occurrences of 'a'

# ── Slice ──
s.str[0:3]    # first 3 chars

DateTime

The .dt accessor, date ranges, offsets, and resampling
python
# ── Parse dates ──
df['date'] = pd.to_datetime(df['date_str'])
df['date'] = pd.to_datetime(df['date_str'], format='%d/%m/%Y')

# ── .dt accessor ──
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()   # 'Monday' etc.
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['is_month_end'] = df['date'].dt.is_month_end

# ── Date ranges ──
dates = pd.date_range('2024-01-01', periods=365, freq='D')
bdays = pd.bdate_range('2024-01-01', '2024-12-31')   # business days

# ── Timedelta arithmetic ──
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['deadline'] = df['date'] + pd.DateOffset(months=3)

# ── Timezone ──
df['date'] = df['date'].dt.tz_localize('UTC')
df['date'] = df['date'].dt.tz_convert('Asia/Kolkata')

Reshape & Pivot

melt, pivot_table, stack, unstack, explode
python
# ── pivot_table — SQL GROUP BY equivalent ──
pivot = df.pivot_table(
    values='salary',
    index='dept',
    columns='seniority',
    aggfunc=['mean', 'count'],
    fill_value=0,
)

# ── melt — wide to long (unpivot) ──
df_long = df.melt(
    id_vars=['name', 'dept'],
    value_vars=['q1_sales', 'q2_sales', 'q3_sales'],
    var_name='quarter',
    value_name='sales',
)

# ── pivot — long to wide ──
df_wide = df_long.pivot(index='name', columns='quarter', values='sales')

# ── stack / unstack ──
stacked = df.stack()       # cols → row index level
unstacked = df.unstack()   # last index level → cols

# ── explode — lists in cells → rows ──
df_exp = df.explode('tags')   # df['tags'] contains lists

# ── crosstab — frequency table ──
ct = pd.crosstab(df['dept'], df['seniority'], normalize='index')

GroupBy

Split-apply-combine — the most powerful pandas pattern
python
# ── Basic aggregation ──
df.groupby('dept')['salary'].mean()
df.groupby('dept')['salary'].agg(['mean', 'median', 'std', 'count'])

# ── Multiple columns & named aggs ──
df.groupby('dept').agg(
    avg_salary = ('salary', 'mean'),
    max_age = ('age', 'max'),
    headcount = ('name', 'count'),
    salary_std = ('salary', 'std'),
)

# ── Multiple group keys ──
df.groupby(['dept', 'seniority'])['salary'].mean()

# ── transform — broadcast result back to original shape ──
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['salary_zscore'] = df.groupby('dept')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# ── filter — keep groups that pass a condition ──
df.groupby('dept').filter(lambda g: g['salary'].mean() > 85000)

# ── apply — arbitrary function per group ──
def top_earners(group, n=2):
    return group.nlargest(n, 'salary')

df.groupby('dept', group_keys=False).apply(top_earners)

# ── Rank within group ──
df['dept_rank'] = df.groupby('dept')['salary'].rank(
    method='dense', ascending=False
)

Merge & Join

pd.merge, DataFrame.join, pd.concat
python
# ── pd.merge — SQL-style joins ──
# INNER JOIN (default)
result = pd.merge(df_left, df_right, on='id')

# LEFT / RIGHT / OUTER JOIN
result = pd.merge(df_left, df_right, on='id', how='left')
result = pd.merge(df_left, df_right, on='id', how='outer')

# Different column names
result = pd.merge(df_orders, df_customers,
                  left_on='customer_id', right_on='id', how='left')

# Multi-key join
result = pd.merge(df1, df2, on=['dept', 'year'])

# Suffix for overlapping column names
result = pd.merge(df1, df2, on='id', suffixes=('_prev', '_curr'))

# ── pd.concat — stacking DataFrames ──
df_all = pd.concat([df1, df2, df3], ignore_index=True)   # row-wise
df_wide = pd.concat([df1, df2], axis=1)                  # column-wise
df_all = pd.concat([df1, df2], keys=['2023', '2024'])    # multi-index

# ── Validate joins (detect duplicates / missing) ──
pd.merge(df1, df2, on='id', validate='m:1')    # raises if not many-to-one
pd.merge(df1, df2, on='id', indicator=True)    # adds _merge column

Window Functions

rolling, expanding, ewm — moving statistics
python
# ── Rolling window ──
df['ma7'] = df['revenue'].rolling(window=7).mean()   # 7-day MA
df['roll_std'] = df['revenue'].rolling(30, min_periods=7).std()
df['ma30'] = df['revenue'].rolling('30D').mean()     # time-offset

# ── Expanding window (cumulative) ──
df['cum_max'] = df['revenue'].expanding().max()
df['cum_sum'] = df['revenue'].cumsum()
df['cum_pct'] = df['revenue'].cumsum() / df['revenue'].sum()

# ── Exponentially Weighted Moving Average ──
df['ewma'] = df['revenue'].ewm(span=12, adjust=False).mean()

# ── Rolling corr / cov ──
df['roll_corr'] = df['a'].rolling(60).corr(df['b'])

# ── Resample time series ──
df_monthly = df.resample('ME').sum()   # month-end
df_weekly = df.resample('W').agg({     # per-column agg
    'revenue': 'sum',
    'users': 'mean',
})

Use Case: EDA Workflow

Systematic exploratory data analysis on a new dataset
python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('sales.csv', parse_dates=['date'])

# ── 1. Shape & types ──
print(df.shape)
print(df.dtypes)

# ── 2. Missing value audit ──
missing = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing[missing > 0])

# ── 3. Summary stats ──
print(df.describe(percentiles=[.05, .25, .5, .75, .95]))

# ── 4. Target distribution ──
df['revenue'].plot(kind='hist', bins=50, title='Revenue Distribution')
plt.show()

# ── 5. Categorical value counts ──
for col in df.select_dtypes('object').columns:
    print(df[col].value_counts().head(10))

# ── 6. Correlation matrix ──
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# ── 7. Temporal trend ──
df.set_index('date')['revenue'].resample('ME').sum().plot(title='Monthly Revenue')
plt.show()

Use Case: Data Cleaning Pipeline

Production-grade cleaning with method chaining

Build a reproducible cleaning pipeline using method chaining with .pipe() — each function has a single responsibility and can be tested independently.

python
import pandas as pd
import numpy as np

def clean_column_names(df):
    """Lowercase, strip, replace spaces with underscores."""
    df.columns = (df.columns
        .str.strip()
        .str.lower()
        .str.replace(r'[\s\-/]+', '_', regex=True)
        .str.replace(r'[^a-z0-9_]', '', regex=True)
    )
    return df

def drop_high_null_cols(df, threshold=0.8):
    """Drop columns with more than `threshold` fraction of nulls."""
    null_frac = df.isnull().mean()
    return df.drop(columns=null_frac[null_frac > threshold].index)

def deduplicate(df, subset=None):
    """Remove duplicate rows."""
    before = len(df)
    df = df.drop_duplicates(subset=subset)
    print(f"Removed {before - len(df)} duplicates")
    return df

def fix_dtypes(df):
    """Infer and cast types."""
    df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df['status'] = df['status'].astype('category')
    return df

def remove_outliers(df, col, n_std=3):
    """Remove rows where `col` is beyond n_std standard deviations."""
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]

# ── Apply the cleaning pipeline ──
df_clean = (
    pd.read_csv('raw_data.csv')
    .pipe(clean_column_names)
    .pipe(drop_high_null_cols, threshold=0.7)
    .pipe(deduplicate)
    .pipe(fix_dtypes)
    .pipe(remove_outliers, col='amount')
    .dropna(subset=['amount', 'created_at'])
    .reset_index(drop=True)
)
df_clean.to_parquet('clean_data.parquet')

Use Case: Time Series Analysis

Daily sales data — resampling, rolling stats, YoY comparison
python
import pandas as pd

df = pd.read_csv('daily_sales.csv', parse_dates=['date'], index_col='date')

# ── Resampling ──
weekly = df['revenue'].resample('W').sum()
monthly = df['revenue'].resample('ME').agg(['sum', 'mean', 'max'])

# ── Moving averages ──
df['ma7'] = df['revenue'].rolling(7).mean()
df['ma30'] = df['revenue'].rolling(30).mean()

# ── YoY growth ──
monthly['yoy'] = monthly['sum'].pct_change(periods=12) * 100

# ── Lag features (for ML) ──
df['lag_1'] = df['revenue'].shift(1)
df['lag_7'] = df['revenue'].shift(7)
df['lag_30'] = df['revenue'].shift(30)

# ── Day-of-week patterns ──
df['dow'] = df.index.day_name()
dow_avg = df.groupby('dow')['revenue'].mean().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

# ── Fill missing dates (business days) ──
full_idx = pd.bdate_range(df.index.min(), df.index.max())
df = df.reindex(full_idx).ffill()

Use Case: Reporting & Export

Styled Excel reports, aggregated summaries, formatted output
python
import pandas as pd

# ── Build summary table ──
summary = (df
    .groupby(['region', 'product'])
    .agg(
        total_revenue = ('revenue', 'sum'),
        avg_order = ('revenue', 'mean'),
        order_count = ('order_id', 'count'),
    )
    .reset_index()
    .sort_values('total_revenue', ascending=False)
)

# ── Styled DataFrame (Jupyter/HTML) ──
(summary.style
    .format({'total_revenue': '${:,.0f}', 'avg_order': '${:,.2f}'})
    .background_gradient(subset=['total_revenue'], cmap='YlGn')
    .bar(subset=['order_count'], color='#5dbcd2')
    .highlight_max(subset=['total_revenue'], color='#ffe0b2')
    .set_caption('Sales Summary by Region & Product')
)

# ── Multi-sheet Excel export ──
with pd.ExcelWriter('monthly_report.xlsx', engine='openpyxl') as writer:
    summary.to_excel(writer, sheet_name='Summary', index=False)
    df.to_excel(writer, sheet_name='Raw Data', index=False)
    dow_avg.to_excel(writer, sheet_name='DOW Patterns')

# ── Console-friendly display ──
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 120)

Performance Tips

Memory optimization, vectorization, and when to leave pandas
Memory Reduction
  • Use category for low-cardinality strings
  • Downcast numerics: int64→int32, float64→float32
  • Use read_csv(usecols=...) to load only needed cols
  • Read in chunks: chunksize=100_000
  • Store as Parquet, not CSV
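The category conversion from the first bullet can be checked directly — a sketch with a synthetic low-cardinality column (values and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# 1M rows but only 4 distinct strings — ideal for the category dtype
s_obj = pd.Series(np.random.choice(['Eng', 'Mktg', 'Finance', 'HR'],
                                   size=1_000_000))
s_cat = s_obj.astype('category')

obj_mb = s_obj.memory_usage(deep=True) / 1024**2
cat_mb = s_cat.memory_usage(deep=True) / 1024**2
print(f"object: {obj_mb:.1f} MB  category: {cat_mb:.1f} MB")
# category stores each distinct string once plus small integer codes per row
```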
Speed Patterns
  • Vectorized ops > .apply() > Python loops
  • .query() faster than boolean indexing on large DFs
  • Use numba or cython for custom row functions
  • scikit-learn's HistGradientBoosting handles NaN natively (skip the imputation step)
  • Consider DuckDB, Polars, or Dask for >1GB data
python
# ── Read CSV in chunks ──
chunks = pd.read_csv('huge.csv', chunksize=100_000)
result = pd.concat(
    [chunk[chunk['status'] == 'active'] for chunk in chunks],
    ignore_index=True
)

# ── Downcasting loop ──
for col in df.select_dtypes(['int64']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(['float64']).columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

# ── np.select over nested apply ──
conditions = [df['salary'] > 100000, df['salary'] > 70000]
choices = ['high', 'mid']
df['band'] = np.select(conditions, choices, default='low')

# ── eval() for large expression chains ──
df.eval('net = revenue - cost', inplace=True)
df.eval('margin = (revenue - cost) / revenue * 100', inplace=True)
🚀 Beyond pandas: For datasets > 1 GB, consider Polars (Rust-based, lazy execution, 5–20× faster), DuckDB (SQL on DataFrames in-process), or Dask (distributed pandas). All export back to pandas DataFrames seamlessly.

API Reference

Core methods organized by category

DataFrame β€” Essential Methods

Method | Category | Description
df.head(n) / df.tail(n) | Inspect | First / last n rows
df.info() | Inspect | Schema, dtypes, null counts, memory
df.describe() | Inspect | Summary statistics
df.shape / df.dtypes | Inspect | Dimensions and column types
df.loc[] / df.iloc[] | Select | Label / position indexing
df.query() | Select | SQL-like filter expression
df.assign() | Transform | Add/overwrite columns (chainable)
df.apply() | Transform | Apply function along axis
df.map() | Transform | Element-wise function (v2.1+)
df.pipe() | Transform | Chain custom functions
df.rename() | Transform | Rename columns / index
df.drop() | Transform | Remove rows or columns
df.sort_values() | Sort | Sort by one or more columns
df.sort_index() | Sort | Sort by index
df.groupby() | Aggregate | Split-apply-combine
df.pivot_table() | Aggregate | Spreadsheet-style pivot
df.melt() | Reshape | Wide → long (unpivot)
df.fillna() / df.dropna() | Missing | Fill or remove NaN
df.drop_duplicates() | Clean | Remove duplicate rows
df.astype() | Dtypes | Cast column dtype
df.merge() | Combine | SQL-style join
df.set_index() / df.reset_index() | Index | Promote column to index / vice versa
df.rolling() / df.expanding() | Window | Sliding / cumulative window
df.resample() | Window | Time-based resampling
df.to_csv() / df.to_parquet() | I/O | Write to file
df.style | Display | Styled Jupyter / HTML output
df.memory_usage() | Perf | Memory per column
df.eval() | Perf | Evaluate expression (fast)

Useful Top-Level Functions

Function | Description
pd.read_csv() / pd.read_excel() | Read tabular files
pd.read_parquet() / pd.read_sql() | Read Parquet / SQL databases
pd.merge() | SQL-style join of two DataFrames
pd.concat() | Stack DataFrames row- or column-wise
pd.to_datetime() | Parse strings/numbers to datetime
pd.to_numeric() | Coerce column to numeric
pd.get_dummies() | One-hot encode categorical columns
pd.cut() / pd.qcut() | Bin continuous values (uniform / quantile)
pd.crosstab() | Frequency cross-tabulation
pd.date_range() / pd.bdate_range() | Generate date sequences
pd.DataFrame.from_records() | Build from list of tuples/dicts
pd.json_normalize() | Flatten nested JSON to DataFrame
pd.options / pd.set_option() | Configure display and behavior
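pd.cut and pd.qcut from the table above differ in how they draw bin edges — a small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 48, 55, 62])

uniform = pd.cut(ages, bins=4)    # equal-width bins across the value range
quantile = pd.qcut(ages, q=4)     # quantile bins — equal row counts per bin

print(uniform.value_counts(sort=False).tolist())    # uneven counts per bin
print(quantile.value_counts(sort=False).tolist())   # 2 rows in every bin
```

Use pd.cut when the bin boundaries themselves matter (e.g. age brackets) and pd.qcut when you want balanced group sizes (e.g. quartile labels).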

Cheat Sheet

Quick-reference one-liners for daily pandas work
Top / Bottom N per Group
# Top 3 salaries per dept
df.groupby('dept', group_keys=False).apply(
    lambda g: g.nlargest(3, 'salary')
)
JSON Normalize (Nested)
from pandas import json_normalize
df = json_normalize(
    data,
    record_path='items',
    meta=['id', ['user', 'name']]
)
Conditional Column (np.where)
import numpy as np
df['flag'] = np.where(
    df['revenue'] > 1000, 'high', 'low'
)
One-Hot Encode
df = pd.get_dummies(
    df, columns=['dept'],
    prefix='dept', dtype=int
)
Running Percentage
df['pct_total'] = (
    df['revenue'].cumsum()
    / df['revenue'].sum() * 100
)
Flatten MultiIndex Cols
df.columns = [
    '_'.join(col).strip('_')
    for col in df.columns.values
]
📚 Further Reading: Official docs at pandas.pydata.org. For interactive EDA, try ydata-profiling (ProfileReport(df)). For faster groupby/merge on large data consider polars, dask, or duckdb.query("SELECT ... FROM df").