Data Analysis in Python

Pandas
Handbook

A comprehensive reference for data scientists and analysts — covering DataFrames, Series, I/O, cleaning, transformation, aggregation, time series, and production patterns with pandas v2.x.


Introduction

What pandas is, when to use it, and its core philosophy

Pandas is the foundational library for data analysis and manipulation in Python. Built on top of NumPy, it provides two primary data structures — DataFrame (2D table) and Series (1D labeled array) — along with a vast toolkit for reading, cleaning, transforming, and aggregating data.

Labeled Data

Rows and columns have labels (indices), making data alignment automatic and joins intuitive — unlike raw NumPy arrays.

Expressive API

From method chaining to SQL-like queries, pandas offers multiple styles. Operations that take dozens of lines in other languages are one-liners.

Ecosystem Hub

Every major Python data and ML library — scikit-learn, XGBoost, Plotly, seaborn — accepts DataFrames as first-class input.
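The automatic index alignment described under Labeled Data can be sketched in a few lines — adding two Series whose indices differ in order and membership (the labels here are made up for illustration):

```python
import pandas as pd

q1 = pd.Series({'Eng': 10, 'Mktg': 5})
q2 = pd.Series({'Mktg': 7, 'Eng': 3, 'Finance': 2})

# '+' aligns on labels, not positions: Eng = 10 + 3, Mktg = 5 + 7
total = q1 + q2
print(total['Eng'], total['Mktg'])          # 13.0 12.0

# Labels missing on one side become NaN unless you supply a fill value
print(pd.isna(total['Finance']))            # True
print(q1.add(q2, fill_value=0)['Finance'])  # 2.0
```

With raw NumPy arrays the same addition would silently pair elements by position; here the labels decide what gets added to what.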

🆕 Pandas v2.x: Introduces Copy-on-Write (CoW) semantics for better memory behavior, improved Arrow-backed dtypes, and nullable integer/boolean types. Set pd.options.mode.copy_on_write = True to opt in early.
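A minimal sketch of what CoW changes, assuming pandas ≥ 2.0 (the column name is illustrative):

```python
import pandas as pd

pd.options.mode.copy_on_write = True  # opt in (default behavior from pandas 3.0)

df = pd.DataFrame({'a': [1, 2, 3]})
col = df['a']      # no eager copy — df and col share memory for now
col.iloc[0] = 99   # the first write copies just this column

print(col.iloc[0])     # 99
print(df.loc[0, 'a'])  # 1 — the parent DataFrame is untouched
```

Without CoW, whether `col` was a view or a copy of `df` was hard to predict; with it, any derived object behaves like a copy the moment you write to it.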

Installation

pip, conda, and optional dependencies
bash
# Minimal install
pip install pandas

# Full data science stack
pip install pandas numpy matplotlib seaborn openpyxl xlrd pyarrow fastparquet

# All optional dependencies
pip install "pandas[all]"

# Conda
conda install -c conda-forge pandas

Optional Dependencies by Task

Task | Package(s)
Excel read/write | openpyxl, xlrd, xlwt
Parquet | pyarrow or fastparquet
SQL databases | sqlalchemy, pymysql, psycopg2
Arrow-backed dtypes | pyarrow ≥ 7.0
HTML parsing | lxml, html5lib, beautifulsoup4
HDF5 | tables (PyTables)
Compression | zlib, bz2, lz4 (stdlib or pip)
python
import pandas as pd
import numpy as np

print(pd.__version__)   # e.g. 2.2.1
pd.show_versions()      # full dependency report

Quick Start

Create, load, explore — your first DataFrame in minutes
python
import pandas as pd
import numpy as np

# ── Create from a dictionary ──
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'age': [29, 34, 27, 41],
    'salary': [82000, 95000, 71000, 110000],
    'dept': ['Eng', 'Mktg', 'Eng', 'Finance'],
})
print(df)
#     name  age  salary     dept
# 0  Alice   29   82000      Eng
# 1    Bob   34   95000     Mktg
# 2  Carol   27   71000      Eng
# 3   Dave   41  110000  Finance

# ── Quick exploration ──
df.head(3)      # first 3 rows
df.tail(2)      # last 2 rows
df.shape        # (4, 4)
df.dtypes       # column types
df.describe()   # summary statistics
df.info()       # memory, nulls, dtypes

Sample DataFrame Output

    name  age  salary     dept
0  Alice   29   82000      Eng
1    Bob   34   95000     Mktg
2  Carol   27   71000      Eng
3   Dave   41  110000  Finance

dtypes: object(2), int64(2) · memory: 352 bytes

Data Structures

DataFrame, Series, Index — and how they relate
DataFrame

A 2D labeled table — the core data structure. Think of it as a dict of Series sharing the same index. Columns can have different dtypes.

  • df.columns — column labels
  • df.index — row labels (RangeIndex by default)
  • df.values — underlying NumPy array
  • df.shape — (rows, cols) tuple
Series

A 1D labeled array — a single column (or row) of a DataFrame. Has both values and an index.

  • s.values — underlying array
  • s.index — labels
  • s.name — column name when extracted from df
  • s.dtype — data type
python
# ── Series ──
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='values')
s['b']       # → 20
s[s > 15]    # → Series(b=20, c=30)

# ── DataFrame creation patterns ──
# From list of dicts
df = pd.DataFrame([{'x': 1, 'y': 2}, {'x': 3, 'y': 4}])

# From NumPy array
df = pd.DataFrame(np.random.randn(5, 3), columns=['A', 'B', 'C'])

# With custom index
df = pd.DataFrame(data, index=pd.date_range('2024-01-01', periods=5))

# Column extraction returns a Series
s_age = df['age']                 # Series
df_sub = df[['name', 'salary']]   # DataFrame (double brackets)

Read & Write

CSV, Excel, JSON, Parquet, SQL, HTML, Clipboard

Reading Data

python
# ── CSV ──
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv',
    sep=';',                 # delimiter
    header=0,                # row to use as header
    index_col='id',          # set as index
    usecols=['a', 'b'],      # load only these columns
    dtype={'id': 'Int64'},   # explicit types
    parse_dates=['date'],    # parse date columns
    nrows=10000,             # read first N rows
    skiprows=2,              # skip header rows
    na_values=['N/A', '-'],  # custom null markers
)

# ── Excel ──
df = pd.read_excel('report.xlsx', sheet_name='Sales', engine='openpyxl')
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)   # dict of DFs

# ── Parquet (preferred for large data) ──
df = pd.read_parquet('data.parquet', columns=['col1', 'col2'])

# ── JSON ──
df = pd.read_json('data.json', orient='records')

# ── SQL ──
from sqlalchemy import create_engine
engine = create_engine('postgresql+psycopg2://user:pw@host/db')
df = pd.read_sql("SELECT * FROM orders WHERE status = 'active'", engine)

# ── URL / remote ──
df = pd.read_csv('https://example.com/data.csv')

Writing Data

python
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', index=False, sheet_name='Results')
df.to_parquet('output.parquet', compression='snappy')
df.to_json('output.json', orient='records', indent=2)
df.to_sql('table_name', engine, if_exists='replace', index=False)

# Multiple sheets in one Excel file
with pd.ExcelWriter('multi.xlsx', engine='openpyxl') as writer:
    df_sales.to_excel(writer, sheet_name='Sales', index=False)
    df_costs.to_excel(writer, sheet_name='Costs', index=False)
    df_kpi.to_excel(writer, sheet_name='KPIs', index=False)
✅ Use Parquet for large datasets. It is 3–10× faster to read than CSV, stores data types, supports column pruning, and is the default format for most modern data platforms (Spark, DuckDB, Snowflake).

Inspect & Profile

Understanding your DataFrame before doing anything
python
# ── Basic inspection ──
df.shape                        # (rows, cols)
df.dtypes                       # dtype per column
df.info(memory_usage='deep')    # dtypes + non-null counts + memory
df.describe()                   # count, mean, std, min, quartiles, max
df.describe(include='all')      # includes object/categorical columns
df.describe(include='object')   # string columns only

# ── Missing value audit ──
df.isnull().sum()          # null count per column
df.isnull().mean() * 100   # null percentage per column
df.notnull().all(axis=1)   # True for rows with no nulls

# ── Unique values ──
df['dept'].unique()                       # array of unique values
df['dept'].nunique()                      # count of unique values
df['dept'].value_counts()                 # frequency table
df['dept'].value_counts(normalize=True)   # proportions

# ── Column stats ──
df.corr(numeric_only=True)   # correlation matrix
df['salary'].skew()          # skewness
df['salary'].quantile([.25, .5, .75])

# ── Memory ──
df.memory_usage(deep=True)                   # bytes per column
df.memory_usage(deep=True).sum() / 1024**2   # total MB

Selection & Indexing

loc, iloc, at, iat — the four indexers
.loc[] — Label-based

Select by row/column labels. Inclusive on both ends when slicing.

.iloc[] — Position-based

Select by integer position. Exclusive end, like standard Python slicing.

python
# ── Column selection ──
df['salary']             # single column → Series
df[['name', 'salary']]   # multiple columns → DataFrame

# ── .loc[row_label, col_label] ──
df.loc[0]                 # row 0 (by label)
df.loc[0:2]               # rows 0,1,2 inclusive
df.loc[0, 'name']         # scalar value
df.loc[:, 'age':'dept']   # all rows, col slice by label
df.loc[df['age'] > 30]    # boolean mask as row selector
df.loc[df['age'] > 30, ['name', 'salary']]   # mask + cols

# ── .iloc[row_int, col_int] ──
df.iloc[0]              # first row
df.iloc[0:3]            # rows 0,1,2 (exclusive end)
df.iloc[0, 2]           # row 0, col 2 (scalar)
df.iloc[-5:]            # last 5 rows
df.iloc[:, [0, 2, 3]]   # cols by position list

# ── Fast scalar access ──
df.at[0, 'name']   # label-based scalar (faster than loc)
df.iat[0, 0]       # position-based scalar (fastest)
⚠️ Avoid chained indexing: df['col'][0] = value triggers a chained-assignment warning in pandas v2 and may write to a temporary copy instead of df (under Copy-on-Write it never reaches the original). Always use df.loc[0, 'col'] = value to write values back reliably.
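To make the difference concrete, a small sketch with CoW enabled as recommended above:

```python
import pandas as pd

pd.options.mode.copy_on_write = True
df = pd.DataFrame({'col': [1, 2, 3]})

# Chained form — two separate indexing calls; the write lands on an
# intermediate object and never reaches df under Copy-on-Write:
# df['col'][0] = 99

# Single .loc call — one indexing operation that writes directly into df:
df.loc[0, 'col'] = 99
print(df.loc[0, 'col'])   # 99
```

The rule of thumb: any time a `[` appears twice on the left of an `=`, rewrite it as one `.loc[...]` call.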

Filtering

Boolean masks, query(), isin(), between()
python
# ── Boolean masks ──
df[df['age'] > 30]
df[(df['age'] > 30) & (df['dept'] == 'Eng')]
df[(df['salary'] >= 80000) | (df['dept'] == 'Finance')]
df[~(df['dept'] == 'Mktg')]   # NOT

# ── .query() — readable SQL-like strings ──
df.query("age > 30 and dept == 'Eng'")
df.query("salary.between(70000, 100000)")   # requires pandas >= 2.0
threshold = 80000
df.query("salary > @threshold")   # @ references Python vars

# ── .isin() — membership test ──
df[df['dept'].isin(['Eng', 'Finance'])]
df[~df['dept'].isin(['Mktg'])]   # exclude

# ── .between() ──
df[df['salary'].between(70000, 100000)]   # inclusive by default

# ── .str contains / startswith ──
df[df['name'].str.startswith('A')]
df[df['name'].str.contains('li', case=False, na=False)]

# ── Null filtering ──
df[df['salary'].notna()]   # exclude NaN rows
df[df['salary'].isna()]    # only NaN rows

Missing Data

Detect, drop, fill, and interpolate NaN values
python
# ── Detect ──
df.isnull().sum()   # nulls per column
(df.isnull().sum() / len(df) * 100).round(2)   # percentage

# ── Drop ──
df.dropna()                    # drop rows with any NaN
df.dropna(how='all')           # drop rows where ALL values are NaN
df.dropna(subset=['salary'])   # drop rows where 'salary' is NaN
df.dropna(axis=1, thresh=3)    # drop cols with fewer than 3 non-null values

# ── Fill ──
df.fillna(0)   # fill all NaN with 0
df.fillna({'age': df['age'].median(), 'dept': 'Unknown'})
df['salary'].fillna(df['salary'].mean())
df.ffill()     # forward fill (propagate last valid value)
df.bfill()     # backward fill

# ── Interpolate (numeric columns) ──
df['price'].interpolate(method='linear')
df['price'].interpolate(method='time')   # requires DatetimeIndex

Data Types

Casting, category dtype, nullable types, and Arrow backend
python
# ── Casting ──
df['age'] = df['age'].astype('int32')   # reduce memory
df['salary'] = df['salary'].astype('float32')
df['flag'] = df['flag'].astype('bool')

# ── Category dtype (for low-cardinality strings) ──
df['dept'] = df['dept'].astype('category')   # huge memory savings
df['dept'].cat.categories                    # view categories
df['dept'].cat.add_categories('HR')

# ── Nullable integer / boolean (pandas v1+) ──
df['age'] = df['age'].astype('Int64')   # capital I — supports NaN
df['active'] = df['active'].astype('boolean')

# ── Arrow-backed dtypes (pandas v2+, requires pyarrow) ──
df = df.convert_dtypes(dtype_backend='pyarrow')   # fast + null-safe

# ── Bulk numeric conversion ──
df = df.apply(pd.to_numeric, errors='ignore')   # leaves non-numeric cols as-is;
                                                # deprecated in 2.2 — prefer
                                                # errors='coerce' per column

# ── Type-safe numeric conversion ──
df['score'] = pd.to_numeric(df['score'], errors='coerce')   # bad values → NaN
df['created'] = pd.to_datetime(df['created'], errors='coerce')

Apply & Map

apply, map, applymap, assign, pipe — transformation patterns
python
# ── Series.map() — element-wise on Series ──
df['dept_code'] = df['dept'].map({'Eng': 1, 'Mktg': 2, 'Finance': 3})
df['name_upper'] = df['name'].map(str.upper)
df['tax'] = df['salary'].map(lambda x: x * 0.3)

# ── DataFrame.apply() — apply function along axis ──
df.apply(lambda col: col.max() - col.min(), axis=0)        # column-wise
df.apply(lambda row: row['salary'] / row['age'], axis=1)   # row-wise

# ── DataFrame.map() (v2.1+, replaces applymap) — element-wise ──
df_nums.map(lambda x: round(x, 2))

# ── .assign() — add or overwrite columns, chainable ──
df = (df
    .assign(
        salary_k = df['salary'] / 1000,
        tax = lambda d: d['salary'] * 0.3,
        net_pay = lambda d: d['salary'] - d['tax'],
        seniority = lambda d: pd.cut(d['age'], bins=[0, 30, 40, 100],
                                     labels=['junior', 'mid', 'senior']),
    )
)

# ── .pipe() — chain custom functions ──
def add_bonus(df, rate=0.1):
    return df.assign(bonus=df['salary'] * rate)

df = df.pipe(add_bonus, rate=0.15).pipe(some_other_fn)
⚡ Performance: Prefer vectorized operations (df['col'] * 2) over .apply() whenever possible. .apply() uses a Python loop internally and can be 10–100× slower on large DataFrames.
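A rough timing sketch of that gap (exact numbers vary by machine; the column name is illustrative):

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.rand(1_000_000)})

t0 = time.perf_counter()
vectorized = df['col'] * 2                    # one C-level NumPy operation
t_vec = time.perf_counter() - t0

t0 = time.perf_counter()
applied = df['col'].apply(lambda x: x * 2)    # one Python call per element
t_app = time.perf_counter() - t0

print(f"vectorized: {t_vec:.4f}s  apply: {t_app:.4f}s")
assert vectorized.equals(applied)             # same result, very different cost
```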

String Operations

The .str accessor — vectorized string methods
python
s = df['name']

# ── Case ──
s.str.lower()
s.str.upper()
s.str.title()

# ── Trim & pad ──
s.str.strip()                       # strip whitespace
s.str.lstrip('$')
s.str.rstrip()
s.str.pad(width=10, fillchar='0')   # zero-pad
s.str.zfill(5)

# ── Search & replace ──
s.str.contains('ali', case=False, na=False)   # → bool Series
s.str.startswith('A')
s.str.endswith('e')
s.str.replace('Eng', 'Engineering', regex=False)
s.str.replace(r'\d+', 'NUM', regex=True)      # regex

# ── Split & extract ──
s.str.split(',', expand=True)   # returns DataFrame
s.str.split(',').str[0]         # first part
s.str.extract(r'(\d{4})')       # capture group → column
s.str.extractall(r'(\w+)')      # all matches

# ── Length & count ──
s.str.len()
s.str.count('a')    # count occurrences of 'a'

# ── Slice ──
s.str[0:3]    # first 3 chars

DateTime

The .dt accessor, date ranges, offsets, and resampling
python
# ── Parse dates ──
df['date'] = pd.to_datetime(df['date_str'])
df['date'] = pd.to_datetime(df['date_str'], format='%d/%m/%Y')

# ── .dt accessor ──
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()   # 'Monday' etc.
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['is_month_end'] = df['date'].dt.is_month_end

# ── Date ranges ──
dates = pd.date_range('2024-01-01', periods=365, freq='D')
bdays = pd.bdate_range('2024-01-01', '2024-12-31')   # business days

# ── Timedelta arithmetic ──
df['days_since'] = (pd.Timestamp.now() - df['date']).dt.days
df['deadline'] = df['date'] + pd.DateOffset(months=3)

# ── Timezone ──
df['date'] = df['date'].dt.tz_localize('UTC')
df['date'] = df['date'].dt.tz_convert('Asia/Kolkata')

Reshape & Pivot

melt, pivot_table, stack, unstack, explode
python
# ── pivot_table — SQL GROUP BY equivalent ──
pivot = df.pivot_table(
    values='salary',
    index='dept',
    columns='seniority',
    aggfunc=['mean', 'count'],
    fill_value=0,
)

# ── melt — wide to long (unpivot) ──
df_long = df.melt(
    id_vars=['name', 'dept'],
    value_vars=['q1_sales', 'q2_sales', 'q3_sales'],
    var_name='quarter',
    value_name='sales',
)

# ── pivot — long to wide ──
df_wide = df_long.pivot(index='name', columns='quarter', values='sales')

# ── stack / unstack ──
stacked = df.stack()       # cols → row index level
unstacked = df.unstack()   # last index level → cols

# ── explode — lists in cells → rows ──
df_exp = df.explode('tags')   # df['tags'] contains lists

# ── crosstab — frequency table ──
ct = pd.crosstab(df['dept'], df['seniority'], normalize='index')

GroupBy

Split-apply-combine — the most powerful pandas pattern
python
# ── Basic aggregation ──
df.groupby('dept')['salary'].mean()
df.groupby('dept')['salary'].agg(['mean', 'median', 'std', 'count'])

# ── Multiple columns & named aggs ──
df.groupby('dept').agg(
    avg_salary = ('salary', 'mean'),
    max_age = ('age', 'max'),
    headcount = ('name', 'count'),
    salary_std = ('salary', 'std'),
)

# ── Multiple group keys ──
df.groupby(['dept', 'seniority'])['salary'].mean()

# ── transform — broadcast result back to original shape ──
df['dept_avg'] = df.groupby('dept')['salary'].transform('mean')
df['salary_zscore'] = df.groupby('dept')['salary'].transform(
    lambda x: (x - x.mean()) / x.std()
)

# ── filter — keep groups that pass a condition ──
df.groupby('dept').filter(lambda g: g['salary'].mean() > 85000)

# ── apply — arbitrary function per group ──
def top_earners(group, n=2):
    return group.nlargest(n, 'salary')

df.groupby('dept', group_keys=False).apply(top_earners)

# ── Rank within group ──
df['dept_rank'] = df.groupby('dept')['salary'].rank(
    method='dense', ascending=False
)

Merge & Join

pd.merge, DataFrame.join, pd.concat
python
# ── pd.merge — SQL-style joins ──
# INNER JOIN (default)
result = pd.merge(df_left, df_right, on='id')

# LEFT / RIGHT / OUTER JOIN
result = pd.merge(df_left, df_right, on='id', how='left')
result = pd.merge(df_left, df_right, on='id', how='outer')

# Different column names
result = pd.merge(df_orders, df_customers,
                  left_on='customer_id', right_on='id', how='left')

# Multi-key join
result = pd.merge(df1, df2, on=['dept', 'year'])

# Suffix for overlapping column names
result = pd.merge(df1, df2, on='id', suffixes=('_prev', '_curr'))

# ── pd.concat — stacking DataFrames ──
df_all = pd.concat([df1, df2, df3], ignore_index=True)   # row-wise
df_wide = pd.concat([df1, df2], axis=1)                  # column-wise
df_all = pd.concat([df1, df2], keys=['2023', '2024'])    # multi-index

# ── Validate joins (detect duplicates / missing) ──
pd.merge(df1, df2, on='id', validate='m:1')    # raises if not many-to-one
pd.merge(df1, df2, on='id', indicator=True)    # adds _merge column

Window Functions

rolling, expanding, ewm — moving statistics
python
# ── Rolling window ──
df['ma7'] = df['revenue'].rolling(window=7).mean()   # 7-day MA
df['roll_std'] = df['revenue'].rolling(30, min_periods=7).std()
df['ma30'] = df['revenue'].rolling('30D').mean()     # time-offset

# ── Expanding window (cumulative) ──
df['cum_max'] = df['revenue'].expanding().max()
df['cum_sum'] = df['revenue'].cumsum()
df['cum_pct'] = df['revenue'].cumsum() / df['revenue'].sum()

# ── Exponentially Weighted Moving Average ──
df['ewma'] = df['revenue'].ewm(span=12, adjust=False).mean()

# ── Rolling corr / cov ──
df['roll_corr'] = df['a'].rolling(60).corr(df['b'])

# ── Resample time series ──
df_monthly = df.resample('ME').sum()   # month-end
df_weekly = df.resample('W').agg({     # per-column agg
    'revenue': 'sum',
    'users': 'mean',
})

Use Case: EDA Workflow

Systematic exploratory data analysis on a new dataset
python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('sales.csv', parse_dates=['date'])

# ── 1. Shape & types ──
print(df.shape)
print(df.dtypes)

# ── 2. Missing value audit ──
missing = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing[missing > 0])

# ── 3. Summary stats ──
print(df.describe(percentiles=[.05, .25, .5, .75, .95]))

# ── 4. Target distribution ──
df['revenue'].plot(kind='hist', bins=50, title='Revenue Distribution')
plt.show()

# ── 5. Categorical value counts ──
for col in df.select_dtypes('object').columns:
    print(df[col].value_counts().head(10))

# ── 6. Correlation matrix ──
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()

# ── 7. Temporal trend ──
df.set_index('date')['revenue'].resample('ME').sum().plot(title='Monthly Revenue')
plt.show()

Use Case: Data Cleaning Pipeline

Production-grade cleaning with method chaining

Build a reproducible cleaning pipeline using method chaining with .pipe() — each function has a single responsibility and can be tested independently.

python
import pandas as pd
import numpy as np

def clean_column_names(df):
    """Lowercase, strip, replace spaces with underscores."""
    df.columns = (df.columns
        .str.strip()
        .str.lower()
        .str.replace(r'[\s\-/]+', '_', regex=True)
        .str.replace(r'[^a-z0-9_]', '', regex=True)
    )
    return df

def drop_high_null_cols(df, threshold=0.8):
    """Drop columns with more than `threshold` fraction of nulls."""
    null_frac = df.isnull().mean()
    return df.drop(columns=null_frac[null_frac > threshold].index)

def deduplicate(df, subset=None):
    """Remove duplicate rows."""
    before = len(df)
    df = df.drop_duplicates(subset=subset)
    print(f"Removed {before - len(df)} duplicates")
    return df

def fix_dtypes(df):
    """Infer and cast types."""
    df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
    df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
    df['status'] = df['status'].astype('category')
    return df

def remove_outliers(df, col, n_std=3):
    """Remove rows where `col` is beyond n_std standard deviations."""
    mean, std = df[col].mean(), df[col].std()
    return df[(df[col] >= mean - n_std * std) & (df[col] <= mean + n_std * std)]

# ── Apply the cleaning pipeline ──
df_clean = (
    pd.read_csv('raw_data.csv')
    .pipe(clean_column_names)
    .pipe(drop_high_null_cols, threshold=0.7)
    .pipe(deduplicate)
    .pipe(fix_dtypes)
    .pipe(remove_outliers, col='amount')
    .dropna(subset=['amount', 'created_at'])
    .reset_index(drop=True)
)
df_clean.to_parquet('clean_data.parquet')

Use Case: Time Series Analysis

Daily sales data — resampling, rolling stats, YoY comparison
python
import pandas as pd

df = pd.read_csv('daily_sales.csv', parse_dates=['date'], index_col='date')

# ── Resampling ──
weekly = df['revenue'].resample('W').sum()
monthly = df['revenue'].resample('ME').agg(['sum', 'mean', 'max'])

# ── Moving averages ──
df['ma7'] = df['revenue'].rolling(7).mean()
df['ma30'] = df['revenue'].rolling(30).mean()

# ── YoY growth ──
monthly['yoy'] = monthly['sum'].pct_change(periods=12) * 100

# ── Lag features (for ML) ──
df['lag_1'] = df['revenue'].shift(1)
df['lag_7'] = df['revenue'].shift(7)
df['lag_30'] = df['revenue'].shift(30)

# ── Day-of-week patterns ──
df['dow'] = df.index.day_name()
dow_avg = df.groupby('dow')['revenue'].mean().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

# ── Fill missing dates (business days) ──
full_idx = pd.bdate_range(df.index.min(), df.index.max())
df = df.reindex(full_idx).ffill()

Use Case: Reporting & Export

Styled Excel reports, aggregated summaries, formatted output
python
import pandas as pd

# ── Build summary table ──
summary = (df
    .groupby(['region', 'product'])
    .agg(
        total_revenue = ('revenue', 'sum'),
        avg_order = ('revenue', 'mean'),
        order_count = ('order_id', 'count'),
    )
    .reset_index()
    .sort_values('total_revenue', ascending=False)
)

# ── Styled DataFrame (Jupyter/HTML) ──
(summary.style
    .format({'total_revenue': '${:,.0f}', 'avg_order': '${:,.2f}'})
    .background_gradient(subset=['total_revenue'], cmap='YlGn')
    .bar(subset=['order_count'], color='#5dbcd2')
    .highlight_max(subset=['total_revenue'], color='#ffe0b2')
    .set_caption('Sales Summary by Region & Product')
)

# ── Multi-sheet Excel export ──
with pd.ExcelWriter('monthly_report.xlsx', engine='openpyxl') as writer:
    summary.to_excel(writer, sheet_name='Summary', index=False)
    df.to_excel(writer, sheet_name='Raw Data', index=False)
    dow_avg.to_excel(writer, sheet_name='DOW Patterns')

# ── Console-friendly display ──
pd.set_option('display.float_format', '{:,.2f}'.format)
pd.set_option('display.max_columns', 20)
pd.set_option('display.width', 120)

Performance Tips

Memory optimization, vectorization, and when to leave pandas
Memory Reduction
  • Use category for low-cardinality strings
  • Downcast numerics: int64→int32, float64→float32
  • Use read_csv(usecols=...) to load only needed cols
  • Read in chunks: chunksize=100_000
  • Store as Parquet, not CSV
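The category conversion from the first bullet can be checked directly — a sketch with a synthetic low-cardinality column (values and sizes are illustrative):

```python
import numpy as np
import pandas as pd

# 1M rows but only 4 distinct strings — ideal for the category dtype
s_obj = pd.Series(np.random.choice(['Eng', 'Mktg', 'Finance', 'HR'],
                                   size=1_000_000))
s_cat = s_obj.astype('category')

obj_mb = s_obj.memory_usage(deep=True) / 1024**2
cat_mb = s_cat.memory_usage(deep=True) / 1024**2
print(f"object: {obj_mb:.1f} MB  category: {cat_mb:.1f} MB")
# category stores each distinct string once plus small integer codes per row
```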
Speed Patterns
  • Vectorized ops > .apply() > Python loops
  • .query() faster than boolean indexing on large DFs
  • Use numba or cython for custom row functions
  • scikit-learn's HistGradientBoosting handles NaN natively (skip the imputation step)
  • Consider DuckDB, Polars, or Dask for >1GB data
python
# ── Read CSV in chunks ──
chunks = pd.read_csv('huge.csv', chunksize=100_000)
result = pd.concat(
    [chunk[chunk['status'] == 'active'] for chunk in chunks],
    ignore_index=True
)

# ── Downcasting loop ──
for col in df.select_dtypes(['int64']).columns:
    df[col] = pd.to_numeric(df[col], downcast='integer')
for col in df.select_dtypes(['float64']).columns:
    df[col] = pd.to_numeric(df[col], downcast='float')

# ── np.select over nested apply ──
conditions = [df['salary'] > 100000, df['salary'] > 70000]
choices = ['high', 'mid']
df['band'] = np.select(conditions, choices, default='low')

# ── eval() for large expression chains ──
df.eval('net = revenue - cost', inplace=True)
df.eval('margin = (revenue - cost) / revenue * 100', inplace=True)
🚀 Beyond pandas: For datasets > 1 GB, consider Polars (Rust-based, lazy execution, 5–20× faster), DuckDB (SQL on DataFrames in-process), or Dask (distributed pandas). All export back to pandas DataFrames seamlessly.

API Reference

Core methods organized by category

DataFrame β€” Essential Methods

Method | Category | Description
df.head(n) / df.tail(n) | Inspect | First / last n rows
df.info() | Inspect | Schema, dtypes, null counts, memory
df.describe() | Inspect | Summary statistics
df.shape / df.dtypes | Inspect | Dimensions and column types
df.loc[] / df.iloc[] | Select | Label / position indexing
df.query() | Select | SQL-like filter expression
df.assign() | Transform | Add/overwrite columns (chainable)
df.apply() | Transform | Apply function along axis
df.map() | Transform | Element-wise function (v2.1+)
df.pipe() | Transform | Chain custom functions
df.rename() | Transform | Rename columns / index
df.drop() | Transform | Remove rows or columns
df.sort_values() | Sort | Sort by one or more columns
df.sort_index() | Sort | Sort by index
df.groupby() | Aggregate | Split-apply-combine
df.pivot_table() | Aggregate | Spreadsheet-style pivot
df.melt() | Reshape | Wide → long (unpivot)
df.fillna() / df.dropna() | Missing | Fill or remove NaN
df.drop_duplicates() | Clean | Remove duplicate rows
df.astype() | Dtypes | Cast column dtype
df.merge() | Combine | SQL-style join
df.set_index() / df.reset_index() | Index | Promote column to index / vice versa
df.rolling() / df.expanding() | Window | Sliding / cumulative window
df.resample() | Window | Time-based resampling
df.to_csv() / df.to_parquet() | I/O | Write to file
df.style | Display | Styled Jupyter / HTML output
df.memory_usage() | Perf | Memory per column
df.eval() | Perf | Evaluate expression (fast)

Useful Top-Level Functions

Function | Description
pd.read_csv() / pd.read_excel() | Read tabular files
pd.read_parquet() / pd.read_sql() | Read Parquet / SQL databases
pd.merge() | SQL-style join of two DataFrames
pd.concat() | Stack DataFrames row- or column-wise
pd.to_datetime() | Parse strings/numbers to datetime
pd.to_numeric() | Coerce column to numeric
pd.get_dummies() | One-hot encode categorical columns
pd.cut() / pd.qcut() | Bin continuous values (uniform / quantile)
pd.crosstab() | Frequency cross-tabulation
pd.date_range() / pd.bdate_range() | Generate date sequences
pd.DataFrame.from_records() | Build from list of tuples/dicts
pd.json_normalize() | Flatten nested JSON to DataFrame
pd.options / pd.set_option() | Configure display and behavior
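pd.cut and pd.qcut from the table above differ in how they draw bin edges — a small sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([22, 25, 31, 35, 41, 48, 55, 62])

uniform = pd.cut(ages, bins=4)    # equal-width bins across the value range
quantile = pd.qcut(ages, q=4)     # quantile bins — equal row counts per bin

print(uniform.value_counts(sort=False).tolist())    # uneven counts per bin
print(quantile.value_counts(sort=False).tolist())   # 2 rows in every bin
```

Use pd.cut when the bin boundaries themselves matter (e.g. age brackets) and pd.qcut when you want balanced group sizes (e.g. quartile labels).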

Cheat Sheet

Quick-reference one-liners for daily pandas work
Top / Bottom N per Group
# Top 3 salaries per dept
df.groupby('dept', group_keys=False).apply(
    lambda g: g.nlargest(3, 'salary')
)
JSON Normalize (Nested)
from pandas import json_normalize
df = json_normalize(
    data,
    record_path='items',
    meta=['id', ['user', 'name']]
)
Conditional Column (np.where)
import numpy as np
df['flag'] = np.where(
    df['revenue'] > 1000, 'high', 'low'
)
One-Hot Encode
df = pd.get_dummies(
    df, columns=['dept'],
    prefix='dept', dtype=int
)
Running Percentage
df['pct_total'] = (
    df['revenue'].cumsum()
    / df['revenue'].sum() * 100
)
Flatten MultiIndex Cols
df.columns = [
    '_'.join(col).strip('_')
    for col in df.columns.values
]
📚 Further Reading: Official docs at pandas.pydata.org. For interactive EDA, try ydata-profiling (ProfileReport(df)). For faster groupby/merge on large data consider polars, dask, or duckdb.query("SELECT ... FROM df").