Pandas
Handbook
A comprehensive reference for data scientists and analysts, covering DataFrames, Series, I/O, cleaning, transformation, aggregation, time series, and production patterns with pandas v2.x.
Introduction
Pandas is the foundational library for data analysis and manipulation in Python. Built on top of NumPy, it provides two primary data structures, DataFrame (a 2D table) and Series (a 1D labeled array), along with a vast toolkit for reading, cleaning, transforming, and aggregating data.
Rows and columns have labels (indices), making data alignment automatic and joins intuitive, unlike raw NumPy arrays.
From method chaining to SQL-like query expressions, pandas supports multiple styles; operations that take dozens of lines in other languages are often one-liners.
Every major Python ML library (scikit-learn, XGBoost, Plotly, seaborn) accepts DataFrames as first-class input.
Copy-on-Write becomes the default behavior in pandas 3.0; set pd.options.mode.copy_on_write = True to opt in early.

Installation
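A minimal sanity check after installing (the pip extras shown in the comment are suggestions, pick the ones matching the tasks in the table below):

```python
# Install with common optional extras:
#   pip install pandas pyarrow openpyxl
import pandas as pd

print(pd.__version__)  # confirm you are on a 2.x release

# Opt in to Copy-on-Write early
pd.options.mode.copy_on_write = True
```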
Optional Dependencies by Task
| Task | Package(s) |
|---|---|
| Excel read/write | openpyxl (.xlsx); xlrd for legacy .xls |
| Parquet | pyarrow or fastparquet |
| SQL databases | sqlalchemy, pymysql, psycopg2 |
| Arrow-backed dtypes | pyarrow ≥ 7.0 |
| HTML parsing | lxml, html5lib, beautifulsoup4 |
| HDF5 | tables (PyTables) |
| Compression | zlib, bz2, lz4 (stdlib or pip) |
Quick Start
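A minimal sketch that builds the sample table shown below:

```python
import pandas as pd

# Construct a DataFrame from a dict of equal-length columns
df = pd.DataFrame({
    "name":   ["Alice", "Bob", "Carol", "Dave"],
    "age":    [29, 34, 27, 41],
    "salary": [82000, 95000, 71000, 110000],
    "dept":   ["Eng", "Mktg", "Eng", "Finance"],
})
print(df)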
Sample DataFrame Output
| | name | age | salary | dept |
|---|---|---|---|---|
| 0 | Alice | 29 | 82000 | Eng |
| 1 | Bob | 34 | 95000 | Mktg |
| 2 | Carol | 27 | 71000 | Eng |
| 3 | Dave | 41 | 110000 | Finance |
Data Structures
DataFrame
A 2D labeled table, the core data structure. Think of it as a dict of Series sharing the same index. Columns can have different dtypes.
- df.columns: column labels
- df.index: row labels (RangeIndex by default)
- df.values: underlying NumPy array
- df.shape: (rows, cols) tuple
Series
A 1D labeled array: a single column (or row) of a DataFrame. Has both values and an index.
- s.values: underlying array
- s.index: labels
- s.name: column name when extracted from a DataFrame
- s.dtype: data type
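The attributes above in action (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "salary": [82000, 95000]})

# DataFrame attributes
df.columns    # Index(['name', 'salary'], dtype='object')
df.shape      # (2, 2)

# Selecting one column yields a Series
s = df["salary"]
s.name        # 'salary'
s.index       # RangeIndex(start=0, stop=2, step=1)
```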
Read & Write
Reading Data
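A sketch of the common readers; StringIO stands in for a file path so the example is self-contained, and the option names shown are the broadly useful ones:

```python
import pandas as pd
from io import StringIO

# In practice: pd.read_csv("data.csv"), pd.read_parquet("data.parquet"), ...
csv_text = "name,age\nAlice,29\nBob,34\n"
df = pd.read_csv(StringIO(csv_text))

# Useful options: load fewer columns, fix dtypes at read time
df2 = pd.read_csv(
    StringIO(csv_text),
    usecols=["name"],            # only needed columns
    dtype={"name": "string"},    # explicit dtype
    # parse_dates=["created_at"],  # parse date columns on read
)
```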
Writing Data
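A round-trip sketch (the temp-file path is only for the example; index=False is the usual choice so the RangeIndex is not written as a column):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [29, 34]})

path = os.path.join(tempfile.mkdtemp(), "out.csv")
df.to_csv(path, index=False)   # skip writing the RangeIndex
back = pd.read_csv(path)

# Parquet needs pyarrow or fastparquet installed:
# df.to_parquet("out.parquet", index=False)
```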
Inspect & Profile
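The usual first-look calls, sketched on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"age": [29, 34, 27, 41],
                   "dept": ["Eng", "Mktg", "Eng", "Finance"]})

df.head(2)                       # first rows
df.info()                        # dtypes, non-null counts, memory
stats = df.describe()            # count, mean, std, min, quartiles, max
counts = df["dept"].value_counts()  # frequency of each category
```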
Selection & Indexing
.loc[] (label-based): select by row/column labels. Inclusive on both ends when slicing.
.iloc[] (position-based): select by integer position. Exclusive end, like standard Python slicing.
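The inclusive/exclusive slicing difference, sketched with string labels:

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [29, 34, 27], "salary": [82000, 95000, 71000]},
    index=["alice", "bob", "carol"],
)

a = df.loc["alice", "age"]    # label-based scalar access
b = df.iloc[0, 0]             # same cell, by position
sub = df.loc["alice":"bob"]   # label slice: INCLUDES "bob" (2 rows)
sub2 = df.iloc[0:1]           # position slice: exclusive end (1 row)
```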
Chained assignment like df['col'][0] = value is unreliable: under Copy-on-Write it never modifies df, and in earlier versions it triggers a SettingWithCopyWarning. Always use df.loc[0, 'col'] = value to write values back reliably.

Filtering
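The main filtering styles, sketched side by side (remember to parenthesize each condition when combining with & | ~):

```python
import pandas as pd

df = pd.DataFrame({"age": [29, 34, 27, 41],
                   "dept": ["Eng", "Mktg", "Eng", "Finance"]})

eng = df[df["dept"] == "Eng"]                              # boolean mask
young_eng = df[(df["age"] < 30) & (df["dept"] == "Eng")]   # combined mask
q = df.query("age < 30 and dept == 'Eng'")                 # same filter, SQL-ish
members = df[df["dept"].isin(["Eng", "Mktg"])]             # membership test
```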
Missing Data
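The core missing-data workflow, sketched on a frame with gaps in both a numeric and a string column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

mask = df.isna()             # boolean DataFrame of missing cells
n_missing = df.isna().sum()  # missing count per column
filled = df.fillna({"a": 0.0, "b": "unknown"})  # per-column fill values
dropped = df.dropna()        # drop rows with any missing value
```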
Data Types
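Typical dtype fixes after a messy load (the example values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"id": ["1", "2", "3"], "dept": ["Eng", "Mktg", "Eng"]})

df["id"] = df["id"].astype("int64")         # string -> integer
df["dept"] = df["dept"].astype("category")  # compact for low cardinality

# Coerce a dirty column: unparseable values become NaN instead of raising
nums = pd.to_numeric(pd.Series(["1", "oops"]), errors="coerce")
```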
Apply & Map
Prefer vectorized operations (e.g. df['col'] * 2) over .apply() whenever possible. .apply() uses a Python loop internally and can be 10–100× slower on large DataFrames.

String Operations
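The .str accessor applies vectorized, NaN-safe string methods; a short sketch:

```python
import pandas as pd

s = pd.Series([" Alice ", "BOB", "carol smith"])

clean = s.str.strip().str.lower()          # chainable
has_space = clean.str.contains(" ")        # boolean mask
parts = clean.str.split(" ", expand=True)  # split into new columns
titled = clean.str.title()
```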
DateTime
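Parsing and the .dt accessor in brief:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2024-01-15", "2024-02-01", "2024-02-20"]))

years = s.dt.year            # component access
months = s.dt.month_name()   # 'January', 'February', ...
idx = pd.date_range("2024-01-01", periods=3, freq="D")  # generate dates
```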
Reshape & Pivot
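A round trip between long and wide form (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["Eng", "Eng", "Mktg"],
    "year":   [2023, 2024, 2024],
    "salary": [80000, 85000, 70000],
})

# long -> wide: one row per dept, one column per year
wide = df.pivot_table(index="dept", columns="year",
                      values="salary", aggfunc="mean")

# wide -> long again (unpivot)
long = wide.reset_index().melt(id_vars="dept", value_name="salary")
```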
GroupBy
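Split-apply-combine in two common forms, including named aggregation:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["Eng", "Mktg", "Eng", "Finance"],
    "salary": [82000, 95000, 71000, 110000],
})

means = df.groupby("dept")["salary"].mean()   # one stat per group
summary = df.groupby("dept").agg(             # named aggregation
    n=("salary", "size"),
    avg_salary=("salary", "mean"),
)
```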
Merge & Join
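A SQL-style join plus row-wise concatenation, sketched on two toy tables:

```python
import pandas as pd

emp = pd.DataFrame({"name": ["Alice", "Bob"], "dept_id": [1, 2]})
dept = pd.DataFrame({"dept_id": [1, 2, 3],
                     "dept": ["Eng", "Mktg", "Finance"]})

inner = emp.merge(dept, on="dept_id", how="inner")  # matching keys only
left = emp.merge(dept, on="dept_id", how="left")    # keep all emp rows
stacked = pd.concat([emp, emp], ignore_index=True)  # row-wise stacking
```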
Window Functions
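Rolling, expanding, and shifting in brief:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

roll = s.rolling(window=3).mean()  # NaN until the window fills
exp = s.expanding().sum()          # cumulative sum
shifted = s.shift(1)               # lag by one step (first becomes NaN)
```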
Use Case: EDA Workflow
Use Case: Data Cleaning Pipeline
Build a reproducible cleaning pipeline using method chaining with .pipe(): each function has a single responsibility and can be tested independently.
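A minimal sketch of that pattern; drop_missing and add_bonus are hypothetical step names chosen for the example:

```python
import pandas as pd

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows with any missing value."""
    return df.dropna()

def add_bonus(df: pd.DataFrame, rate: float) -> pd.DataFrame:
    """Derive a bonus column; extra args pass through .pipe()."""
    return df.assign(bonus=df["salary"] * rate)

raw = pd.DataFrame({"salary": [82000.0, None, 71000.0]})
clean = raw.pipe(drop_missing).pipe(add_bonus, rate=0.1)
```

Each step is a plain function taking and returning a DataFrame, so it can be unit-tested on its own.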
Use Case: Time Series Analysis
Use Case: Reporting & Export
Performance Tips
- Use category dtype for low-cardinality strings
- Downcast numerics: int64 → int32, float64 → float32
- Use read_csv(usecols=...) to load only needed columns
- Read in chunks: chunksize=100_000
- Store as Parquet, not CSV
- Vectorized ops > .apply() > Python loops
- .query() is often faster than boolean indexing on large DataFrames
- Use numba or cython for custom row functions
- Use HistGradientBoosting (accepts NaN natively)
- Consider DuckDB, Polars, or Dask for >1GB data
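A sketch of the first two tips, measuring memory before and after (sizes are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Eng", "Mktg"] * 1000,  # low-cardinality strings
    "score": range(2000),            # defaults to int64
})

before = df.memory_usage(deep=True).sum()
df["dept"] = df["dept"].astype("category")
df["score"] = pd.to_numeric(df["score"], downcast="integer")
after = df.memory_usage(deep=True).sum()
```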
API Reference
DataFrame β Essential Methods
| Method | Category | Description |
|---|---|---|
| df.head(n) / df.tail(n) | Inspect | First / last n rows |
| df.info() | Inspect | Schema, dtypes, null counts, memory |
| df.describe() | Inspect | Summary statistics |
| df.shape / df.dtypes | Inspect | Dimensions and column types |
| df.loc[] / df.iloc[] | Select | Label / position indexing |
| df.query() | Select | SQL-like filter expression |
| df.assign() | Transform | Add/overwrite columns (chainable) |
| df.apply() | Transform | Apply function along axis |
| df.map() | Transform | Element-wise function (v2.1+) |
| df.pipe() | Transform | Chain custom functions |
| df.rename() | Transform | Rename columns / index |
| df.drop() | Transform | Remove rows or columns |
| df.sort_values() | Sort | Sort by one or more columns |
| df.sort_index() | Sort | Sort by index |
| df.groupby() | Aggregate | Split-apply-combine |
| df.pivot_table() | Aggregate | Spreadsheet-style pivot |
| df.melt() | Reshape | Wide → long (unpivot) |
| df.fillna() / df.dropna() | Missing | Fill or remove NaN |
| df.drop_duplicates() | Clean | Remove duplicate rows |
| df.astype() | Dtypes | Cast column dtype |
| df.merge() | Combine | SQL-style join |
| df.set_index() / df.reset_index() | Index | Promote column to index / vice versa |
| df.rolling() / df.expanding() | Window | Sliding / cumulative window |
| df.resample() | Window | Time-based resampling |
| df.to_csv() / df.to_parquet() | I/O | Write to file |
| df.style | Display | Styled Jupyter / HTML output |
| df.memory_usage() | Perf | Memory per column |
| df.eval() | Perf | Evaluate expression (fast) |
Useful Top-Level Functions
| Function | Description |
|---|---|
| pd.read_csv() / pd.read_excel() | Read tabular files |
| pd.read_parquet() / pd.read_sql() | Read Parquet / SQL databases |
| pd.merge() | SQL-style join of two DataFrames |
| pd.concat() | Stack DataFrames row- or column-wise |
| pd.to_datetime() | Parse strings/numbers to datetime |
| pd.to_numeric() | Coerce column to numeric |
| pd.get_dummies() | One-hot encode categorical columns |
| pd.cut() / pd.qcut() | Bin continuous values (uniform / quantile) |
| pd.crosstab() | Frequency cross-tabulation |
| pd.date_range() / pd.bdate_range() | Generate date sequences |
| pd.DataFrame.from_records() | Build from list of tuples/dicts |
| pd.json_normalize() | Flatten nested JSON to DataFrame |
| pd.options / pd.set_option() | Configure display and behavior |
Cheat Sheet
Full documentation at pandas.pydata.org. For interactive EDA, try ydata-profiling (ProfileReport(df)). For faster groupby/merge on large data, consider Polars, Dask, or DuckDB (duckdb.query("SELECT ... FROM df")).