Extract,
Transform,
Load
The backbone of every data-driven organisation
A practical, no-jargon guide to understanding, designing, and running ETL pipelines — from your first script to production-grade systems. Written for analysts, engineers, and anyone who's ever asked "where does this data actually come from?"
What Is ETL?
ETL stands for Extract, Transform, Load. It is the process of pulling raw data from one or more sources, reshaping it into a usable form, and storing it in a destination where it can be queried and analysed. Think of it as the plumbing of the data world — invisible when it works, catastrophic when it doesn't.
Every time you look at a business dashboard, run a report, or train a machine learning model, there is almost certainly an ETL pipeline somewhere in the chain that made that data available and coherent.
- Read from source
- Databases, APIs, files
- Raw, messy data
- No changes yet
- Clean & validate
- Reshape structure
- Enrich & aggregate
- Business logic
- Write to destination
- Data warehouse, lake
- Structured & ready
- Analysts can query
ETL vs ELT — a quick distinction
- Transform data before loading into the destination
- Lighter storage requirements — only clean data lands
- Better for sensitive data (scrub PII before it touches the warehouse)
- Traditional approach — suited to on-premise data warehouses
- Transformation logic lives in the pipeline (Python, Spark, etc.)
- Load raw data first, transform inside the warehouse using SQL
- Raw data always available for re-processing
- Cloud warehouses (BigQuery, Snowflake, Redshift) are cheap & fast
- dbt (data build tool) is the dominant ELT transform layer
- Better for exploration — analysts can define their own transforms
Why ETL Matters
Data rarely arrives in the format you need. Systems speak different languages — one uses user_id, another uses userId, another uses an integer vs. a UUID. Timestamps come in three timezones. Null values mean different things. Duplicates creep in. ETL solves the "data integration" problem at scale.
ETL consolidates data from dozens of systems into one reliable destination. Sales, marketing, and engineering stop arguing about whose numbers are right — there's one canonical dataset.
Business rules live in the transform stage. Malformed records get caught, logged, and quarantined — not silently corrupting downstream reports. Quality gates run on every pipeline execution.
Pre-aggregated, modelled data in a warehouse is orders of magnitude faster to query than joining raw operational tables. Analysts work with optimised marts, not raw OLTP tables.
The Three Stages
The Extract phase connects to one or more data sources and reads records. The critical rule: never modify the source. You are a reader, not a writer. Sources include relational databases (PostgreSQL, MySQL), APIs (REST, GraphQL), flat files (CSV, JSON, Parquet), cloud storage (S3, GCS), event streams (Kafka), and third-party SaaS products (Salesforce, Stripe, HubSpot).
The Transform phase is where business logic lives. This is the most complex and valuable part of ETL. It includes: cleaning (trim whitespace, fix encodings), validation (check constraints, schema conformance), deduplication, type casting, joining datasets, aggregating, deriving calculated fields, applying business rules ("a churned customer is one with no activity in 90 days"), and privacy operations (hash/mask PII).
The Load phase writes the transformed data to its destination. Key decisions here: full load (truncate and rewrite everything, simple but expensive) vs. incremental load (only new or changed records, efficient but complex). The destination might be a data warehouse, data lake, a reporting database, or a downstream application.
Extract in Detail
Extraction is deceptively difficult. Sources are unreliable, APIs rate-limit you, schemas change without notice, and connections time out. A robust extract layer handles failures gracefully, logs everything, and is idempotent — safe to re-run without double-counting.
Common extraction patterns
| Pattern | How it works | Best for |
|---|---|---|
| Full Extract | Read all records from the source every time | Small tables, no change tracking available |
| Incremental (Timestamp) | Filter by updated_at > last_run_time |
Tables with reliable updated_at columns |
| CDC — Change Data Capture | Read the database transaction log for changes | High-volume tables, near-real-time needs |
| API Pagination | Iterate through pages/cursors of an API response | REST APIs with paginated endpoints |
| File Drop | Watch a folder/bucket for new files, process on arrival | Partner data drops, legacy exports |
Extract example — database + API
Transform in Detail
Transformation is the heart of ETL. This is where raw, messy reality becomes structured, trustworthy data. Transformations range from trivial (renaming a column) to complex (sessionising event streams, slowly changing dimension tracking, multi-table joins with business logic).
- Strip leading/trailing whitespace from strings
- Standardise case (e.g. all email → lowercase)
- Parse and normalise date formats to UTC
- Handle NULL / missing values (drop, impute, or flag)
- Remove duplicate records
- Fix encoding issues (UTF-8 normalisation)
- Rename columns to consistent naming convention
- Cast types (string "123" → integer 123)
- Pivot / unpivot (wide → long, long → wide)
- Flatten nested JSON structures
- Split a column into multiple (e.g. full_name → first + last)
- Combine columns (e.g. city + country → location)
- Join with lookup/dimension tables (e.g. add country name from country_code)
- Calculate derived metrics (revenue = quantity × unit_price)
- Apply segmentation rules (user tier based on spend)
- Aggregate to target grain (daily totals per region)
- Flag anomalies or out-of-range values
- Hash or pseudonymise PII (emails, phone numbers)
- Mask credit card numbers (store only last 4 digits)
- Apply data retention rules (delete records older than N days)
- Tag records with sensitivity classification
- Enforce GDPR deletion requests (right to be forgotten)
Transform example — cleaning & enriching orders
Load in Detail
Loading is the final step — persisting your clean, transformed data. The key decision is your load strategy. Get this wrong and you'll either re-process millions of rows unnecessarily or miss updates entirely.
| Strategy | Description | Pros | Cons | Use When |
|---|---|---|---|---|
| Full Load | Truncate destination, reload everything | Simple | Slow at scale | Small tables, no CDC, dim tables |
| Append Only | Add new rows, never update existing | Very fast | Can't handle updates | Immutable events (clicks, log lines) |
| Upsert (MERGE) | Insert new, update existing by key | Handles changes | Complex queries | Most transactional data |
| SCD Type 2 | Keep full history of all changes with effective dates | Full history | Large storage | Customer, product dimension tables |
| Partition Overwrite | Replace only the affected time partition | Efficient for time data | Date-partitioned tables only | Daily/monthly aggregation tables |
Load example — upsert to PostgreSQL
When to Use ETL
ETL is not always the answer. For a single CSV report, a script works fine. But once you have multiple sources, recurring schedules, SLA requirements, or more than one consumer of the data, ETL pipelines pay their investment back quickly. Here are the most common triggers.
You want a dashboard showing daily sales, by region, by product category, updated every morning. Your data lives in a transactional MySQL database. Direct queries to MySQL are too slow and risk impacting the live site.
- Extract: Pull yesterday's orders from MySQL each night at 2am
- Transform: Join with product and region tables, calculate daily totals
- Load: Write aggregated summary to Snowflake or Redshift
- Result: BI tool (Tableau, Looker) queries the warehouse, not production
Your sales data is in Salesforce, billing in Stripe, support tickets in Zendesk. You want to understand the full customer lifecycle in one place — acquisition cost, revenue, and support cost per customer.
- Extract: Pull from Salesforce API, Stripe API, and Zendesk API on a schedule
- Transform: Map each system's customer identifier to a common
customer_id, standardise date formats, resolve conflicting data - Load: Build a unified customer table in the data warehouse
Training a churn prediction model requires a feature table: one row per customer, with columns for recency of last purchase, frequency, total spend, support ticket count, etc. These come from five different tables.
- Extract: Pull from transactions, sessions, support, and product tables
- Transform: Calculate RFM (Recency, Frequency, Monetary) features, aggregate events into windows (last 7d, 30d, 90d), handle nulls intelligently
- Load: Write feature table to a feature store or ML training data path
Migrating from an old CRM to a new one, or moving from an on-premise Oracle DB to a cloud warehouse. You need to transform the old schema into the new schema, preserving history.
- Map old field names to new ones, handle structural differences
- Apply data quality rules to clean up years of technical debt
- Validate row counts and checksums between source and destination
- Run in parallel during cutover to verify before switching live traffic
Common ETL Patterns
A pipeline is idempotent if running it twice produces the same result as running it once. This is essential — pipelines fail and need to re-run. Design your loads as upserts (not appends), track watermarks/checkpoints, and test re-runs explicitly.
Store the timestamp of the last successful run in a metadata table. The next run reads from that timestamp. If the pipeline fails mid-run, it retries from the last checkpoint — not from scratch and not forward of failed data.
Never write directly from extraction to the final table. Land raw data in a stg_ (staging) table first. Apply transforms to produce int_ (intermediate) tables. Final consumers see fct_ (fact) and dim_ (dimension) tables. This pattern (popularised by dbt) makes debugging vastly easier.
Records that fail validation or transformation go to a "dead letter" location — not silently dropped. Operators review these records, fix upstream data quality issues, and can re-process them. Essential for auditability and compliance.
Type 1: Overwrite old value (no history). Type 2: Add a new row with valid_from / valid_to dates to preserve full history — used for customer address, product price. Type 3: Add a "previous value" column. Type 2 is the most common for analytics.
Large tables (billions of rows) must be partitioned — typically by date. Each pipeline run overwrites only the affected date partition(s). This makes backfills fast, isolates failures, and dramatically reduces query scan costs in columnar warehouses.
Watermark pattern in SQL
Tools & Frameworks
The ETL tooling landscape is large. Here's an honest map — what each tool is actually for, when to use it, and when to skip it. You do not need all of these; start simple and only add complexity when you feel the pain of not having it.
| Tool | Category | Best For | Learning Curve |
|---|---|---|---|
| Pandas | Transform (Python) | Small-to-medium datasets (<10M rows), exploratory pipelines, simple transforms | Low |
| dbt | Transform (SQL / ELT) | SQL-based transforms inside a warehouse; testing, documentation, lineage built-in | Low–Medium |
| Apache Spark | Batch processing | Very large datasets (billions of rows), complex distributed transforms | High |
| Apache Airflow | Orchestration | Scheduling, dependency management, monitoring of multi-step pipelines | Medium |
| Prefect / Dagster | Orchestration (modern) | Airflow alternative; better developer experience, Pythonic, cloud-native | Low–Medium |
| Airbyte / Fivetran | Data ingestion (Extract) | 300+ pre-built connectors; manage the Extract phase without writing code | Low |
| Apache Kafka | Streaming / Event | Real-time, event-driven pipelines; high-throughput streaming ingestion | High |
| Great Expectations | Data Quality | Automated data quality checks; define expectations, validate on every run | Medium |
Python + Pandas for transforms → SQLAlchemy for database connections → Prefect for scheduling → dbt once your warehouse transforms grow beyond a few scripts. Only add Spark when your data genuinely exceeds single-machine capacity. Start simple — complexity is free to add, expensive to remove.Worked Examples
Example 1 — CSV files into a database
A sales team drops a new CSV file into an S3 bucket every morning at 7am. You need to load it into your PostgreSQL warehouse, deduplicated, with invalid rows quarantined.
Example 2 — dbt SQL transform (ELT)
In an ELT pattern, raw data lands in your warehouse first. dbt then transforms it into analytics-ready models using SQL.
Errors & Common Pitfalls
The source system adds, removes, or renames a column. Your pipeline breaks silently or crashes loudly. Fix: Validate schemas at extraction time. Use schema registries for streaming. Write contracts/tests for expected columns.
Source A sends UTC, source B sends local time, source C omits timezone entirely. After joining you have off-by-hours errors. Fix: Store everything as UTC from the first transform step. Convert at query time for display only.
Re-running a pipeline that appends (rather than upserts) double-counts records. A failed run at step 3/4 restarts from step 1 and loads rows twice. Fix: Always use upsert semantics. Make every pipeline idempotent.
A transform drops rows that fail validation without logging them. A month later: "where did these 4,000 customer records go?" Fix: Always write rejected records to a quarantine location with a rejection reason.
You ship a transform that looks right but silently produces wrong aggregations for edge cases (empty strings treated as nulls, etc.). Fix: Unit test transform functions with known inputs and expected outputs. Use dbt tests or Great Expectations for warehouse data.
Database passwords in Python scripts, pushed to git. Credentials in environment variables baked into Docker images. Fix: Always use a secrets manager (AWS Secrets Manager, HashiCorp Vault, environment variables injected at runtime). Never commit secrets.
NULL revenue means zero" vs "NULL revenue means we don't know" produce completely different business insights.Your ETL Maturity Path
ETL maturity is a journey, not a destination. Don't try to build a Spotify-grade data platform on day one. Build what you need, learn what hurts, then level up. Here's the honest path most data teams walk.
- Python or SQL scripts run by hand
- No scheduling — someone remembers
- No error handling or logging
- Credentials in plaintext
- Full re-loads every time
- One person knows how it works
- Scheduled (cron, Airflow, Prefect)
- Error handling + alerting on failure
- Incremental loads where possible
- Secrets in environment / secrets manager
- Invalid records quarantined and logged
- Basic data quality checks pass/fail
- dbt models with tests + documentation
- Data lineage visible end-to-end
- SLA monitoring (pipelines complete by X)
- Row count & statistical anomaly checks
- PII identified, masked, & governed
- Multiple team members can maintain
docs.getdbt.com — the best free SQL transform education · docs.prefect.io — modern orchestration docs · Designing Data-Intensive Applications (Kleppmann) — the canonical systems book · Fundamentals of Data Engineering (Reis & Housley) — end-to-end data engineering overview · greatexpectations.io — automated data quality · github.com/great-expectations/great_expectations for examples.