??????? ETL Handbook — Extract, Transform, Load
V1
Field Guide
ETL — Data Engineering Handbook
DOC-ETL-2024 Beginner → Intermediate
Data Engineering Field Guide

Extract,
Transform,
Load

The backbone of every data-driven organisation

A practical, no-jargon guide to understanding, designing, and running ETL pipelines — from your first script to production-grade systems. Written for analysts, engineers, and anyone who's ever asked "where does this data actually come from?"

Beginner Friendly Python Examples SQL Patterns Real Scenarios Best Practices
01

What Is ETL?

// DEFINITION & INTUITION

ETL stands for Extract, Transform, Load. It is the process of pulling raw data from one or more sources, reshaping it into a usable form, and storing it in a destination where it can be queried and analysed. Think of it as the plumbing of the data world — invisible when it works, catastrophic when it doesn't.

Every time you look at a business dashboard, run a report, or train a machine learning model, there is almost certainly an ETL pipeline somewhere in the chain that made that data available and coherent.

E
Extract
  • Read from source
  • Databases, APIs, files
  • Raw, messy data
  • No changes yet
T
Transform
  • Clean & validate
  • Reshape structure
  • Enrich & aggregate
  • Business logic
L
Load
  • Write to destination
  • Data warehouse, lake
  • Structured & ready
  • Analysts can query
💡
Analogy: Imagine you run a bakery with suppliers in three countries. ETL is like: going to each supplier to pick up raw ingredients (Extract), converting foreign measurements and removing bad produce (Transform), then stocking your shelves in a consistent layout (Load). Your bakers (analysts) now work from clean, organised ingredients.

ETL vs ELT — a quick distinction

📦 ETL — Transform First
  • Transform data before loading into the destination
  • Lighter storage requirements — only clean data lands
  • Better for sensitive data (scrub PII before it touches the warehouse)
  • Traditional approach — suited to on-premise data warehouses
  • Transformation logic lives in the pipeline (Python, Spark, etc.)
🏗️ ELT — Load First (Modern)
  • Load raw data first, transform inside the warehouse using SQL
  • Raw data always available for re-processing
  • Cloud warehouses (BigQuery, Snowflake, Redshift) are cheap & fast
  • dbt (data build tool) is the dominant ELT transform layer
  • Better for exploration — analysts can define their own transforms
02

Why ETL Matters

// THE BUSINESS CASE

Data rarely arrives in the format you need. Systems speak different languages — one uses user_id, another uses userId, another uses an integer vs. a UUID. Timestamps come in three timezones. Null values mean different things. Duplicates creep in. ETL solves the "data integration" problem at scale.

Single Source of Truth

ETL consolidates data from dozens of systems into one reliable destination. Sales, marketing, and engineering stop arguing about whose numbers are right — there's one canonical dataset.

Data Quality Enforcement

Business rules live in the transform stage. Malformed records get caught, logged, and quarantined — not silently corrupting downstream reports. Quality gates run on every pipeline execution.

Analytics at Scale

Pre-aggregated, modelled data in a warehouse is orders of magnitude faster to query than joining raw operational tables. Analysts work with optimised marts, not raw OLTP tables.

Without ETL: Analysts write ad-hoc queries against production databases, slowing them down. Every analyst builds their own version of "customer count" differently. There's no audit trail for data changes. Reports break when source schemas change. This is the chaos ETL prevents.
03

The Three Stages

// OVERVIEW OF EACH PHASE
① Extract — Reading from Sources Read-Only

The Extract phase connects to one or more data sources and reads records. The critical rule: never modify the source. You are a reader, not a writer. Sources include relational databases (PostgreSQL, MySQL), APIs (REST, GraphQL), flat files (CSV, JSON, Parquet), cloud storage (S3, GCS), event streams (Kafka), and third-party SaaS products (Salesforce, Stripe, HubSpot).

② Transform — Reshaping Data Core Logic

The Transform phase is where business logic lives. This is the most complex and valuable part of ETL. It includes: cleaning (trim whitespace, fix encodings), validation (check constraints, schema conformance), deduplication, type casting, joining datasets, aggregating, deriving calculated fields, applying business rules ("a churned customer is one with no activity in 90 days"), and privacy operations (hash/mask PII).

③ Load — Writing to Destination Persist

The Load phase writes the transformed data to its destination. Key decisions here: full load (truncate and rewrite everything, simple but expensive) vs. incremental load (only new or changed records, efficient but complex). The destination might be a data warehouse, data lake, a reporting database, or a downstream application.

04

Extract in Detail

// READING FROM THE WILD

Extraction is deceptively difficult. Sources are unreliable, APIs rate-limit you, schemas change without notice, and connections time out. A robust extract layer handles failures gracefully, logs everything, and is idempotent — safe to re-run without double-counting.

Common extraction patterns

PatternHow it worksBest for
Full Extract Read all records from the source every time Small tables, no change tracking available
Incremental (Timestamp) Filter by updated_at > last_run_time Tables with reliable updated_at columns
CDC — Change Data Capture Read the database transaction log for changes High-volume tables, near-real-time needs
API Pagination Iterate through pages/cursors of an API response REST APIs with paginated endpoints
File Drop Watch a folder/bucket for new files, process on arrival Partner data drops, legacy exports

Extract example — database + API

Python — Incremental Database Extract
import psycopg2 import pandas as pd from datetime import datetime def extract_orders(last_run: datetime) -> pd.DataFrame: """ Extract only orders created/updated since last pipeline run. Safe to re-run — uses >= not > to catch any edge cases. """ conn = psycopg2.connect( host="db.prod.internal", database="ecommerce", user="etl_readonly", # Read-only user — ETL never writes to source password=os.environ["DB_PASSWORD"] ) query = """ SELECT order_id, customer_id, order_status, total_amount, currency_code, created_at, updated_at FROM orders WHERE updated_at >= %(last_run)s -- Incremental filter ORDER BY updated_at ASC """ df = pd.read_sql(query, conn, params={"last_run": last_run}) conn.close() print(f"Extracted {len(df)} orders since {last_run}") return df # Usage last_run = get_last_successful_run() # Load from metadata store raw_orders = extract_orders(last_run)
Python — REST API Extract with Pagination
import requests, time def extract_from_api(endpoint: str, api_key: str) -> list: """Paginate through all pages of a REST API.""" all_records = [] page = 1 page_size = 100 while True: response = requests.get( endpoint, headers={"Authorization": f"Bearer {api_key}"}, params={"page": page, "per_page": page_size}, timeout=30 ) response.raise_for_status() # Raise on 4xx/5xx errors data = response.json() records = data.get("results", []) if not records: break # No more pages all_records.extend(records) print(f"Page {page}: fetched {len(records)} records") if not data.get("has_more", False): break page += 1 time.sleep(0.2) # Rate limit courtesy pause return all_records
05

Transform in Detail

// WHERE THE LOGIC LIVES

Transformation is the heart of ETL. This is where raw, messy reality becomes structured, trustworthy data. Transformations range from trivial (renaming a column) to complex (sessionising event streams, slowly changing dimension tracking, multi-table joins with business logic).

Data Cleaning Always First
  • Strip leading/trailing whitespace from strings
  • Standardise case (e.g. all email → lowercase)
  • Parse and normalise date formats to UTC
  • Handle NULL / missing values (drop, impute, or flag)
  • Remove duplicate records
  • Fix encoding issues (UTF-8 normalisation)
Structural Changes Shape
  • Rename columns to consistent naming convention
  • Cast types (string "123" → integer 123)
  • Pivot / unpivot (wide → long, long → wide)
  • Flatten nested JSON structures
  • Split a column into multiple (e.g. full_name → first + last)
  • Combine columns (e.g. city + country → location)
Business Logic Add Value
  • Join with lookup/dimension tables (e.g. add country name from country_code)
  • Calculate derived metrics (revenue = quantity × unit_price)
  • Apply segmentation rules (user tier based on spend)
  • Aggregate to target grain (daily totals per region)
  • Flag anomalies or out-of-range values
Privacy & Compliance Mandatory
  • Hash or pseudonymise PII (emails, phone numbers)
  • Mask credit card numbers (store only last 4 digits)
  • Apply data retention rules (delete records older than N days)
  • Tag records with sensitivity classification
  • Enforce GDPR deletion requests (right to be forgotten)

Transform example — cleaning & enriching orders

Python / Pandas — Transform Stage
import pandas as pd import numpy as np def transform_orders(df: pd.DataFrame) -> pd.DataFrame: """Clean, validate, and enrich raw orders data.""" # ── 1. RENAME to consistent snake_case convention ── df = df.rename(columns={ "orderID": "order_id", "customerID": "customer_id", "TotalAmount": "total_amount_raw", }) # ── 2. CAST types ── df["total_amount_raw"] = pd.to_numeric(df["total_amount_raw"], errors="coerce") df["created_at"] = pd.to_datetime(df["created_at"], utc=True) # ── 3. VALIDATE — flag bad records instead of silently dropping ── df["is_valid"] = ( df["order_id"].notna() & df["total_amount_raw"] > 0 & df["order_status"].isin(["pending", "confirmed", "shipped", "cancelled"]) ) invalid = df[~df["is_valid"]] if len(invalid) > 0: print(f"⚠ {len(invalid)} invalid records — routing to quarantine") invalid.to_parquet("quarantine/orders_invalid.parquet") df = df[df["is_valid"]].drop(columns=["is_valid"]) # ── 4. DEDUPLICATE — keep latest record per order_id ── df = df.sort_values("updated_at", ascending=False) df = df.drop_duplicates(subset=["order_id"], keep="first") # ── 5. CURRENCY NORMALISATION — convert to USD ── fx_rates = {"GBP": 1.27, "EUR": 1.09, "USD": 1.0} df["fx_rate"] = df["currency_code"].map(fx_rates).fillna(1.0) df["total_amount_usd"] = (df["total_amount_raw"] * df["fx_rate"]).round(2) # ── 6. DERIVED FIELDS — add useful calculated columns ── df["order_date"] = df["created_at"].dt.date df["order_month"] = df["created_at"].dt.to_period("M").astype(str) df["is_high_value"] = df["total_amount_usd"] >= 500 # ── 7. FINAL COLUMN SELECTION — only what the warehouse needs ── return df[[ "order_id", "customer_id", "order_status", "total_amount_usd", "currency_code", "order_date", "order_month", "is_high_value", "created_at", "updated_at" ]]
Quarantine, don't drop: When a record fails validation, never silently drop it. Write it to a quarantine table or file with a rejection reason. This is how you discover upstream bugs, catch schema changes, and maintain an audit trail. "Silent drops" are how data quality issues stay hidden for months.
06

Load in Detail

// WRITING TO THE DESTINATION

Loading is the final step — persisting your clean, transformed data. The key decision is your load strategy. Get this wrong and you'll either re-process millions of rows unnecessarily or miss updates entirely.

StrategyDescriptionProsConsUse When
Full Load Truncate destination, reload everything Simple Slow at scale Small tables, no CDC, dim tables
Append Only Add new rows, never update existing Very fast Can't handle updates Immutable events (clicks, log lines)
Upsert (MERGE) Insert new, update existing by key Handles changes Complex queries Most transactional data
SCD Type 2 Keep full history of all changes with effective dates Full history Large storage Customer, product dimension tables
Partition Overwrite Replace only the affected time partition Efficient for time data Date-partitioned tables only Daily/monthly aggregation tables

Load example — upsert to PostgreSQL

Python — Upsert Load Pattern
import psycopg2 from psycopg2.extras import execute_values def load_orders(df, conn): """ Upsert: insert new orders, update existing ones by order_id. Uses PostgreSQL's ON CONFLICT DO UPDATE (UPSERT). """ records = [tuple(row) for row in df.itertuples(index=False)] upsert_sql = """ INSERT INTO warehouse.orders ( order_id, customer_id, order_status, total_amount_usd, order_date, order_month, is_high_value, created_at, updated_at ) VALUES %s ON CONFLICT (order_id) DO UPDATE SET -- If order exists, update it order_status = EXCLUDED.order_status, total_amount_usd = EXCLUDED.total_amount_usd, is_high_value = EXCLUDED.is_high_value, updated_at = EXCLUDED.updated_at """ with conn.cursor() as cur: execute_values(cur, upsert_sql, records, page_size=1000) conn.commit() print(f"Loaded {len(df)} records") # Putting it all together def run_pipeline(): last_run = get_last_run_time() raw = extract_orders(last_run) # Step 1: Extract clean = transform_orders(raw) # Step 2: Transform load_orders(clean, get_warehouse_conn()) # Step 3: Load save_run_time(datetime.utcnow()) # Record checkpoint run_pipeline()
07

When to Use ETL

// USE CASES & SCENARIOS

ETL is not always the answer. For a single CSV report, a script works fine. But once you have multiple sources, recurring schedules, SLA requirements, or more than one consumer of the data, ETL pipelines pay their investment back quickly. Here are the most common triggers.

🏪 Business Intelligence & Reporting Very Common

You want a dashboard showing daily sales, by region, by product category, updated every morning. Your data lives in a transactional MySQL database. Direct queries to MySQL are too slow and risk impacting the live site.

  • Extract: Pull yesterday's orders from MySQL each night at 2am
  • Transform: Join with product and region tables, calculate daily totals
  • Load: Write aggregated summary to Snowflake or Redshift
  • Result: BI tool (Tableau, Looker) queries the warehouse, not production
🔗 Merging Data from Multiple Systems Very Common

Your sales data is in Salesforce, billing in Stripe, support tickets in Zendesk. You want to understand the full customer lifecycle in one place — acquisition cost, revenue, and support cost per customer.

  • Extract: Pull from Salesforce API, Stripe API, and Zendesk API on a schedule
  • Transform: Map each system's customer identifier to a common customer_id, standardise date formats, resolve conflicting data
  • Load: Build a unified customer table in the data warehouse
🤖 Feature Engineering for Machine Learning Data Science

Training a churn prediction model requires a feature table: one row per customer, with columns for recency of last purchase, frequency, total spend, support ticket count, etc. These come from five different tables.

  • Extract: Pull from transactions, sessions, support, and product tables
  • Transform: Calculate RFM (Recency, Frequency, Monetary) features, aggregate events into windows (last 7d, 30d, 90d), handle nulls intelligently
  • Load: Write feature table to a feature store or ML training data path
🏛️ Data Migrations One-Time / Periodic

Migrating from an old CRM to a new one, or moving from an on-premise Oracle DB to a cloud warehouse. You need to transform the old schema into the new schema, preserving history.

  • Map old field names to new ones, handle structural differences
  • Apply data quality rules to clean up years of technical debt
  • Validate row counts and checksums between source and destination
  • Run in parallel during cutover to verify before switching live traffic
📋
Quick checklist — do you need ETL? If you answer yes to 2 or more: you have more than one data source; your report takes longer than acceptable to run; analysts query production databases directly; the same metric produces different numbers depending on who runs it; your data loads are manual / ad-hoc; you have compliance requirements for data lineage — yes, you need ETL.
08

Common ETL Patterns

// DESIGN PATTERNS YOU'LL USE REPEATEDLY
Idempotency Critical

A pipeline is idempotent if running it twice produces the same result as running it once. This is essential — pipelines fail and need to re-run. Design your loads as upserts (not appends), track watermarks/checkpoints, and test re-runs explicitly.

Watermark / Checkpoint Pattern

Store the timestamp of the last successful run in a metadata table. The next run reads from that timestamp. If the pipeline fails mid-run, it retries from the last checkpoint — not from scratch and not forward of failed data.

Staging Tables

Never write directly from extraction to the final table. Land raw data in a stg_ (staging) table first. Apply transforms to produce int_ (intermediate) tables. Final consumers see fct_ (fact) and dim_ (dimension) tables. This pattern (popularised by dbt) makes debugging vastly easier.

Dead Letter Queue

Records that fail validation or transformation go to a "dead letter" location — not silently dropped. Operators review these records, fix upstream data quality issues, and can re-process them. Essential for auditability and compliance.

Slowly Changing Dimensions (SCD)

Type 1: Overwrite old value (no history). Type 2: Add a new row with valid_from / valid_to dates to preserve full history — used for customer address, product price. Type 3: Add a "previous value" column. Type 2 is the most common for analytics.

Partitioning Strategy

Large tables (billions of rows) must be partitioned — typically by date. Each pipeline run overwrites only the affected date partition(s). This makes backfills fast, isolates failures, and dramatically reduces query scan costs in columnar warehouses.

Watermark pattern in SQL

SQL — Checkpoint / Watermark Pattern
-- Metadata table to track pipeline state CREATE TABLE etl_metadata.pipeline_runs ( pipeline_name TEXT NOT NULL, last_run_at TIMESTAMPTZ NOT NULL, rows_processed BIGINT, status TEXT CHECK (status IN ('success', 'failed', 'running')), PRIMARY KEY (pipeline_name) ); -- Get the watermark before running SELECT last_run_at FROM etl_metadata.pipeline_runs WHERE pipeline_name = 'orders_daily' AND status = 'success'; -- After success, update the watermark INSERT INTO etl_metadata.pipeline_runs (pipeline_name, last_run_at, rows_processed, status) VALUES ('orders_daily', NOW(), 4821, 'success') ON CONFLICT (pipeline_name) DO UPDATE SET last_run_at = EXCLUDED.last_run_at, rows_processed = EXCLUDED.rows_processed, status = EXCLUDED.status;
09

Tools & Frameworks

// THE ECOSYSTEM

The ETL tooling landscape is large. Here's an honest map — what each tool is actually for, when to use it, and when to skip it. You do not need all of these; start simple and only add complexity when you feel the pain of not having it.

ToolCategoryBest ForLearning Curve
Pandas Transform (Python) Small-to-medium datasets (<10M rows), exploratory pipelines, simple transforms Low
dbt Transform (SQL / ELT) SQL-based transforms inside a warehouse; testing, documentation, lineage built-in Low–Medium
Apache Spark Batch processing Very large datasets (billions of rows), complex distributed transforms High
Apache Airflow Orchestration Scheduling, dependency management, monitoring of multi-step pipelines Medium
Prefect / Dagster Orchestration (modern) Airflow alternative; better developer experience, Pythonic, cloud-native Low–Medium
Airbyte / Fivetran Data ingestion (Extract) 300+ pre-built connectors; manage the Extract phase without writing code Low
Apache Kafka Streaming / Event Real-time, event-driven pipelines; high-throughput streaming ingestion High
Great Expectations Data Quality Automated data quality checks; define expectations, validate on every run Medium
Recommended starter stack: Python + Pandas for transforms → SQLAlchemy for database connections → Prefect for scheduling → dbt once your warehouse transforms grow beyond a few scripts. Only add Spark when your data genuinely exceeds single-machine capacity. Start simple — complexity is free to add, expensive to remove.
10

Worked Examples

// END-TO-END CODE WALKTHROUGHS

Example 1 — CSV files into a database

A sales team drops a new CSV file into an S3 bucket every morning at 7am. You need to load it into your PostgreSQL warehouse, deduplicated, with invalid rows quarantined.

Python — CSV File ETL Pipeline
import pandas as pd import boto3, psycopg2, logging from io import StringIO logging.basicConfig(level=logging.INFO) log = logging.getLogger("sales_etl") # ─────────────────── EXTRACT ─────────────────── def extract_from_s3(bucket: str, key: str) -> pd.DataFrame: s3 = boto3.client("s3") obj = s3.get_object(Bucket=bucket, Key=key) df = pd.read_csv(StringIO(obj["Body"].read().decode("utf-8"))) log.info(f"Extracted {len(df)} rows from s3://{bucket}/{key}") return df # ─────────────────── TRANSFORM ─────────────────── def transform_sales(df: pd.DataFrame) -> tuple: # Standardise column names df.columns = df.columns.str.lower().str.replace(" ", "_") # Cast types df["sale_date"] = pd.to_datetime(df["sale_date"], dayfirst=True) df["amount"] = pd.to_numeric(df["amount"], errors="coerce") df["product_sku"] = df["product_sku"].str.upper().str.strip() # Separate valid and invalid mask = df["amount"].notna() & df["sale_id"].notna() & (df["amount"] > 0) valid, invalid = df[mask].copy(), df[~mask].copy() log.info(f"Valid: {len(valid)} | Invalid (quarantined): {len(invalid)}") # Deduplicate by sale_id — keep latest valid = valid.drop_duplicates(subset=["sale_id"], keep="last") # Derive helpful columns valid["year_month"] = valid["sale_date"].dt.strftime("%Y-%m") valid["loaded_at"] = pd.Timestamp.utcnow() return valid, invalid # ─────────────────── LOAD ─────────────────── def load_to_warehouse(df: pd.DataFrame, conn): with conn.cursor() as cur: for _, row in df.iterrows(): cur.execute(""" INSERT INTO sales.daily_sales (sale_id, sale_date, product_sku, amount, year_month, loaded_at) VALUES (%s, %s, %s, %s, %s, %s) ON CONFLICT (sale_id) DO UPDATE SET amount = EXCLUDED.amount, loaded_at = EXCLUDED.loaded_at """, (row.sale_id, row.sale_date, row.product_sku, row.amount, row.year_month, row.loaded_at)) conn.commit() log.info(f"Loaded {len(df)} rows to warehouse") # ─────────────────── ORCHESTRATE ─────────────────── if __name__ == "__main__": raw = extract_from_s3("my-data-bucket", "sales/today.csv") clean, rejected = transform_sales(raw) if len(rejected) > 0: rejected.to_csv("quarantine/sales_rejected.csv", index=False) conn = psycopg2.connect(os.environ["WAREHOUSE_URL"]) load_to_warehouse(clean, conn) conn.close()

Example 2 — dbt SQL transform (ELT)

In an ELT pattern, raw data lands in your warehouse first. dbt then transforms it into analytics-ready models using SQL.

dbt SQL — Staging Model (models/staging/stg_orders.sql)
-- models/staging/stg_orders.sql -- This is the first transform layer — rename, cast, clean raw data WITH source AS ( SELECT * FROM {{ source('ecommerce', 'orders') }} ), renamed AS ( SELECT -- Rename to snake_case orderID::TEXT AS order_id, customerID::TEXT AS customer_id, LOWER(orderStatus) AS order_status, -- Cast and normalise TotalAmount::NUMERIC(12,2) AS total_amount, UPPER(currencyCode) AS currency_code, -- Standardise timestamps to UTC created_at AT TIME ZONE 'UTC' AS created_at_utc, created_at::DATE AS order_date FROM source WHERE orderID IS NOT NULL AND TotalAmount > 0 ) SELECT * FROM renamed -- dbt tests live in schema.yml — they run automatically -- tests: unique(order_id), not_null(order_id), accepted_values(order_status)
dbt SQL — Fact Model (models/marts/fct_daily_revenue.sql)
-- models/marts/fct_daily_revenue.sql -- Aggregated fact table — what the BI tool and analysts query WITH orders AS ( SELECT * FROM {{ ref('stg_orders') }} -- ref() creates dependency lineage WHERE order_status = 'confirmed' ), fx_rates AS ( SELECT * FROM {{ ref('seed_fx_rates') }} ) SELECT o.order_date, COUNT(DISTINCT o.order_id) AS order_count, COUNT(DISTINCT o.customer_id) AS unique_customers, SUM(o.total_amount * fx.rate_to_usd) AS revenue_usd, AVG(o.total_amount * fx.rate_to_usd) AS avg_order_value_usd, SUM(o.total_amount * fx.rate_to_usd) FILTER (WHERE o.total_amount * fx.rate_to_usd >= 500) AS high_value_revenue_usd FROM orders o LEFT JOIN fx_rates fx USING (currency_code) GROUP BY o.order_date ORDER BY o.order_date DESC
11

Errors & Common Pitfalls

// WHAT BREAKS & HOW TO FIX IT
Schema Drift Very Common

The source system adds, removes, or renames a column. Your pipeline breaks silently or crashes loudly. Fix: Validate schemas at extraction time. Use schema registries for streaming. Write contracts/tests for expected columns.

Timezone Bugs Painful

Source A sends UTC, source B sends local time, source C omits timezone entirely. After joining you have off-by-hours errors. Fix: Store everything as UTC from the first transform step. Convert at query time for display only.

Duplicate Records

Re-running a pipeline that appends (rather than upserts) double-counts records. A failed run at step 3/4 restarts from step 1 and loads rows twice. Fix: Always use upsert semantics. Make every pipeline idempotent.

Silent Data Loss

A transform drops rows that fail validation without logging them. A month later: "where did these 4,000 customer records go?" Fix: Always write rejected records to a quarantine location with a rejection reason.

Not Testing Transforms

You ship a transform that looks right but silently produces wrong aggregations for edge cases (empty strings treated as nulls, etc.). Fix: Unit test transform functions with known inputs and expected outputs. Use dbt tests or Great Expectations for warehouse data.

Hardcoded Credentials

Database passwords in Python scripts, pushed to git. Credentials in environment variables baked into Docker images. Fix: Always use a secrets manager (AWS Secrets Manager, HashiCorp Vault, environment variables injected at runtime). Never commit secrets.

The NULL problem: NULL means different things in different systems — "unknown", "not applicable", "zero", or "this field didn't exist yet". Define explicit rules for NULL handling in your transform spec before you write a line of code. "NULL revenue means zero" vs "NULL revenue means we don't know" produce completely different business insights.
12

Your ETL Maturity Path

// WHERE YOU ARE & WHERE TO GO NEXT

ETL maturity is a journey, not a destination. Don't try to build a Spotify-grade data platform on day one. Build what you need, learn what hurts, then level up. Here's the honest path most data teams walk.

Level 01
Manual / Scripted
  • Python or SQL scripts run by hand
  • No scheduling — someone remembers
  • No error handling or logging
  • Credentials in plaintext
  • Full re-loads every time
  • One person knows how it works
Level 02
Automated & Reliable
  • Scheduled (cron, Airflow, Prefect)
  • Error handling + alerting on failure
  • Incremental loads where possible
  • Secrets in environment / secrets manager
  • Invalid records quarantined and logged
  • Basic data quality checks pass/fail
Level 03
Observable & Governed
  • dbt models with tests + documentation
  • Data lineage visible end-to-end
  • SLA monitoring (pipelines complete by X)
  • Row count & statistical anomaly checks
  • PII identified, masked, & governed
  • Multiple team members can maintain
WEEK 1-2
Write your first pipeline
One source → Pandas transform → database
MONTH 1
Add scheduling & alerting
Prefect or Airflow, Slack alerts on failure
MONTH 2–3
Introduce dbt
SQL transforms, tests, staging → mart pattern
MONTH 4–6
Data quality layer
Great Expectations or dbt tests, quarantine queues
MONTH 6+
Governance & lineage
Cataloguing (DataHub), PII classification, SLAs
📚
Resources to go deeper: docs.getdbt.com — the best free SQL transform education · docs.prefect.io — modern orchestration docs · Designing Data-Intensive Applications (Kleppmann) — the canonical systems book · Fundamentals of Data Engineering (Reis & Housley) — end-to-end data engineering overview · greatexpectations.io — automated data quality · github.com/great-expectations/great_expectations for examples.