Data Engineering Field Guide

Extract,
Transform,
Load

The backbone of every data-driven organisation

A practical, no-jargon guide to understanding, designing, and running ETL pipelines — from your first script to production-grade systems. Written for analysts, engineers, and anyone who's ever asked "where does this data actually come from?"

Beginner Friendly Python Examples SQL Patterns Real Scenarios Best Practices

What Is ETL?

// DEFINITION & INTUITION

ETL stands for Extract, Transform, Load. It is the process of pulling raw data from one or more sources, reshaping it into a usable form, and storing it in a destination where it can be queried and analysed. Think of it as the plumbing of the data world — invisible when it works, catastrophic when it doesn't.

Every time you look at a business dashboard, run a report, or train a machine learning model, there is almost certainly an ETL pipeline somewhere in the chain that made that data available and coherent.

Extract

Read from source
Databases, APIs, files
Raw, messy data
No changes yet

⟶

Transform

Clean & validate
Reshape structure
Enrich & aggregate
Business logic

⟶

Load

Write to destination
Data warehouse, lake
Structured & ready
Analysts can query

💡

Analogy: Imagine you run a bakery with suppliers in three countries. ETL is like: going to each supplier to pick up raw ingredients (Extract), converting foreign measurements and removing bad produce (Transform), then stocking your shelves in a consistent layout (Load). Your bakers (analysts) now work from clean, organised ingredients.

ETL vs ELT — a quick distinction

📦 ETL — Transform First

Transform data before loading into the destination
Lighter storage requirements — only clean data lands
Better for sensitive data (scrub PII before it touches the warehouse)
Traditional approach — suited to on-premise data warehouses
Transformation logic lives in the pipeline (Python, Spark, etc.)

🏗️ ELT — Load First (Modern)

Load raw data first, transform inside the warehouse using SQL
Raw data always available for re-processing
Cloud warehouses (BigQuery, Snowflake, Redshift) are cheap & fast
dbt (data build tool) is the dominant ELT transform layer
Better for exploration — analysts can define their own transforms

Why ETL Matters

// THE BUSINESS CASE

Data rarely arrives in the format you need. Systems speak different languages — one uses user_id, another uses userId, another uses an integer vs. a UUID. Timestamps come in three timezones. Null values mean different things. Duplicates creep in. ETL solves the "data integration" problem at scale.

Single Source of Truth

ETL consolidates data from dozens of systems into one reliable destination. Sales, marketing, and engineering stop arguing about whose numbers are right — there's one canonical dataset.

Data Quality Enforcement

Business rules live in the transform stage. Malformed records get caught, logged, and quarantined — not silently corrupting downstream reports. Quality gates run on every pipeline execution.

Analytics at Scale

Pre-aggregated, modelled data in a warehouse is orders of magnitude faster to query than joining raw operational tables. Analysts work with optimised marts, not raw OLTP tables.

⚠

Without ETL: Analysts write ad-hoc queries against production databases, slowing them down. Every analyst builds their own version of "customer count" differently. There's no audit trail for data changes. Reports break when source schemas change. This is the chaos ETL prevents.

The Three Stages

// OVERVIEW OF EACH PHASE

① Extract — Reading from Sources Read-Only

The Extract phase connects to one or more data sources and reads records. The critical rule: never modify the source. You are a reader, not a writer. Sources include relational databases (PostgreSQL, MySQL), APIs (REST, GraphQL), flat files (CSV, JSON, Parquet), cloud storage (S3, GCS), event streams (Kafka), and third-party SaaS products (Salesforce, Stripe, HubSpot).

② Transform — Reshaping Data Core Logic

The Transform phase is where business logic lives. This is the most complex and valuable part of ETL. It includes: cleaning (trim whitespace, fix encodings), validation (check constraints, schema conformance), deduplication, type casting, joining datasets, aggregating, deriving calculated fields, applying business rules ("a churned customer is one with no activity in 90 days"), and privacy operations (hash/mask PII).

③ Load — Writing to Destination Persist

The Load phase writes the transformed data to its destination. Key decisions here: full load (truncate and rewrite everything, simple but expensive) vs. incremental load (only new or changed records, efficient but complex). The destination might be a data warehouse, data lake, a reporting database, or a downstream application.

Extract in Detail

// READING FROM THE WILD

Extraction is deceptively difficult. Sources are unreliable, APIs rate-limit you, schemas change without notice, and connections time out. A robust extract layer handles failures gracefully, logs everything, and is idempotent — safe to re-run without double-counting.

Common extraction patterns

Pattern	How it works	Best for
Full Extract	Read all records from the source every time	Small tables, no change tracking available
Incremental (Timestamp)	Filter by `updated_at > last_run_time`	Tables with reliable `updated_at` columns
CDC — Change Data Capture	Read the database transaction log for changes	High-volume tables, near-real-time needs
API Pagination	Iterate through pages/cursors of an API response	REST APIs with paginated endpoints
File Drop	Watch a folder/bucket for new files, process on arrival	Partner data drops, legacy exports

Extract example — database + API

Python — Incremental Database Extract

import psycopg2
import pandas as pd
from datetime import datetime

def extract_orders(last_run: datetime) -> pd.DataFrame:
    """
    Extract only orders created/updated since last pipeline run.
    Safe to re-run — uses >= not > to catch any edge cases.
    """
    conn = psycopg2.connect(
        host="db.prod.internal",
        database="ecommerce",
        user="etl_readonly",     # Read-only user — ETL never writes to source
        password=os.environ["DB_PASSWORD"]
    )

    query = """
        SELECT
            order_id,
            customer_id,
            order_status,
            total_amount,
            currency_code,
            created_at,
            updated_at
        FROM orders
        WHERE updated_at >= %(last_run)s   -- Incremental filter
        ORDER BY updated_at ASC
    """

    df = pd.read_sql(query, conn, params={"last_run": last_run})
    conn.close()

    print(f"Extracted {len(df)} orders since {last_run}")
    return df

# Usage
last_run = get_last_successful_run()   # Load from metadata store
raw_orders = extract_orders(last_run)

Python — REST API Extract with Pagination

import requests, time

def extract_from_api(endpoint: str, api_key: str) -> list:
    """Paginate through all pages of a REST API."""
    all_records = []
    page = 1
    page_size = 100

    while True:
        response = requests.get(
            endpoint,
            headers={"Authorization": f"Bearer {api_key}"},
            params={"page": page, "per_page": page_size},
            timeout=30
        )
        response.raise_for_status()   # Raise on 4xx/5xx errors
        data = response.json()

        records = data.get("results", [])
        if not records:
            break   # No more pages

        all_records.extend(records)
        print(f"Page {page}: fetched {len(records)} records")

        if not data.get("has_more", False):
            break

        page += 1
        time.sleep(0.2)   # Rate limit courtesy pause

    return all_records

Transform in Detail

// WHERE THE LOGIC LIVES

Transformation is the heart of ETL. This is where raw, messy reality becomes structured, trustworthy data. Transformations range from trivial (renaming a column) to complex (sessionising event streams, slowly changing dimension tracking, multi-table joins with business logic).

Data Cleaning Always First

Strip leading/trailing whitespace from strings
Standardise case (e.g. all email → lowercase)
Parse and normalise date formats to UTC
Handle NULL / missing values (drop, impute, or flag)
Remove duplicate records
Fix encoding issues (UTF-8 normalisation)

Structural Changes Shape

Rename columns to consistent naming convention
Cast types (string "123" → integer 123)
Pivot / unpivot (wide → long, long → wide)
Flatten nested JSON structures
Split a column into multiple (e.g. full_name → first + last)
Combine columns (e.g. city + country → location)

Business Logic Add Value

Join with lookup/dimension tables (e.g. add country name from country_code)
Calculate derived metrics (revenue = quantity × unit_price)
Apply segmentation rules (user tier based on spend)
Aggregate to target grain (daily totals per region)
Flag anomalies or out-of-range values

Privacy & Compliance Mandatory

Hash or pseudonymise PII (emails, phone numbers)
Mask credit card numbers (store only last 4 digits)
Apply data retention rules (delete records older than N days)
Tag records with sensitivity classification
Enforce GDPR deletion requests (right to be forgotten)

Transform example — cleaning & enriching orders

Python / Pandas — Transform Stage

import pandas as pd
import numpy as np

def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and enrich raw orders data."""

    # ── 1. RENAME to consistent snake_case convention ──
    df = df.rename(columns={
        "orderID": "order_id",
        "customerID": "customer_id",
        "TotalAmount": "total_amount_raw",
    })

    # ── 2. CAST types ──
    df["total_amount_raw"] = pd.to_numeric(df["total_amount_raw"], errors="coerce")
    df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

    # ── 3. VALIDATE — flag bad records instead of silently dropping ──
    df["is_valid"] = (
        df["order_id"].notna() &
        df["total_amount_raw"] > 0 &
        df["order_status"].isin(["pending", "confirmed", "shipped", "cancelled"])
    )
    invalid = df[~df["is_valid"]]
    if len(invalid) > 0:
        print(f"⚠ {len(invalid)} invalid records — routing to quarantine")
        invalid.to_parquet("quarantine/orders_invalid.parquet")
    df = df[df["is_valid"]].drop(columns=["is_valid"])

    # ── 4. DEDUPLICATE — keep latest record per order_id ──
    df = df.sort_values("updated_at", ascending=False)
    df = df.drop_duplicates(subset=["order_id"], keep="first")

    # ── 5. CURRENCY NORMALISATION — convert to USD ──
    fx_rates = {"GBP": 1.27, "EUR": 1.09, "USD": 1.0}
    df["fx_rate"] = df["currency_code"].map(fx_rates).fillna(1.0)
    df["total_amount_usd"] = (df["total_amount_raw"] * df["fx_rate"]).round(2)

    # ── 6. DERIVED FIELDS — add useful calculated columns ──
    df["order_date"] = df["created_at"].dt.date
    df["order_month"] = df["created_at"].dt.to_period("M").astype(str)
    df["is_high_value"] = df["total_amount_usd"] >= 500

    # ── 7. FINAL COLUMN SELECTION — only what the warehouse needs ──
    return df[[
        "order_id", "customer_id", "order_status",
        "total_amount_usd", "currency_code",
        "order_date", "order_month", "is_high_value",
        "created_at", "updated_at"
    ]]

✔

Quarantine, don't drop: When a record fails validation, never silently drop it. Write it to a quarantine table or file with a rejection reason. This is how you discover upstream bugs, catch schema changes, and maintain an audit trail. "Silent drops" are how data quality issues stay hidden for months.

Load in Detail

// WRITING TO THE DESTINATION

Loading is the final step — persisting your clean, transformed data. The key decision is your load strategy. Get this wrong and you'll either re-process millions of rows unnecessarily or miss updates entirely.

Strategy	Description	Pros	Cons	Use When
Full Load	Truncate destination, reload everything	Simple	Slow at scale	Small tables, no CDC, dim tables
Append Only	Add new rows, never update existing	Very fast	Can't handle updates	Immutable events (clicks, log lines)
Upsert (MERGE)	Insert new, update existing by key	Handles changes	Complex queries	Most transactional data
SCD Type 2	Keep full history of all changes with effective dates	Full history	Large storage	Customer, product dimension tables
Partition Overwrite	Replace only the affected time partition	Efficient for time data	Date-partitioned tables only	Daily/monthly aggregation tables

Load example — upsert to PostgreSQL

Python — Upsert Load Pattern

import psycopg2
from psycopg2.extras import execute_values

def load_orders(df, conn):
    """
    Upsert: insert new orders, update existing ones by order_id.
    Uses PostgreSQL's ON CONFLICT DO UPDATE (UPSERT).
    """
    records = [tuple(row) for row in df.itertuples(index=False)]

    upsert_sql = """
        INSERT INTO warehouse.orders (
            order_id, customer_id, order_status,
            total_amount_usd, order_date, order_month,
            is_high_value, created_at, updated_at
        ) VALUES %s
        ON CONFLICT (order_id) DO UPDATE SET   -- If order exists, update it
            order_status    = EXCLUDED.order_status,
            total_amount_usd = EXCLUDED.total_amount_usd,
            is_high_value   = EXCLUDED.is_high_value,
            updated_at      = EXCLUDED.updated_at
    """

    with conn.cursor() as cur:
        execute_values(cur, upsert_sql, records, page_size=1000)

    conn.commit()
    print(f"Loaded {len(df)} records")

# Putting it all together
def run_pipeline():
    last_run = get_last_run_time()
    raw    = extract_orders(last_run)      # Step 1: Extract
    clean  = transform_orders(raw)         # Step 2: Transform
    load_orders(clean, get_warehouse_conn()) # Step 3: Load
    save_run_time(datetime.utcnow())       # Record checkpoint

run_pipeline()

When to Use ETL

// USE CASES & SCENARIOS

ETL is not always the answer. For a single CSV report, a script works fine. But once you have multiple sources, recurring schedules, SLA requirements, or more than one consumer of the data, ETL pipelines pay their investment back quickly. Here are the most common triggers.

🏪 Business Intelligence & Reporting Very Common

You want a dashboard showing daily sales, by region, by product category, updated every morning. Your data lives in a transactional MySQL database. Direct queries to MySQL are too slow and risk impacting the live site.

Extract: Pull yesterday's orders from MySQL each night at 2am
Transform: Join with product and region tables, calculate daily totals
Load: Write aggregated summary to Snowflake or Redshift
Result: BI tool (Tableau, Looker) queries the warehouse, not production

🔗 Merging Data from Multiple Systems Very Common

Your sales data is in Salesforce, billing in Stripe, support tickets in Zendesk. You want to understand the full customer lifecycle in one place — acquisition cost, revenue, and support cost per customer.

Extract: Pull from Salesforce API, Stripe API, and Zendesk API on a schedule
Transform: Map each system's customer identifier to a common customer_id, standardise date formats, resolve conflicting data
Load: Build a unified customer table in the data warehouse

🤖 Feature Engineering for Machine Learning Data Science

Training a churn prediction model requires a feature table: one row per customer, with columns for recency of last purchase, frequency, total spend, support ticket count, etc. These come from five different tables.

Extract: Pull from transactions, sessions, support, and product tables
Transform: Calculate RFM (Recency, Frequency, Monetary) features, aggregate events into windows (last 7d, 30d, 90d), handle nulls intelligently
Load: Write feature table to a feature store or ML training data path

🏛️ Data Migrations One-Time / Periodic

Migrating from an old CRM to a new one, or moving from an on-premise Oracle DB to a cloud warehouse. You need to transform the old schema into the new schema, preserving history.

Map old field names to new ones, handle structural differences
Apply data quality rules to clean up years of technical debt
Validate row counts and checksums between source and destination
Run in parallel during cutover to verify before switching live traffic

📋

Quick checklist — do you need ETL? If you answer yes to 2 or more: you have more than one data source; your report takes longer than acceptable to run; analysts query production databases directly; the same metric produces different numbers depending on who runs it; your data loads are manual / ad-hoc; you have compliance requirements for data lineage — yes, you need ETL.

Common ETL Patterns

// DESIGN PATTERNS YOU'LL USE REPEATEDLY

Idempotency Critical

A pipeline is idempotent if running it twice produces the same result as running it once. This is essential — pipelines fail and need to re-run. Design your loads as upserts (not appends), track watermarks/checkpoints, and test re-runs explicitly.

Watermark / Checkpoint Pattern

Store the timestamp of the last successful run in a metadata table. The next run reads from that timestamp. If the pipeline fails mid-run, it retries from the last checkpoint — not from scratch and not forward of failed data.

Staging Tables

Never write directly from extraction to the final table. Land raw data in a stg_ (staging) table first. Apply transforms to produce int_ (intermediate) tables. Final consumers see fct_ (fact) and dim_ (dimension) tables. This pattern (popularised by dbt) makes debugging vastly easier.

Dead Letter Queue

Records that fail validation or transformation go to a "dead letter" location — not silently dropped. Operators review these records, fix upstream data quality issues, and can re-process them. Essential for auditability and compliance.

Slowly Changing Dimensions (SCD)

Type 1: Overwrite old value (no history). Type 2: Add a new row with valid_from / valid_to dates to preserve full history — used for customer address, product price. Type 3: Add a "previous value" column. Type 2 is the most common for analytics.

Partitioning Strategy

Large tables (billions of rows) must be partitioned — typically by date. Each pipeline run overwrites only the affected date partition(s). This makes backfills fast, isolates failures, and dramatically reduces query scan costs in columnar warehouses.

Watermark pattern in SQL

SQL — Checkpoint / Watermark Pattern

-- Metadata table to track pipeline state
CREATE TABLE etl_metadata.pipeline_runs (
    pipeline_name   TEXT        NOT NULL,
    last_run_at     TIMESTAMPTZ NOT NULL,
    rows_processed  BIGINT,
    status          TEXT        CHECK (status IN ('success', 'failed', 'running')),
    PRIMARY KEY (pipeline_name)
);

-- Get the watermark before running
SELECT last_run_at
FROM   etl_metadata.pipeline_runs
WHERE  pipeline_name = 'orders_daily'
  AND  status = 'success';

-- After success, update the watermark
INSERT INTO etl_metadata.pipeline_runs
  (pipeline_name, last_run_at, rows_processed, status)
VALUES
  ('orders_daily', NOW(), 4821, 'success')
ON CONFLICT (pipeline_name) DO UPDATE SET
  last_run_at    = EXCLUDED.last_run_at,
  rows_processed = EXCLUDED.rows_processed,
  status         = EXCLUDED.status;

Tools & Frameworks

// THE ECOSYSTEM

The ETL tooling landscape is large. Here's an honest map — what each tool is actually for, when to use it, and when to skip it. You do not need all of these; start simple and only add complexity when you feel the pain of not having it.

Tool	Category	Best For	Learning Curve
Pandas	Transform (Python)	Small-to-medium datasets (<10M rows), exploratory pipelines, simple transforms	Low
dbt	Transform (SQL / ELT)	SQL-based transforms inside a warehouse; testing, documentation, lineage built-in	Low–Medium
Apache Spark	Batch processing	Very large datasets (billions of rows), complex distributed transforms	High
Apache Airflow	Orchestration	Scheduling, dependency management, monitoring of multi-step pipelines	Medium
Prefect / Dagster	Orchestration (modern)	Airflow alternative; better developer experience, Pythonic, cloud-native	Low–Medium
Airbyte / Fivetran	Data ingestion (Extract)	300+ pre-built connectors; manage the Extract phase without writing code	Low
Apache Kafka	Streaming / Event	Real-time, event-driven pipelines; high-throughput streaming ingestion	High
Great Expectations	Data Quality	Automated data quality checks; define expectations, validate on every run	Medium

✔

Recommended starter stack: Python + Pandas for transforms → SQLAlchemy for database connections → Prefect for scheduling → dbt once your warehouse transforms grow beyond a few scripts. Only add Spark when your data genuinely exceeds single-machine capacity. Start simple — complexity is free to add, expensive to remove.

Worked Examples

// END-TO-END CODE WALKTHROUGHS

Example 1 — CSV files into a database

A sales team drops a new CSV file into an S3 bucket every morning at 7am. You need to load it into your PostgreSQL warehouse, deduplicated, with invalid rows quarantined.

Python — CSV File ETL Pipeline

import pandas as pd
import boto3, psycopg2, logging
from io import StringIO

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sales_etl")

# ─────────────────── EXTRACT ───────────────────
def extract_from_s3(bucket: str, key: str) -> pd.DataFrame:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(StringIO(obj["Body"].read().decode("utf-8")))
    log.info(f"Extracted {len(df)} rows from s3://{bucket}/{key}")
    return df

# ─────────────────── TRANSFORM ───────────────────
def transform_sales(df: pd.DataFrame) -> tuple:
    # Standardise column names
    df.columns = df.columns.str.lower().str.replace(" ", "_")

    # Cast types
    df["sale_date"]  = pd.to_datetime(df["sale_date"], dayfirst=True)
    df["amount"]     = pd.to_numeric(df["amount"], errors="coerce")
    df["product_sku"] = df["product_sku"].str.upper().str.strip()

    # Separate valid and invalid
    mask = df["amount"].notna() & df["sale_id"].notna() & (df["amount"] > 0)
    valid, invalid = df[mask].copy(), df[~mask].copy()
    log.info(f"Valid: {len(valid)}  |  Invalid (quarantined): {len(invalid)}")

    # Deduplicate by sale_id — keep latest
    valid = valid.drop_duplicates(subset=["sale_id"], keep="last")

    # Derive helpful columns
    valid["year_month"]  = valid["sale_date"].dt.strftime("%Y-%m")
    valid["loaded_at"]   = pd.Timestamp.utcnow()

    return valid, invalid

# ─────────────────── LOAD ───────────────────
def load_to_warehouse(df: pd.DataFrame, conn):
    with conn.cursor() as cur:
        for _, row in df.iterrows():
            cur.execute("""
                INSERT INTO sales.daily_sales
                  (sale_id, sale_date, product_sku, amount, year_month, loaded_at)
                VALUES (%s, %s, %s, %s, %s, %s)
                ON CONFLICT (sale_id) DO UPDATE SET
                  amount    = EXCLUDED.amount,
                  loaded_at = EXCLUDED.loaded_at
            """, (row.sale_id, row.sale_date, row.product_sku,
                     row.amount, row.year_month, row.loaded_at))
    conn.commit()
    log.info(f"Loaded {len(df)} rows to warehouse")

# ─────────────────── ORCHESTRATE ───────────────────
if __name__ == "__main__":
    raw             = extract_from_s3("my-data-bucket", "sales/today.csv")
    clean, rejected = transform_sales(raw)
    if len(rejected) > 0:
        rejected.to_csv("quarantine/sales_rejected.csv", index=False)
    conn = psycopg2.connect(os.environ["WAREHOUSE_URL"])
    load_to_warehouse(clean, conn)
    conn.close()

Example 2 — dbt SQL transform (ELT)

In an ELT pattern, raw data lands in your warehouse first. dbt then transforms it into analytics-ready models using SQL.

dbt SQL — Staging Model (models/staging/stg_orders.sql)

-- models/staging/stg_orders.sql
-- This is the first transform layer — rename, cast, clean raw data

WITH source AS (
    SELECT * FROM {{ source('ecommerce', 'orders') }}
),

renamed AS (
    SELECT
        -- Rename to snake_case
        orderID::TEXT               AS order_id,
        customerID::TEXT            AS customer_id,
        LOWER(orderStatus)          AS order_status,

        -- Cast and normalise
        TotalAmount::NUMERIC(12,2)   AS total_amount,
        UPPER(currencyCode)         AS currency_code,

        -- Standardise timestamps to UTC
        created_at AT TIME ZONE 'UTC' AS created_at_utc,
        created_at::DATE            AS order_date

    FROM source
    WHERE orderID IS NOT NULL
      AND TotalAmount > 0
)

SELECT * FROM renamed

-- dbt tests live in schema.yml — they run automatically
-- tests: unique(order_id), not_null(order_id), accepted_values(order_status)

dbt SQL — Fact Model (models/marts/fct_daily_revenue.sql)

-- models/marts/fct_daily_revenue.sql
-- Aggregated fact table — what the BI tool and analysts query

WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}  -- ref() creates dependency lineage
    WHERE order_status = 'confirmed'
),

fx_rates AS (
    SELECT * FROM {{ ref('seed_fx_rates') }}
)

SELECT
    o.order_date,
    COUNT(DISTINCT o.order_id)                          AS order_count,
    COUNT(DISTINCT o.customer_id)                       AS unique_customers,
    SUM(o.total_amount * fx.rate_to_usd)                AS revenue_usd,
    AVG(o.total_amount * fx.rate_to_usd)                AS avg_order_value_usd,
    SUM(o.total_amount * fx.rate_to_usd)
        FILTER (WHERE o.total_amount * fx.rate_to_usd >= 500)  AS high_value_revenue_usd

FROM       orders o
LEFT JOIN  fx_rates fx USING (currency_code)
GROUP BY   o.order_date
ORDER BY   o.order_date DESC

Errors & Common Pitfalls

// WHAT BREAKS & HOW TO FIX IT

Schema Drift Very Common

The source system adds, removes, or renames a column. Your pipeline breaks silently or crashes loudly. Fix: Validate schemas at extraction time. Use schema registries for streaming. Write contracts/tests for expected columns.

Timezone Bugs Painful

Source A sends UTC, source B sends local time, source C omits timezone entirely. After joining you have off-by-hours errors. Fix: Store everything as UTC from the first transform step. Convert at query time for display only.

Duplicate Records

Re-running a pipeline that appends (rather than upserts) double-counts records. A failed run at step 3/4 restarts from step 1 and loads rows twice. Fix: Always use upsert semantics. Make every pipeline idempotent.

Silent Data Loss

A transform drops rows that fail validation without logging them. A month later: "where did these 4,000 customer records go?" Fix: Always write rejected records to a quarantine location with a rejection reason.

Not Testing Transforms

You ship a transform that looks right but silently produces wrong aggregations for edge cases (empty strings treated as nulls, etc.). Fix: Unit test transform functions with known inputs and expected outputs. Use dbt tests or Great Expectations for warehouse data.

Hardcoded Credentials

Database passwords in Python scripts, pushed to git. Credentials in environment variables baked into Docker images. Fix: Always use a secrets manager (AWS Secrets Manager, HashiCorp Vault, environment variables injected at runtime). Never commit secrets.

⚠

The NULL problem: NULL means different things in different systems — "unknown", "not applicable", "zero", or "this field didn't exist yet". Define explicit rules for NULL handling in your transform spec before you write a line of code. "NULL revenue means zero" vs "NULL revenue means we don't know" produce completely different business insights.

Your ETL Maturity Path

// WHERE YOU ARE & WHERE TO GO NEXT

ETL maturity is a journey, not a destination. Don't try to build a Spotify-grade data platform on day one. Build what you need, learn what hurts, then level up. Here's the honest path most data teams walk.

Level 01

Manual / Scripted

Python or SQL scripts run by hand
No scheduling — someone remembers
No error handling or logging
Credentials in plaintext
Full re-loads every time
One person knows how it works

Level 02

Automated & Reliable

Scheduled (cron, Airflow, Prefect)
Error handling + alerting on failure
Incremental loads where possible
Secrets in environment / secrets manager
Invalid records quarantined and logged
Basic data quality checks pass/fail

Level 03

Observable & Governed

dbt models with tests + documentation
Data lineage visible end-to-end
SLA monitoring (pipelines complete by X)
Row count & statistical anomaly checks
PII identified, masked, & governed
Multiple team members can maintain

WEEK 1-2

Write your first pipeline

One source → Pandas transform → database

MONTH 1

Add scheduling & alerting

Prefect or Airflow, Slack alerts on failure

MONTH 2–3

Introduce dbt

SQL transforms, tests, staging → mart pattern

MONTH 4–6

Data quality layer

Great Expectations or dbt tests, quarantine queues

MONTH 6+

Governance & lineage

Cataloguing (DataHub), PII classification, SLAs

📚

Resources to go deeper: docs.getdbt.com — the best free SQL transform education · docs.prefect.io — modern orchestration docs · Designing Data-Intensive Applications (Kleppmann) — the canonical systems book · Fundamentals of Data Engineering (Reis & Housley) — end-to-end data engineering overview · greatexpectations.io — automated data quality · github.com/great-expectations/great_expectations for examples.