Data Engineering Field Guide

ELT
Unpacked

// Extract · Load · Transform

A practical guide to the modern data pipeline pattern powering analytics, machine learning, and business intelligence at scale — written for engineers at every level.

Cloud-Native Data Warehousing dbt & SQL Beginner Friendly

What is ELT?

// THE MODERN DATA PIPELINE PARADIGM

ELT stands for Extract, Load, Transform. It is a data integration pattern where raw data is first pulled from source systems, loaded directly into a destination (usually a cloud data warehouse), and then transformed in-place using the warehouse's own compute power.

Think of it like this: instead of cooking your ingredients before putting them in the fridge, you store everything raw and cook on demand. The warehouse is your industrial kitchen — it handles the heavy lifting.

Extract

Pull raw data from
databases, APIs,
SaaS tools, files

→

Load

Write raw data
directly into the
data warehouse

→

Transform

Clean, model & shape
data using SQL
inside the warehouse

Raw Data First

Nothing is thrown away. Raw data lands in the warehouse exactly as it arrived. You can always re-transform if requirements change.

Warehouse as Engine

Transformation logic runs inside BigQuery, Snowflake, or Redshift — leveraging massively parallel compute you'd never get on a single server.

SQL as the Language

Analysts and engineers both speak SQL. ELT democratizes data transformation — you don't need Spark or Java to build production-grade models.

ℹ

ELT vs ETL in one sentence: ETL transforms data before it reaches the destination. ELT transforms data after it lands. This seemingly small difference changes everything about how you architect, maintain, and scale a data platform.

ETL vs ELT

// UNDERSTANDING THE SHIFT

The traditional ETL pattern dominated data engineering for decades. It was born in an era when storage was expensive, cloud warehouses didn't exist, and transformation had to happen before data touched the destination. Cloud changed everything.

✕ ETL (Legacy Pattern)

Transform happens outside the warehouse (Spark, SSIS, Informatica)
Only clean, pre-modeled data enters the warehouse
Raw data often discarded — no historical re-processing
Changes to logic require re-running entire pipeline
Transformation code in Python, Java, Scala
Requires specialized engineering to maintain
Works well with on-premise databases and limited storage

✓ ELT (Modern Pattern)

Transform happens inside the warehouse using SQL
Raw data is preserved — full audit trail and replay
Change logic, re-run transform only — no re-extraction
SQL democratizes who can write transformations
Storage is cheap — keep everything raw
dbt, Dataform, or plain SQL handles transformation layer
Purpose-built for cloud warehouses (Snowflake, BQ, Redshift)

Dimension	ETL	ELT
Where does transform run?	External server or middleware	Inside the data warehouse
Raw data preserved?	Usually not	Yes — always
Primary language	Python, Java, Scala	SQL (+ optional Python)
Latency	Higher — data processed before landing	Lower — load fast, transform later
Best for	Sensitive data needing masking before storage, complex ML pipelines	Analytics, BI, reporting, most data teams
Typical tools	Informatica, SSIS, Spark, AWS Glue	Fivetran, Airbyte + dbt + Snowflake/BQ
Team skill requirement	Data engineers (Scala/Python)	Data/analytics engineers (SQL)

⚡

Neither is "better" universally. ELT is the right default for modern analytics teams. ETL still wins when: data must be masked before storage for compliance (HIPAA, GDPR PII scrubbing), you're ingesting into a system without SQL-based compute, or your transformations are too complex for SQL alone.

When to Use ELT

// DECISION GUIDE

ELT is the right choice for the vast majority of modern data teams. Use this guide to decide. If you're building a new analytics platform from scratch on a cloud warehouse, ELT is your default answer.

✓ Choose ELT When Use This

You're using a cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks)
Your team knows SQL better than Python/Scala
You need an audit trail of raw data for debugging or re-processing
Your transformation logic changes frequently
You want analysts to write their own data models
You're building a modern analytics stack from scratch
Storage cost is not a constraint (cloud storage is cheap)
You need fast initial data availability before transform is complete

✗ Reconsider ELT When Evaluate

PII or sensitive data must be masked before reaching the warehouse
Destination system lacks SQL compute (e.g., flat file store, old RDBMS)
Your transformations require complex ML or non-SQL logic as a first step
Extreme low-latency requirements (sub-second streaming with complex transforms)
Regulatory requirements mandate data never lands raw
Your data volumes are tiny and a simple script is more pragmatic

ELT by Use Case

Use Case	ELT Fit?	Why
Business Intelligence / Dashboards	IDEAL	Analytics queries run against transformed warehouse models — exactly what ELT produces
Marketing Attribution	IDEAL	Multiple SaaS sources (Google Ads, Salesforce, HubSpot) unified in warehouse via ELT
Financial Reporting	STRONG	Raw transaction data preserved, transformations versioned and auditable
ML Feature Engineering	PARTIAL	ELT for feature tables; Python/Spark may still be needed for complex ML preprocessing
Real-Time Streaming	PARTIAL	Loading streaming data works; sub-second transformation may need stream processors (Flink, Spark Streaming)
HIPAA Data with PII	CAUTION	Must mask/de-identify PII before it lands — use ETL preprocessing or tokenization at extract stage

Extract

// PULLING DATA FROM THE SOURCE

The Extract step connects to your data sources and pulls records. This sounds simple but is where most pipeline complexity lives: authentication, rate limits, pagination, incremental vs full loads, schema drift, and API throttling are all extract-layer problems.

Full Extract

Pull the entire table or dataset every run. Simple but expensive for large tables. Use for small, frequently-changing reference tables.

Incremental Extract

Pull only records that changed since the last run. Uses a watermark (updated_at timestamp or auto-increment ID). Scales to large tables. More complex to implement correctly.

CDC (Change Data Capture)

Read the database transaction log (binlog, WAL) to capture every insert, update, and delete. Near-real-time. Used with tools like Debezium. No load on source database from queries.

Common Data Sources

Source Type	Examples	Extraction Method	Challenges
Relational DBs	Postgres, MySQL, SQL Server	SQL queries, CDC (Debezium), JDBC	Load on source DB, schema changes
REST APIs	Stripe, Salesforce, HubSpot, Shopify	HTTP GET with pagination, OAuth	Rate limits, pagination, auth token rotation
Event Streams	Kafka, Kinesis, Pub/Sub	Consumer groups, connectors	Ordering guarantees, duplicate events
Flat Files	CSV, JSON, Parquet, Excel	S3/GCS listeners, SFTP polling	Schema inconsistency, missing values, encoding
SaaS Apps	Google Analytics, Marketo, Zendesk	Managed connectors (Fivetran, Airbyte)	Vendor API changes, undocumented schemas
NoSQL / Documents	MongoDB, DynamoDB, Firestore	Dump + parse, CDC, change streams	Schema-less = inconsistent structures

Python — Incremental API Extract

# Example: extract orders from a REST API incrementally
import requests, json
from datetime import datetime, timedelta

def extract_orders(last_run_at: str) -> list[dict]:
    """Pull orders updated since last_run_at (ISO format)."""
    all_records = []
    page = 1

    while True:
        resp = requests.get(
            "https://api.mystore.com/v1/orders",
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={
                "updated_after": last_run_at,   # incremental watermark
                "page": page,
                "per_page": 100
            }
        )
        resp.raise_for_status()
        data = resp.json()

        all_records.extend(data["orders"])

        if not data["has_more"]:   # stop when no more pages
            break
        page += 1

    return all_records

# Run and store watermark for next incremental load
orders = extract_orders(last_run_at="2024-11-01T00:00:00Z")
print(f"Extracted {len(orders)} records")   # Extracted 1,247 records

▸

Managed connectors save weeks of work. Tools like Fivetran and Airbyte handle authentication, pagination, rate limiting, incremental syncing, and schema changes for 300+ sources out of the box. For standard SaaS sources, always use a managed connector unless you have a very specific reason not to.

Load

// LANDING RAW DATA IN THE WAREHOUSE

Loading is the act of writing extracted data into your destination. In ELT, you load raw, unmodified data into a dedicated raw schema. Speed and fidelity are the goals here — you are not cleaning anything yet.

Load Strategies

Append / Insert Only Simple

Each run appends new records. Historical runs are fully preserved. Great for event logs, audit tables, and immutable data like transactions. Requires deduplication in the transform layer.

Upsert (Merge) Common

Insert new records, update existing ones based on a primary key. Keeps the table current without duplicates. Most common for entity tables (users, products, accounts).

Full Refresh Reliable

Truncate and reload the entire table every run. Simple and avoids drift. Works well for small tables (<1M rows). High cost for large tables due to re-scanning everything.

Partitioned Load Scalable

Write data into date/time-based partitions. Enables efficient partition-pruning in queries. Load replaces only the relevant partition. Standard pattern for large event tables in BigQuery.

SQL — Upsert Pattern (Snowflake MERGE)

-- Load raw orders: upsert into raw layer using MERGE
-- Staging table: raw_stage.orders_stage (just-loaded batch)
-- Target table: raw.orders (accumulating raw records)

MERGE INTO raw.orders AS target
USING raw_stage.orders_stage AS source
    ON target.order_id = source.order_id   -- match on primary key

WHEN MATCHED THEN UPDATE SET
    target.status       = source.status,
    target.updated_at   = source.updated_at,
    target.total_amount = source.total_amount,
    target._loaded_at   = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN INSERT (
    order_id, customer_id, status, total_amount,
    created_at, updated_at, _loaded_at
)
VALUES (
    source.order_id, source.customer_id, source.status,
    source.total_amount, source.created_at, source.updated_at,
    CURRENT_TIMESTAMP()
);

-- Convention: prefix metadata cols with _ to distinguish from source cols
-- _loaded_at: when did this record arrive in the warehouse?
-- _extracted_at: when did the extractor pull it from source?

Raw Layer Conventions

ℹ

Raw schema design rules: Keep it boring. (1) One schema per source system: raw_stripe, raw_salesforce. (2) Never modify raw data — it is the source of truth. (3) Add metadata columns: _loaded_at, _source_file, _batch_id. (4) Store as-is, even nulls and broken values — those are data quality issues to solve in transform. (5) Grant read-only access to everyone, write access only to the pipeline.

Transform

// MAKING RAW DATA USEFUL

Transformation is where raw data becomes business-ready. In ELT, this happens entirely inside the warehouse using SQL — and almost always with a tool called dbt (data build tool). Transforms are organized into layers.

The Three-Layer Model

🗄️

Raw

raw_stripe.*

→

🔧

Staging

stg_stripe.*

→

🧱

Intermediate

int_orders.*

→

📊

Marts

fct_*, dim_*

→

📈

BI Tools

Looker, Tableau

Staging Layer

1:1 with raw tables. Light cleaning only: rename columns, cast types, filter deleted records. No business logic. One staging model per source table.

Rename amt → amount_usd
Cast VARCHAR → DATE
Exclude soft-deleted rows
Standardize booleans ('Y'/'N' → TRUE/FALSE)

Intermediate Layer

Business logic lives here. Join staging tables, compute derived fields, build reusable building blocks. Not exposed to analysts directly.

Join orders + customers
Compute order lifetime value
Resolve slowly-changing dimensions
Aggregate daily events

Marts Layer

Final, business-facing models organized by domain. Fact tables (events) and dimension tables (entities). This is what analysts and BI tools query.

fct_orders — one row per order
dim_customers — one row per customer
fct_revenue_daily — aggregated
mart_sales — domain-wide bundle

SQL (dbt) — Staging Model

-- models/staging/stg_stripe__charges.sql
-- Staging model: clean raw Stripe charges data

WITH source AS (
    SELECT * FROM {{ source('stripe', 'charges') }}
),

renamed AS (
    SELECT
        -- Primary key
        id                          AS charge_id,

        -- Relationships
        customer                    AS customer_id,
        invoice                     AS invoice_id,

        -- Amounts: Stripe stores cents, convert to dollars
        amount / 100.0              AS amount_usd,
        amount_refunded / 100.0    AS amount_refunded_usd,
        currency                    AS currency_code,

        -- Status
        status,
        captured                    AS is_captured,
        paid                        AS is_paid,
        refunded                    AS is_refunded,

        -- Timestamps
        TO_TIMESTAMP(created)        AS created_at,

        -- Metadata
        _loaded_at

    FROM source
    WHERE status != 'failed'    -- exclude failed charges from start
)

SELECT * FROM renamed

SQL (dbt) — Fact Table Model

-- models/marts/finance/fct_revenue.sql
-- Fact table: one row per revenue event

WITH charges AS (
    SELECT * FROM {{ ref('stg_stripe__charges') }}
    WHERE is_paid = TRUE
),

customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),

final AS (
    SELECT
        c.charge_id,
        c.customer_id,
        cust.customer_segment,
        cust.country_code,
        c.amount_usd,
        c.amount_refunded_usd,
        c.amount_usd - c.amount_refunded_usd  AS net_revenue_usd,
        c.currency_code,
        c.created_at,
        DATE_TRUNC('month', c.created_at)  AS revenue_month
    FROM charges c
    LEFT JOIN customers cust
        ON c.customer_id = cust.customer_id
)

SELECT * FROM final

Tools & Ecosystem

// THE MODERN DATA STACK

The modern ELT stack has a clear separation of concerns: ingestion tools (Extract + Load), a cloud warehouse (storage + compute), and a transformation framework (Transform). You pick one from each category.

Layer	Tool	What It Does	Best For
Extract + Load	Fivetran	Managed connectors for 300+ SaaS/DB sources. Fully automated, handles schema changes, incremental syncing, log-based CDC.	Teams that want zero-maintenance pipelines. Highest reliability but expensive.
Extract + Load	Airbyte	Open-source alternative to Fivetran with 350+ connectors. Can self-host or use Airbyte Cloud. More flexible, lower cost.	Teams comfortable with ops overhead who want flexibility and cost control.
Extract + Load	Stitch	Simple, lightweight SaaS ingestion. Good for getting started quickly at moderate scale.	Startups and smaller teams that don't need Fivetran's full feature set.
Warehouse	Snowflake	Cloud data warehouse. Separates storage and compute. Near-unlimited scale. Strong SQL support, time-travel, zero-copy cloning.	Enterprise and mid-market. Best-in-class features. Premium pricing.
Warehouse	BigQuery	Google's serverless data warehouse. Pay-per-query model. Native ML (BQML). Best integration with GCP ecosystem.	Google Cloud shops. Serverless simplicity. Often cheapest at high scale.
Warehouse	Redshift	Amazon's columnar warehouse. Deep AWS integration. Spectrum for S3 querying.	AWS-native stacks. Good for teams already invested in the AWS ecosystem.
Warehouse	Databricks	Lakehouse platform combining data lake + warehouse. Strong for ML workloads, Delta Lake format.	Data science-heavy teams that need both analytics and ML at scale.
Transform	dbt	The standard tool for SQL-based transformation. Models as .sql files, version control, tests, docs, lineage. Open-source core + dbt Cloud (managed).	Almost every modern data team. The de facto standard for ELT transforms.
Transform	Dataform	Google's transformation tool (acquired 2020). Similar to dbt, now part of BigQuery. SQLX syntax.	Teams fully on GCP/BigQuery who prefer native Google tooling.
Orchestration	Airflow	Open-source workflow orchestrator. Define DAGs in Python. Schedules and monitors pipeline runs. Industry standard.	Teams needing complex scheduling, dependencies, and custom operators.
Orchestration	Dagster / Prefect	Modern Airflow alternatives with better developer experience, observability, and asset-centric thinking.	New pipelines where developer UX and observability are priorities.

▸

The starter stack: If you're building your first ELT platform, start with Airbyte (free ingestion) + BigQuery (generous free tier) + dbt Core (free). This stack costs nearly nothing to start, scales to millions of rows, and is industry-standard. Graduate to Fivetran + Snowflake when you need managed reliability.

Real-World Examples

// ELT IN PRACTICE

The best way to understand ELT is to walk through complete scenarios. Below are three worked examples spanning common use cases: e-commerce analytics, SaaS metrics, and marketing attribution.

Example 1 — E-Commerce Revenue Pipeline

An e-commerce company wants to track daily revenue, refund rates, and customer lifetime value from Shopify and Stripe.

🛒

Shopify

Orders, Products

→

💳

Stripe

Payments, Refunds

→

📦

Fivetran

Extract + Load

→

❄️

Snowflake

Raw Schema

→

🔧

dbt

Staging → Marts

→

📊

Looker

Dashboards

SQL (dbt) — Customer Lifetime Value Model

-- models/marts/ecommerce/dim_customers_with_clv.sql
-- Builds a customer dimension enriched with lifetime value metrics

WITH customers AS (
    SELECT * FROM {{ ref('stg_shopify__customers') }}
),

orders AS (
    SELECT * FROM {{ ref('stg_shopify__orders') }}
    WHERE financial_status = 'paid'
),

-- Aggregate order history per customer
customer_orders AS (
    SELECT
        customer_id,
        COUNT(*)                      AS total_orders,
        SUM(total_price_usd)          AS total_revenue_usd,
        AVG(total_price_usd)          AS avg_order_value_usd,
        MIN(created_at)               AS first_order_at,
        MAX(created_at)               AS last_order_at,
        DATEDIFF('day', MIN(created_at), MAX(created_at))
                                       AS customer_lifespan_days
    FROM orders
    GROUP BY 1
)

SELECT
    c.customer_id,
    c.email,
    c.first_name,
    c.last_name,
    c.country_code,
    c.created_at AS customer_since,
    -- CLV metrics
    COALESCE(o.total_orders, 0)          AS lifetime_orders,
    COALESCE(o.total_revenue_usd, 0)     AS lifetime_revenue_usd,
    COALESCE(o.avg_order_value_usd, 0)   AS avg_order_value_usd,
    o.first_order_at,
    o.last_order_at,
    o.customer_lifespan_days,
    -- Segment customers by value
    CASE
        WHEN o.total_revenue_usd >= 1000  THEN 'High Value'
        WHEN o.total_revenue_usd >= 200   THEN 'Mid Value'
        WHEN o.total_revenue_usd > 0      THEN 'Low Value'
        ELSE 'No Purchase'
    END AS customer_segment
FROM customers c
LEFT JOIN customer_orders o ON c.customer_id = o.customer_id

Example 2 — SaaS Product Metrics (MRR & Churn)

SQL (dbt) — Monthly Recurring Revenue

-- models/marts/finance/fct_mrr.sql
-- Calculates MRR snapshots and flags new/expansion/contraction/churn

WITH subscriptions AS (
    SELECT * FROM {{ ref('stg_stripe__subscriptions') }}
),

-- Generate one row per subscription per month it was active
monthly_subs AS (
    SELECT
        subscription_id,
        customer_id,
        plan_name,
        monthly_amount_usd,
        DATE_TRUNC('month', month_date) AS snapshot_month
    FROM subscriptions,
    LATERAL FLATTEN(
        INPUT => ARRAY_GENERATE_RANGE(
            DATE_TRUNC('month', started_at),
            DATE_TRUNC('month', COALESCE(ended_at, CURRENT_DATE())),
            1, 'month'
        )
    ) month_date
    WHERE status != 'incomplete'
),

-- Compare to previous month to classify MRR movement
mrr_with_lag AS (
    SELECT
        *,
        LAG(monthly_amount_usd) OVER (
            PARTITION BY subscription_id
            ORDER BY snapshot_month
        ) AS prev_month_mrr
    FROM monthly_subs
)

SELECT
    snapshot_month,
    subscription_id,
    customer_id,
    plan_name,
    monthly_amount_usd                       AS mrr,
    CASE
        WHEN prev_month_mrr IS NULL             THEN 'new'
        WHEN monthly_amount_usd > prev_month_mrr  THEN 'expansion'
        WHEN monthly_amount_usd < prev_month_mrr  THEN 'contraction'
        ELSE                                       'retained'
    END AS mrr_type
FROM mrr_with_lag

Example 3 — Marketing Attribution

SQL (dbt) — Last-Touch Attribution

-- models/marts/marketing/fct_attribution_last_touch.sql
-- Credits revenue to the last marketing touchpoint before conversion

WITH conversions AS (
    SELECT user_id, order_id, revenue_usd, converted_at
    FROM {{ ref('fct_revenue') }}
),

touchpoints AS (
    SELECT user_id, channel, campaign, touched_at
    FROM {{ ref('stg_segment__events') }}
    WHERE event_type = 'ad_click'
),

-- Find the last touchpoint before each conversion
last_touch AS (
    SELECT
        c.order_id,
        c.user_id,
        c.revenue_usd,
        c.converted_at,
        t.channel,
        t.campaign,
        ROW_NUMBER() OVER (
            PARTITION BY c.order_id
            ORDER BY t.touched_at DESC   -- most recent touch first
        ) AS touch_rank
    FROM conversions c
    LEFT JOIN touchpoints t
        ON c.user_id = t.user_id
        AND t.touched_at < c.converted_at   -- only touchpoints BEFORE conversion
        AND t.touched_at >= DATEADD('day', -30, c.converted_at) -- 30-day window
)

SELECT
    order_id, user_id, revenue_usd, converted_at,
    COALESCE(channel, 'direct')    AS attributed_channel,
    COALESCE(campaign, '(none)')   AS attributed_campaign
FROM last_touch
WHERE touch_rank = 1  -- keep only the last touch per order

Design Patterns

// PROVEN APPROACHES FOR COMMON SCENARIOS

Slowly Changing Dimensions SCD Type 2

Track how entity attributes change over time. Instead of overwriting, add a new row with valid_from / valid_to timestamps. Query at a point-in-time by joining on the date range.

Example: a customer changes their country. SCD Type 2 preserves the old record so you can analyze what segment they were in at time of purchase.

Idempotent Pipelines Critical

Running your pipeline twice should produce the same result as running it once. Achieve this by: (1) using MERGE/upsert not raw INSERT, (2) partitioned overwrites, (3) deduplication keys. Without idempotency, re-runs double-count data.

Snapshot Tables dbt Built-in

dbt Snapshots automatically capture historical states of mutable tables. Point dbt at a source table, define a strategy (timestamp or check), and dbt handles SCD Type 2 logic for you with dbt snapshot.

Incremental Models Performance

For large tables, dbt incremental models only process new/changed rows on each run instead of rebuilding the full table. Reduces warehouse costs dramatically at scale. Add {{ config(materialized='incremental') }} to your model.

SQL (dbt) — Incremental Model Pattern

-- models/marts/fct_events.sql
-- Incremental model: only processes new events on each run

{{ config(
    materialized='incremental',
    unique_key='event_id',              -- dedup key for upsert
    on_schema_change='sync_all_columns'  -- handle new columns gracefully
) }}

WITH source AS (
    SELECT * FROM {{ ref('stg_segment__events') }}

    -- This block only runs on incremental runs (not full refresh)
    {% if is_incremental() %}
        WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
    {% endif %}
)

SELECT
    event_id,
    user_id,
    event_type,
    properties,
    occurred_at
FROM source

Data Quality

// TRUST YOUR PIPELINE

A pipeline that produces wrong data is worse than no pipeline. Data quality testing should be built into every layer — not bolted on later. dbt ships with a built-in testing framework that makes this easy.

dbt Generic Tests Built-in

not_null — column must never be null
unique — column values must be distinct
accepted_values — column must be in a defined set
relationships — foreign key must exist in referenced table

Custom & Advanced Tests Beyond Basics

Row count checks — did the table suddenly lose 50% of rows?
Freshness checks — is the most recent record too old?
Distribution tests — is revenue_usd within expected range?
Cross-model consistency — do order counts match between tables?

YAML — dbt Tests Configuration

# models/marts/ecommerce/schema.yml
version: 2

models:
  - name: fct_orders
    description: "One row per paid order"
    columns:
      - name: order_id
        description: "Unique identifier for each order"
        tests:
          - unique                  # no duplicate orders
          - not_null                # every order must have an ID

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id    # foreign key check

      - name: status
        tests:
          - accepted_values:
              values: ['pending', 'paid', 'refunded', 'cancelled']

      - name: total_amount_usd
        tests:
          - not_null
          - dbt_utils.accepted_range:   # custom range test (dbt_utils package)
              min_value: 0
              max_value: 100000      # flag suspiciously large amounts

Common Pitfalls

// MISTAKES EVERY TEAM MAKES

Not Testing Your Pipeline Danger

The #1 mistake. Untested pipelines silently produce wrong numbers. By the time you notice, six months of dashboards are wrong. Add dbt tests from day one — it takes 10 minutes and saves weeks.

Transforming in the Extract Step Anti-Pattern

In ELT, the extractor should be dumb — just move data. If you start adding business logic in your Airbyte custom transformations or Python extractor, you're rebuilding ETL and losing ELT's advantages.

Fan-Out Joins (Query Explosion) Watch Out

Joining a table on a non-unique key multiplies rows. Always verify join cardinality. Add a unique test on your staging model's primary key before building on top of it.

Ignoring Schema Drift Common

Source APIs add, rename, or remove columns without warning. Your pipeline breaks silently. Configure your ingestion tool to alert on schema changes, and add on_schema_change logic in dbt incremental models.

No Watermark / Non-Idempotent Loads Data Quality

If your load step uses raw INSERT without deduplication, re-running the pipeline duplicates data. Always design loads to be safe to re-run: MERGE, partitioned overwrites, or truncate-and-reload.

Monolithic "God" dbt Models Maintainability

A 400-line SQL model that does everything is impossible to debug or extend. Break complex logic into intermediate models with clear names. Each model should do one thing. The three-layer pattern (staging → intermediate → marts) enforces this.

⚠

Never query the raw layer in BI tools. Analysts and dashboards should only query from the marts layer. If raw data lands in a dashboard, the next schema change upstream breaks the dashboard. The raw layer is internal infrastructure — marts are the public API of your data platform.

Learning Roadmap

// YOUR PATH FROM BEGINNER TO PRACTITIONER

ELT is a skill that compounds. The table below maps a realistic learning path — start at your current level and work forward. Most people go from beginner to job-ready in 3–4 months of consistent practice.

Stage 01

Foundation

SQL basics — SELECT, JOIN, GROUP BY, window functions
Understand ETL vs ELT conceptually
Set up BigQuery free tier + connect Airbyte
Load a CSV and write your first query
Learn what a data warehouse is vs a transactional DB

Stage 02

Practitioner

Build dbt project: staging + one mart model
Write dbt tests (not_null, unique, accepted_values)
Understand incremental vs full-refresh models
Connect a real SaaS source via Airbyte/Fivetran
Schedule a pipeline with Airflow or Dagster

Stage 03

Intermediate

Implement SCD Type 2 with dbt Snapshots
Build a full three-layer dbt project (stg → int → mart)
Understand fact vs dimension table patterns
Add custom data quality tests + alerting
Profile and optimize slow warehouse queries

Stage 04

Advanced

Design a multi-source analytical data platform
Implement CDC with Debezium + Kafka
Build a data contract between teams
Deploy dbt in CI/CD with PR-based testing
Data observability with Monte Carlo or Elementary

▸

Resources to go deeper: docs.getdbt.com (the best free data engineering education) · airbyte.com/tutorials (connector walkthroughs) · docs.snowflake.com · Fundamentals of Data Engineering by Joe Reis & Matt Housley (book) · The dbt Viewpoint (dbt Labs blog) · github.com/dbt-labs/jaffle-shop (canonical beginner dbt project to clone and explore).

Resource	Type	Best For
dbt Learn (learn.getdbt.com)	Free online course	Best structured intro to dbt and ELT patterns
Jaffle Shop (dbt Labs)	GitHub project	Reference dbt project to clone and learn from
BigQuery Sandbox	Free cloud service	Practice SQL and dbt with no cost for first 10GB/month
Airbyte Quickstart	Tutorial	Set up your first ELT pipeline in an afternoon
Fundamentals of Data Engineering	Book (O'Reilly)	Comprehensive overview of the full data engineering landscape
dbt Slack Community	Community	Best place to get unstuck, 50,000+ members

ELTUnpacked

What is ELT?

ETL vs ELT

When to Use ELT

ELT by Use Case

Extract

Common Data Sources

Load

Load Strategies

Raw Layer Conventions

Transform

The Three-Layer Model

Tools & Ecosystem

Real-World Examples

Example 1 — E-Commerce Revenue Pipeline

Example 2 — SaaS Product Metrics (MRR & Churn)

Example 3 — Marketing Attribution

Design Patterns

Data Quality

Common Pitfalls

Learning Roadmap

ELT
Unpacked