??????? ELT Handbook — Extract, Load, Transform
Field Guide
ELT — Extract, Load, Transform
DOC-ELT-001 Beginner → Intermediate
Data Engineering Field Guide

ELT
Unpacked

// Extract · Load · Transform

A practical guide to the modern data pipeline pattern powering analytics, machine learning, and business intelligence at scale — written for engineers at every level.

Cloud-Native Data Warehousing dbt & SQL Beginner Friendly
01

What is ELT?

// THE MODERN DATA PIPELINE PARADIGM

ELT stands for Extract, Load, Transform. It is a data integration pattern where raw data is first pulled from source systems, loaded directly into a destination (usually a cloud data warehouse), and then transformed in-place using the warehouse's own compute power.

Think of it like this: instead of cooking your ingredients before putting them in the fridge, you store everything raw and cook on demand. The warehouse is your industrial kitchen — it handles the heavy lifting.

E
Extract
Pull raw data from
databases, APIs,
SaaS tools, files
L
Load
Write raw data
directly into the
data warehouse
T
Transform
Clean, model & shape
data using SQL
inside the warehouse
Raw Data First

Nothing is thrown away. Raw data lands in the warehouse exactly as it arrived. You can always re-transform if requirements change.

Warehouse as Engine

Transformation logic runs inside BigQuery, Snowflake, or Redshift — leveraging massively parallel compute you'd never get on a single server.

SQL as the Language

Analysts and engineers both speak SQL. ELT democratizes data transformation — you don't need Spark or Java to build production-grade models.

ELT vs ETL in one sentence: ETL transforms data before it reaches the destination. ELT transforms data after it lands. This seemingly small difference changes everything about how you architect, maintain, and scale a data platform.
02

ETL vs ELT

// UNDERSTANDING THE SHIFT

The traditional ETL pattern dominated data engineering for decades. It was born in an era when storage was expensive, cloud warehouses didn't exist, and transformation had to happen before data touched the destination. Cloud changed everything.

✕ ETL (Legacy Pattern)
  • Transform happens outside the warehouse (Spark, SSIS, Informatica)
  • Only clean, pre-modeled data enters the warehouse
  • Raw data often discarded — no historical re-processing
  • Changes to logic require re-running entire pipeline
  • Transformation code in Python, Java, Scala
  • Requires specialized engineering to maintain
  • Works well with on-premise databases and limited storage
✓ ELT (Modern Pattern)
  • Transform happens inside the warehouse using SQL
  • Raw data is preserved — full audit trail and replay
  • Change logic, re-run transform only — no re-extraction
  • SQL democratizes who can write transformations
  • Storage is cheap — keep everything raw
  • dbt, Dataform, or plain SQL handles transformation layer
  • Purpose-built for cloud warehouses (Snowflake, BQ, Redshift)
DimensionETLELT
Where does transform run?External server or middlewareInside the data warehouse
Raw data preserved?Usually notYes — always
Primary languagePython, Java, ScalaSQL (+ optional Python)
LatencyHigher — data processed before landingLower — load fast, transform later
Best forSensitive data needing masking before storage, complex ML pipelinesAnalytics, BI, reporting, most data teams
Typical toolsInformatica, SSIS, Spark, AWS GlueFivetran, Airbyte + dbt + Snowflake/BQ
Team skill requirementData engineers (Scala/Python)Data/analytics engineers (SQL)
Neither is "better" universally. ELT is the right default for modern analytics teams. ETL still wins when: data must be masked before storage for compliance (HIPAA, GDPR PII scrubbing), you're ingesting into a system without SQL-based compute, or your transformations are too complex for SQL alone.
03

When to Use ELT

// DECISION GUIDE

ELT is the right choice for the vast majority of modern data teams. Use this guide to decide. If you're building a new analytics platform from scratch on a cloud warehouse, ELT is your default answer.

✓ Choose ELT When Use This
  • You're using a cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks)
  • Your team knows SQL better than Python/Scala
  • You need an audit trail of raw data for debugging or re-processing
  • Your transformation logic changes frequently
  • You want analysts to write their own data models
  • You're building a modern analytics stack from scratch
  • Storage cost is not a constraint (cloud storage is cheap)
  • You need fast initial data availability before transform is complete
✗ Reconsider ELT When Evaluate
  • PII or sensitive data must be masked before reaching the warehouse
  • Destination system lacks SQL compute (e.g., flat file store, old RDBMS)
  • Your transformations require complex ML or non-SQL logic as a first step
  • Extreme low-latency requirements (sub-second streaming with complex transforms)
  • Regulatory requirements mandate data never lands raw
  • Your data volumes are tiny and a simple script is more pragmatic

ELT by Use Case

Use CaseELT Fit?Why
Business Intelligence / DashboardsIDEALAnalytics queries run against transformed warehouse models — exactly what ELT produces
Marketing AttributionIDEALMultiple SaaS sources (Google Ads, Salesforce, HubSpot) unified in warehouse via ELT
Financial ReportingSTRONGRaw transaction data preserved, transformations versioned and auditable
ML Feature EngineeringPARTIALELT for feature tables; Python/Spark may still be needed for complex ML preprocessing
Real-Time StreamingPARTIALLoading streaming data works; sub-second transformation may need stream processors (Flink, Spark Streaming)
HIPAA Data with PIICAUTIONMust mask/de-identify PII before it lands — use ETL preprocessing or tokenization at extract stage
04

Extract

// PULLING DATA FROM THE SOURCE

The Extract step connects to your data sources and pulls records. This sounds simple but is where most pipeline complexity lives: authentication, rate limits, pagination, incremental vs full loads, schema drift, and API throttling are all extract-layer problems.

Full Extract

Pull the entire table or dataset every run. Simple but expensive for large tables. Use for small, frequently-changing reference tables.

Incremental Extract

Pull only records that changed since the last run. Uses a watermark (updated_at timestamp or auto-increment ID). Scales to large tables. More complex to implement correctly.

CDC (Change Data Capture)

Read the database transaction log (binlog, WAL) to capture every insert, update, and delete. Near-real-time. Used with tools like Debezium. No load on source database from queries.

Common Data Sources

Source TypeExamplesExtraction MethodChallenges
Relational DBsPostgres, MySQL, SQL ServerSQL queries, CDC (Debezium), JDBCLoad on source DB, schema changes
REST APIsStripe, Salesforce, HubSpot, ShopifyHTTP GET with pagination, OAuthRate limits, pagination, auth token rotation
Event StreamsKafka, Kinesis, Pub/SubConsumer groups, connectorsOrdering guarantees, duplicate events
Flat FilesCSV, JSON, Parquet, ExcelS3/GCS listeners, SFTP pollingSchema inconsistency, missing values, encoding
SaaS AppsGoogle Analytics, Marketo, ZendeskManaged connectors (Fivetran, Airbyte)Vendor API changes, undocumented schemas
NoSQL / DocumentsMongoDB, DynamoDB, FirestoreDump + parse, CDC, change streamsSchema-less = inconsistent structures
Python — Incremental API Extract
# Example: extract orders from a REST API incrementally import requests, json from datetime import datetime, timedelta def extract_orders(last_run_at: str) -> list[dict]: """Pull orders updated since last_run_at (ISO format).""" all_records = [] page = 1 while True: resp = requests.get( "https://api.mystore.com/v1/orders", headers={"Authorization": f"Bearer {API_KEY}"}, params={ "updated_after": last_run_at, # incremental watermark "page": page, "per_page": 100 } ) resp.raise_for_status() data = resp.json() all_records.extend(data["orders"]) if not data["has_more"]: # stop when no more pages break page += 1 return all_records # Run and store watermark for next incremental load orders = extract_orders(last_run_at="2024-11-01T00:00:00Z") print(f"Extracted {len(orders)} records") # Extracted 1,247 records
Managed connectors save weeks of work. Tools like Fivetran and Airbyte handle authentication, pagination, rate limiting, incremental syncing, and schema changes for 300+ sources out of the box. For standard SaaS sources, always use a managed connector unless you have a very specific reason not to.
05

Load

// LANDING RAW DATA IN THE WAREHOUSE

Loading is the act of writing extracted data into your destination. In ELT, you load raw, unmodified data into a dedicated raw schema. Speed and fidelity are the goals here — you are not cleaning anything yet.

Load Strategies

Append / Insert Only Simple

Each run appends new records. Historical runs are fully preserved. Great for event logs, audit tables, and immutable data like transactions. Requires deduplication in the transform layer.

Upsert (Merge) Common

Insert new records, update existing ones based on a primary key. Keeps the table current without duplicates. Most common for entity tables (users, products, accounts).

Full Refresh Reliable

Truncate and reload the entire table every run. Simple and avoids drift. Works well for small tables (<1M rows). High cost for large tables due to re-scanning everything.

Partitioned Load Scalable

Write data into date/time-based partitions. Enables efficient partition-pruning in queries. Load replaces only the relevant partition. Standard pattern for large event tables in BigQuery.

SQL — Upsert Pattern (Snowflake MERGE)
-- Load raw orders: upsert into raw layer using MERGE -- Staging table: raw_stage.orders_stage (just-loaded batch) -- Target table: raw.orders (accumulating raw records) MERGE INTO raw.orders AS target USING raw_stage.orders_stage AS source ON target.order_id = source.order_id -- match on primary key WHEN MATCHED THEN UPDATE SET target.status = source.status, target.updated_at = source.updated_at, target.total_amount = source.total_amount, target._loaded_at = CURRENT_TIMESTAMP() WHEN NOT MATCHED THEN INSERT ( order_id, customer_id, status, total_amount, created_at, updated_at, _loaded_at ) VALUES ( source.order_id, source.customer_id, source.status, source.total_amount, source.created_at, source.updated_at, CURRENT_TIMESTAMP() ); -- Convention: prefix metadata cols with _ to distinguish from source cols -- _loaded_at: when did this record arrive in the warehouse? -- _extracted_at: when did the extractor pull it from source?

Raw Layer Conventions

Raw schema design rules: Keep it boring. (1) One schema per source system: raw_stripe, raw_salesforce. (2) Never modify raw data — it is the source of truth. (3) Add metadata columns: _loaded_at, _source_file, _batch_id. (4) Store as-is, even nulls and broken values — those are data quality issues to solve in transform. (5) Grant read-only access to everyone, write access only to the pipeline.
06

Transform

// MAKING RAW DATA USEFUL

Transformation is where raw data becomes business-ready. In ELT, this happens entirely inside the warehouse using SQL — and almost always with a tool called dbt (data build tool). Transforms are organized into layers.

The Three-Layer Model

🗄️
Raw
raw_stripe.*
🔧
Staging
stg_stripe.*
🧱
Intermediate
int_orders.*
📊
Marts
fct_*, dim_*
📈
BI Tools
Looker, Tableau
Staging Layer

1:1 with raw tables. Light cleaning only: rename columns, cast types, filter deleted records. No business logic. One staging model per source table.

  • Rename amtamount_usd
  • Cast VARCHARDATE
  • Exclude soft-deleted rows
  • Standardize booleans ('Y'/'N'TRUE/FALSE)
Intermediate Layer

Business logic lives here. Join staging tables, compute derived fields, build reusable building blocks. Not exposed to analysts directly.

  • Join orders + customers
  • Compute order lifetime value
  • Resolve slowly-changing dimensions
  • Aggregate daily events
Marts Layer

Final, business-facing models organized by domain. Fact tables (events) and dimension tables (entities). This is what analysts and BI tools query.

  • fct_orders — one row per order
  • dim_customers — one row per customer
  • fct_revenue_daily — aggregated
  • mart_sales — domain-wide bundle
SQL (dbt) — Staging Model
-- models/staging/stg_stripe__charges.sql -- Staging model: clean raw Stripe charges data WITH source AS ( SELECT * FROM {{ source('stripe', 'charges') }} ), renamed AS ( SELECT -- Primary key id AS charge_id, -- Relationships customer AS customer_id, invoice AS invoice_id, -- Amounts: Stripe stores cents, convert to dollars amount / 100.0 AS amount_usd, amount_refunded / 100.0 AS amount_refunded_usd, currency AS currency_code, -- Status status, captured AS is_captured, paid AS is_paid, refunded AS is_refunded, -- Timestamps TO_TIMESTAMP(created) AS created_at, -- Metadata _loaded_at FROM source WHERE status != 'failed' -- exclude failed charges from start ) SELECT * FROM renamed
SQL (dbt) — Fact Table Model
-- models/marts/finance/fct_revenue.sql -- Fact table: one row per revenue event WITH charges AS ( SELECT * FROM {{ ref('stg_stripe__charges') }} WHERE is_paid = TRUE ), customers AS ( SELECT * FROM {{ ref('dim_customers') }} ), final AS ( SELECT c.charge_id, c.customer_id, cust.customer_segment, cust.country_code, c.amount_usd, c.amount_refunded_usd, c.amount_usd - c.amount_refunded_usd AS net_revenue_usd, c.currency_code, c.created_at, DATE_TRUNC('month', c.created_at) AS revenue_month FROM charges c LEFT JOIN customers cust ON c.customer_id = cust.customer_id ) SELECT * FROM final
07

Tools & Ecosystem

// THE MODERN DATA STACK

The modern ELT stack has a clear separation of concerns: ingestion tools (Extract + Load), a cloud warehouse (storage + compute), and a transformation framework (Transform). You pick one from each category.

LayerToolWhat It DoesBest For
Extract + Load Fivetran Managed connectors for 300+ SaaS/DB sources. Fully automated, handles schema changes, incremental syncing, log-based CDC. Teams that want zero-maintenance pipelines. Highest reliability but expensive.
Extract + Load Airbyte Open-source alternative to Fivetran with 350+ connectors. Can self-host or use Airbyte Cloud. More flexible, lower cost. Teams comfortable with ops overhead who want flexibility and cost control.
Extract + Load Stitch Simple, lightweight SaaS ingestion. Good for getting started quickly at moderate scale. Startups and smaller teams that don't need Fivetran's full feature set.
Warehouse Snowflake Cloud data warehouse. Separates storage and compute. Near-unlimited scale. Strong SQL support, time-travel, zero-copy cloning. Enterprise and mid-market. Best-in-class features. Premium pricing.
Warehouse BigQuery Google's serverless data warehouse. Pay-per-query model. Native ML (BQML). Best integration with GCP ecosystem. Google Cloud shops. Serverless simplicity. Often cheapest at high scale.
Warehouse Redshift Amazon's columnar warehouse. Deep AWS integration. Spectrum for S3 querying. AWS-native stacks. Good for teams already invested in the AWS ecosystem.
Warehouse Databricks Lakehouse platform combining data lake + warehouse. Strong for ML workloads, Delta Lake format. Data science-heavy teams that need both analytics and ML at scale.
Transform dbt The standard tool for SQL-based transformation. Models as .sql files, version control, tests, docs, lineage. Open-source core + dbt Cloud (managed). Almost every modern data team. The de facto standard for ELT transforms.
Transform Dataform Google's transformation tool (acquired 2020). Similar to dbt, now part of BigQuery. SQLX syntax. Teams fully on GCP/BigQuery who prefer native Google tooling.
Orchestration Airflow Open-source workflow orchestrator. Define DAGs in Python. Schedules and monitors pipeline runs. Industry standard. Teams needing complex scheduling, dependencies, and custom operators.
Orchestration Dagster / Prefect Modern Airflow alternatives with better developer experience, observability, and asset-centric thinking. New pipelines where developer UX and observability are priorities.
The starter stack: If you're building your first ELT platform, start with Airbyte (free ingestion) + BigQuery (generous free tier) + dbt Core (free). This stack costs nearly nothing to start, scales to millions of rows, and is industry-standard. Graduate to Fivetran + Snowflake when you need managed reliability.
08

Real-World Examples

// ELT IN PRACTICE

The best way to understand ELT is to walk through complete scenarios. Below are three worked examples spanning common use cases: e-commerce analytics, SaaS metrics, and marketing attribution.

Example 1 — E-Commerce Revenue Pipeline

An e-commerce company wants to track daily revenue, refund rates, and customer lifetime value from Shopify and Stripe.

🛒
Shopify
Orders, Products
💳
Stripe
Payments, Refunds
📦
Fivetran
Extract + Load
❄️
Snowflake
Raw Schema
🔧
dbt
Staging → Marts
📊
Looker
Dashboards
SQL (dbt) — Customer Lifetime Value Model
-- models/marts/ecommerce/dim_customers_with_clv.sql -- Builds a customer dimension enriched with lifetime value metrics WITH customers AS ( SELECT * FROM {{ ref('stg_shopify__customers') }} ), orders AS ( SELECT * FROM {{ ref('stg_shopify__orders') }} WHERE financial_status = 'paid' ), -- Aggregate order history per customer customer_orders AS ( SELECT customer_id, COUNT(*) AS total_orders, SUM(total_price_usd) AS total_revenue_usd, AVG(total_price_usd) AS avg_order_value_usd, MIN(created_at) AS first_order_at, MAX(created_at) AS last_order_at, DATEDIFF('day', MIN(created_at), MAX(created_at)) AS customer_lifespan_days FROM orders GROUP BY 1 ) SELECT c.customer_id, c.email, c.first_name, c.last_name, c.country_code, c.created_at AS customer_since, -- CLV metrics COALESCE(o.total_orders, 0) AS lifetime_orders, COALESCE(o.total_revenue_usd, 0) AS lifetime_revenue_usd, COALESCE(o.avg_order_value_usd, 0) AS avg_order_value_usd, o.first_order_at, o.last_order_at, o.customer_lifespan_days, -- Segment customers by value CASE WHEN o.total_revenue_usd >= 1000 THEN 'High Value' WHEN o.total_revenue_usd >= 200 THEN 'Mid Value' WHEN o.total_revenue_usd > 0 THEN 'Low Value' ELSE 'No Purchase' END AS customer_segment FROM customers c LEFT JOIN customer_orders o ON c.customer_id = o.customer_id

Example 2 — SaaS Product Metrics (MRR & Churn)

SQL (dbt) — Monthly Recurring Revenue
-- models/marts/finance/fct_mrr.sql -- Calculates MRR snapshots and flags new/expansion/contraction/churn WITH subscriptions AS ( SELECT * FROM {{ ref('stg_stripe__subscriptions') }} ), -- Generate one row per subscription per month it was active monthly_subs AS ( SELECT subscription_id, customer_id, plan_name, monthly_amount_usd, DATE_TRUNC('month', month_date) AS snapshot_month FROM subscriptions, LATERAL FLATTEN( INPUT => ARRAY_GENERATE_RANGE( DATE_TRUNC('month', started_at), DATE_TRUNC('month', COALESCE(ended_at, CURRENT_DATE())), 1, 'month' ) ) month_date WHERE status != 'incomplete' ), -- Compare to previous month to classify MRR movement mrr_with_lag AS ( SELECT *, LAG(monthly_amount_usd) OVER ( PARTITION BY subscription_id ORDER BY snapshot_month ) AS prev_month_mrr FROM monthly_subs ) SELECT snapshot_month, subscription_id, customer_id, plan_name, monthly_amount_usd AS mrr, CASE WHEN prev_month_mrr IS NULL THEN 'new' WHEN monthly_amount_usd > prev_month_mrr THEN 'expansion' WHEN monthly_amount_usd < prev_month_mrr THEN 'contraction' ELSE 'retained' END AS mrr_type FROM mrr_with_lag

Example 3 — Marketing Attribution

SQL (dbt) — Last-Touch Attribution
-- models/marts/marketing/fct_attribution_last_touch.sql -- Credits revenue to the last marketing touchpoint before conversion WITH conversions AS ( SELECT user_id, order_id, revenue_usd, converted_at FROM {{ ref('fct_revenue') }} ), touchpoints AS ( SELECT user_id, channel, campaign, touched_at FROM {{ ref('stg_segment__events') }} WHERE event_type = 'ad_click' ), -- Find the last touchpoint before each conversion last_touch AS ( SELECT c.order_id, c.user_id, c.revenue_usd, c.converted_at, t.channel, t.campaign, ROW_NUMBER() OVER ( PARTITION BY c.order_id ORDER BY t.touched_at DESC -- most recent touch first ) AS touch_rank FROM conversions c LEFT JOIN touchpoints t ON c.user_id = t.user_id AND t.touched_at < c.converted_at -- only touchpoints BEFORE conversion AND t.touched_at >= DATEADD('day', -30, c.converted_at) -- 30-day window ) SELECT order_id, user_id, revenue_usd, converted_at, COALESCE(channel, 'direct') AS attributed_channel, COALESCE(campaign, '(none)') AS attributed_campaign FROM last_touch WHERE touch_rank = 1 -- keep only the last touch per order
09

Design Patterns

// PROVEN APPROACHES FOR COMMON SCENARIOS
Slowly Changing Dimensions SCD Type 2

Track how entity attributes change over time. Instead of overwriting, add a new row with valid_from / valid_to timestamps. Query at a point-in-time by joining on the date range.

Example: a customer changes their country. SCD Type 2 preserves the old record so you can analyze what segment they were in at time of purchase.

Idempotent Pipelines Critical

Running your pipeline twice should produce the same result as running it once. Achieve this by: (1) using MERGE/upsert not raw INSERT, (2) partitioned overwrites, (3) deduplication keys. Without idempotency, re-runs double-count data.

Snapshot Tables dbt Built-in

dbt Snapshots automatically capture historical states of mutable tables. Point dbt at a source table, define a strategy (timestamp or check), and dbt handles SCD Type 2 logic for you with dbt snapshot.

Incremental Models Performance

For large tables, dbt incremental models only process new/changed rows on each run instead of rebuilding the full table. Reduces warehouse costs dramatically at scale. Add {{ config(materialized='incremental') }} to your model.

SQL (dbt) — Incremental Model Pattern
-- models/marts/fct_events.sql -- Incremental model: only processes new events on each run {{ config( materialized='incremental', unique_key='event_id', -- dedup key for upsert on_schema_change='sync_all_columns' -- handle new columns gracefully ) }} WITH source AS ( SELECT * FROM {{ ref('stg_segment__events') }} -- This block only runs on incremental runs (not full refresh) {% if is_incremental() %} WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }}) {% endif %} ) SELECT event_id, user_id, event_type, properties, occurred_at FROM source
10

Data Quality

// TRUST YOUR PIPELINE

A pipeline that produces wrong data is worse than no pipeline. Data quality testing should be built into every layer — not bolted on later. dbt ships with a built-in testing framework that makes this easy.

dbt Generic Tests Built-in
  • not_null — column must never be null
  • unique — column values must be distinct
  • accepted_values — column must be in a defined set
  • relationships — foreign key must exist in referenced table
Custom & Advanced Tests Beyond Basics
  • Row count checks — did the table suddenly lose 50% of rows?
  • Freshness checks — is the most recent record too old?
  • Distribution tests — is revenue_usd within expected range?
  • Cross-model consistency — do order counts match between tables?
YAML — dbt Tests Configuration
# models/marts/ecommerce/schema.yml version: 2 models: - name: fct_orders description: "One row per paid order" columns: - name: order_id description: "Unique identifier for each order" tests: - unique # no duplicate orders - not_null # every order must have an ID - name: customer_id tests: - not_null - relationships: to: ref('dim_customers') field: customer_id # foreign key check - name: status tests: - accepted_values: values: ['pending', 'paid', 'refunded', 'cancelled'] - name: total_amount_usd tests: - not_null - dbt_utils.accepted_range: # custom range test (dbt_utils package) min_value: 0 max_value: 100000 # flag suspiciously large amounts
11

Common Pitfalls

// MISTAKES EVERY TEAM MAKES
Not Testing Your Pipeline Danger

The #1 mistake. Untested pipelines silently produce wrong numbers. By the time you notice, six months of dashboards are wrong. Add dbt tests from day one — it takes 10 minutes and saves weeks.

Transforming in the Extract Step Anti-Pattern

In ELT, the extractor should be dumb — just move data. If you start adding business logic in your Airbyte custom transformations or Python extractor, you're rebuilding ETL and losing ELT's advantages.

Fan-Out Joins (Query Explosion) Watch Out

Joining a table on a non-unique key multiplies rows. Always verify join cardinality. Add a unique test on your staging model's primary key before building on top of it.

Ignoring Schema Drift Common

Source APIs add, rename, or remove columns without warning. Your pipeline breaks silently. Configure your ingestion tool to alert on schema changes, and add on_schema_change logic in dbt incremental models.

No Watermark / Non-Idempotent Loads Data Quality

If your load step uses raw INSERT without deduplication, re-running the pipeline duplicates data. Always design loads to be safe to re-run: MERGE, partitioned overwrites, or truncate-and-reload.

Monolithic "God" dbt Models Maintainability

A 400-line SQL model that does everything is impossible to debug or extend. Break complex logic into intermediate models with clear names. Each model should do one thing. The three-layer pattern (staging → intermediate → marts) enforces this.

Never query the raw layer in BI tools. Analysts and dashboards should only query from the marts layer. If raw data lands in a dashboard, the next schema change upstream breaks the dashboard. The raw layer is internal infrastructure — marts are the public API of your data platform.
12

Learning Roadmap

// YOUR PATH FROM BEGINNER TO PRACTITIONER

ELT is a skill that compounds. The table below maps a realistic learning path — start at your current level and work forward. Most people go from beginner to job-ready in 3–4 months of consistent practice.

Stage 01
Foundation
  • SQL basics — SELECT, JOIN, GROUP BY, window functions
  • Understand ETL vs ELT conceptually
  • Set up BigQuery free tier + connect Airbyte
  • Load a CSV and write your first query
  • Learn what a data warehouse is vs a transactional DB
Stage 02
Practitioner
  • Build dbt project: staging + one mart model
  • Write dbt tests (not_null, unique, accepted_values)
  • Understand incremental vs full-refresh models
  • Connect a real SaaS source via Airbyte/Fivetran
  • Schedule a pipeline with Airflow or Dagster
Stage 03
Intermediate
  • Implement SCD Type 2 with dbt Snapshots
  • Build a full three-layer dbt project (stg → int → mart)
  • Understand fact vs dimension table patterns
  • Add custom data quality tests + alerting
  • Profile and optimize slow warehouse queries
Stage 04
Advanced
  • Design a multi-source analytical data platform
  • Implement CDC with Debezium + Kafka
  • Build a data contract between teams
  • Deploy dbt in CI/CD with PR-based testing
  • Data observability with Monte Carlo or Elementary
Resources to go deeper: docs.getdbt.com (the best free data engineering education) · airbyte.com/tutorials (connector walkthroughs) · docs.snowflake.com · Fundamentals of Data Engineering by Joe Reis & Matt Housley (book) · The dbt Viewpoint (dbt Labs blog) · github.com/dbt-labs/jaffle-shop (canonical beginner dbt project to clone and explore).
ResourceTypeBest For
dbt Learn (learn.getdbt.com)Free online courseBest structured intro to dbt and ELT patterns
Jaffle Shop (dbt Labs)GitHub projectReference dbt project to clone and learn from
BigQuery SandboxFree cloud servicePractice SQL and dbt with no cost for first 10GB/month
Airbyte QuickstartTutorialSet up your first ELT pipeline in an afternoon
Fundamentals of Data EngineeringBook (O'Reilly)Comprehensive overview of the full data engineering landscape
dbt Slack CommunityCommunityBest place to get unstuck, 50,000+ members