ELT
Unpacked
A practical guide to the modern data pipeline pattern powering analytics, machine learning, and business intelligence at scale — written for engineers at every level.
What is ELT?
ELT stands for Extract, Load, Transform. It is a data integration pattern where raw data is first pulled from source systems, loaded directly into a destination (usually a cloud data warehouse), and then transformed in-place using the warehouse's own compute power.
Think of it like this: instead of cooking your ingredients before putting them in the fridge, you store everything raw and cook on demand. The warehouse is your industrial kitchen — it handles the heavy lifting.
databases, APIs,
SaaS tools, files
directly into the
data warehouse
data using SQL
inside the warehouse
Nothing is thrown away. Raw data lands in the warehouse exactly as it arrived. You can always re-transform if requirements change.
Transformation logic runs inside BigQuery, Snowflake, or Redshift — leveraging massively parallel compute you'd never get on a single server.
Analysts and engineers both speak SQL. ELT democratizes data transformation — you don't need Spark or Java to build production-grade models.
ETL vs ELT
The traditional ETL pattern dominated data engineering for decades. It was born in an era when storage was expensive, cloud warehouses didn't exist, and transformation had to happen before data touched the destination. Cloud changed everything.
- Transform happens outside the warehouse (Spark, SSIS, Informatica)
- Only clean, pre-modeled data enters the warehouse
- Raw data often discarded — no historical re-processing
- Changes to logic require re-running entire pipeline
- Transformation code in Python, Java, Scala
- Requires specialized engineering to maintain
- Works well with on-premise databases and limited storage
- Transform happens inside the warehouse using SQL
- Raw data is preserved — full audit trail and replay
- Change logic, re-run transform only — no re-extraction
- SQL democratizes who can write transformations
- Storage is cheap — keep everything raw
- dbt, Dataform, or plain SQL handles transformation layer
- Purpose-built for cloud warehouses (Snowflake, BQ, Redshift)
| Dimension | ETL | ELT |
|---|---|---|
| Where does transform run? | External server or middleware | Inside the data warehouse |
| Raw data preserved? | Usually not | Yes — always |
| Primary language | Python, Java, Scala | SQL (+ optional Python) |
| Latency | Higher — data processed before landing | Lower — load fast, transform later |
| Best for | Sensitive data needing masking before storage, complex ML pipelines | Analytics, BI, reporting, most data teams |
| Typical tools | Informatica, SSIS, Spark, AWS Glue | Fivetran, Airbyte + dbt + Snowflake/BQ |
| Team skill requirement | Data engineers (Scala/Python) | Data/analytics engineers (SQL) |
When to Use ELT
ELT is the right choice for the vast majority of modern data teams. Use this guide to decide. If you're building a new analytics platform from scratch on a cloud warehouse, ELT is your default answer.
- You're using a cloud data warehouse (BigQuery, Snowflake, Redshift, Databricks)
- Your team knows SQL better than Python/Scala
- You need an audit trail of raw data for debugging or re-processing
- Your transformation logic changes frequently
- You want analysts to write their own data models
- You're building a modern analytics stack from scratch
- Storage cost is not a constraint (cloud storage is cheap)
- You need fast initial data availability before transform is complete
- PII or sensitive data must be masked before reaching the warehouse
- Destination system lacks SQL compute (e.g., flat file store, old RDBMS)
- Your transformations require complex ML or non-SQL logic as a first step
- Extreme low-latency requirements (sub-second streaming with complex transforms)
- Regulatory requirements mandate data never lands raw
- Your data volumes are tiny and a simple script is more pragmatic
ELT by Use Case
| Use Case | ELT Fit? | Why |
|---|---|---|
| Business Intelligence / Dashboards | IDEAL | Analytics queries run against transformed warehouse models — exactly what ELT produces |
| Marketing Attribution | IDEAL | Multiple SaaS sources (Google Ads, Salesforce, HubSpot) unified in warehouse via ELT |
| Financial Reporting | STRONG | Raw transaction data preserved, transformations versioned and auditable |
| ML Feature Engineering | PARTIAL | ELT for feature tables; Python/Spark may still be needed for complex ML preprocessing |
| Real-Time Streaming | PARTIAL | Loading streaming data works; sub-second transformation may need stream processors (Flink, Spark Streaming) |
| HIPAA Data with PII | CAUTION | Must mask/de-identify PII before it lands — use ETL preprocessing or tokenization at extract stage |
Extract
The Extract step connects to your data sources and pulls records. This sounds simple but is where most pipeline complexity lives: authentication, rate limits, pagination, incremental vs full loads, schema drift, and API throttling are all extract-layer problems.
Pull the entire table or dataset every run. Simple but expensive for large tables. Use for small, frequently-changing reference tables.
Pull only records that changed since the last run. Uses a watermark (updated_at timestamp or auto-increment ID). Scales to large tables. More complex to implement correctly.
Read the database transaction log (binlog, WAL) to capture every insert, update, and delete. Near-real-time. Used with tools like Debezium. No load on source database from queries.
Common Data Sources
| Source Type | Examples | Extraction Method | Challenges |
|---|---|---|---|
| Relational DBs | Postgres, MySQL, SQL Server | SQL queries, CDC (Debezium), JDBC | Load on source DB, schema changes |
| REST APIs | Stripe, Salesforce, HubSpot, Shopify | HTTP GET with pagination, OAuth | Rate limits, pagination, auth token rotation |
| Event Streams | Kafka, Kinesis, Pub/Sub | Consumer groups, connectors | Ordering guarantees, duplicate events |
| Flat Files | CSV, JSON, Parquet, Excel | S3/GCS listeners, SFTP polling | Schema inconsistency, missing values, encoding |
| SaaS Apps | Google Analytics, Marketo, Zendesk | Managed connectors (Fivetran, Airbyte) | Vendor API changes, undocumented schemas |
| NoSQL / Documents | MongoDB, DynamoDB, Firestore | Dump + parse, CDC, change streams | Schema-less = inconsistent structures |
Load
Loading is the act of writing extracted data into your destination. In ELT, you load raw, unmodified data into a dedicated raw schema. Speed and fidelity are the goals here — you are not cleaning anything yet.
Load Strategies
Each run appends new records. Historical runs are fully preserved. Great for event logs, audit tables, and immutable data like transactions. Requires deduplication in the transform layer.
Insert new records, update existing ones based on a primary key. Keeps the table current without duplicates. Most common for entity tables (users, products, accounts).
Truncate and reload the entire table every run. Simple and avoids drift. Works well for small tables (<1M rows). High cost for large tables due to re-scanning everything.
Write data into date/time-based partitions. Enables efficient partition-pruning in queries. Load replaces only the relevant partition. Standard pattern for large event tables in BigQuery.
Raw Layer Conventions
raw_stripe, raw_salesforce. (2) Never modify raw data — it is the source of truth. (3) Add metadata columns: _loaded_at, _source_file, _batch_id. (4) Store as-is, even nulls and broken values — those are data quality issues to solve in transform. (5) Grant read-only access to everyone, write access only to the pipeline.Transform
Transformation is where raw data becomes business-ready. In ELT, this happens entirely inside the warehouse using SQL — and almost always with a tool called dbt (data build tool). Transforms are organized into layers.
The Three-Layer Model
1:1 with raw tables. Light cleaning only: rename columns, cast types, filter deleted records. No business logic. One staging model per source table.
- Rename
amt→amount_usd - Cast
VARCHAR→DATE - Exclude soft-deleted rows
- Standardize booleans (
'Y'/'N'→TRUE/FALSE)
Business logic lives here. Join staging tables, compute derived fields, build reusable building blocks. Not exposed to analysts directly.
- Join orders + customers
- Compute order lifetime value
- Resolve slowly-changing dimensions
- Aggregate daily events
Final, business-facing models organized by domain. Fact tables (events) and dimension tables (entities). This is what analysts and BI tools query.
fct_orders— one row per orderdim_customers— one row per customerfct_revenue_daily— aggregatedmart_sales— domain-wide bundle
Tools & Ecosystem
The modern ELT stack has a clear separation of concerns: ingestion tools (Extract + Load), a cloud warehouse (storage + compute), and a transformation framework (Transform). You pick one from each category.
| Layer | Tool | What It Does | Best For |
|---|---|---|---|
| Extract + Load | Fivetran | Managed connectors for 300+ SaaS/DB sources. Fully automated, handles schema changes, incremental syncing, log-based CDC. | Teams that want zero-maintenance pipelines. Highest reliability but expensive. |
| Extract + Load | Airbyte | Open-source alternative to Fivetran with 350+ connectors. Can self-host or use Airbyte Cloud. More flexible, lower cost. | Teams comfortable with ops overhead who want flexibility and cost control. |
| Extract + Load | Stitch | Simple, lightweight SaaS ingestion. Good for getting started quickly at moderate scale. | Startups and smaller teams that don't need Fivetran's full feature set. |
| Warehouse | Snowflake | Cloud data warehouse. Separates storage and compute. Near-unlimited scale. Strong SQL support, time-travel, zero-copy cloning. | Enterprise and mid-market. Best-in-class features. Premium pricing. |
| Warehouse | BigQuery | Google's serverless data warehouse. Pay-per-query model. Native ML (BQML). Best integration with GCP ecosystem. | Google Cloud shops. Serverless simplicity. Often cheapest at high scale. |
| Warehouse | Redshift | Amazon's columnar warehouse. Deep AWS integration. Spectrum for S3 querying. | AWS-native stacks. Good for teams already invested in the AWS ecosystem. |
| Warehouse | Databricks | Lakehouse platform combining data lake + warehouse. Strong for ML workloads, Delta Lake format. | Data science-heavy teams that need both analytics and ML at scale. |
| Transform | dbt | The standard tool for SQL-based transformation. Models as .sql files, version control, tests, docs, lineage. Open-source core + dbt Cloud (managed). | Almost every modern data team. The de facto standard for ELT transforms. |
| Transform | Dataform | Google's transformation tool (acquired 2020). Similar to dbt, now part of BigQuery. SQLX syntax. | Teams fully on GCP/BigQuery who prefer native Google tooling. |
| Orchestration | Airflow | Open-source workflow orchestrator. Define DAGs in Python. Schedules and monitors pipeline runs. Industry standard. | Teams needing complex scheduling, dependencies, and custom operators. |
| Orchestration | Dagster / Prefect | Modern Airflow alternatives with better developer experience, observability, and asset-centric thinking. | New pipelines where developer UX and observability are priorities. |
Real-World Examples
The best way to understand ELT is to walk through complete scenarios. Below are three worked examples spanning common use cases: e-commerce analytics, SaaS metrics, and marketing attribution.
Example 1 — E-Commerce Revenue Pipeline
An e-commerce company wants to track daily revenue, refund rates, and customer lifetime value from Shopify and Stripe.
Example 2 — SaaS Product Metrics (MRR & Churn)
Example 3 — Marketing Attribution
Design Patterns
Track how entity attributes change over time. Instead of overwriting, add a new row with valid_from / valid_to timestamps. Query at a point-in-time by joining on the date range.
Example: a customer changes their country. SCD Type 2 preserves the old record so you can analyze what segment they were in at time of purchase.
Running your pipeline twice should produce the same result as running it once. Achieve this by: (1) using MERGE/upsert not raw INSERT, (2) partitioned overwrites, (3) deduplication keys. Without idempotency, re-runs double-count data.
dbt Snapshots automatically capture historical states of mutable tables. Point dbt at a source table, define a strategy (timestamp or check), and dbt handles SCD Type 2 logic for you with dbt snapshot.
For large tables, dbt incremental models only process new/changed rows on each run instead of rebuilding the full table. Reduces warehouse costs dramatically at scale. Add {{ config(materialized='incremental') }} to your model.
Data Quality
A pipeline that produces wrong data is worse than no pipeline. Data quality testing should be built into every layer — not bolted on later. dbt ships with a built-in testing framework that makes this easy.
not_null— column must never be nullunique— column values must be distinctaccepted_values— column must be in a defined setrelationships— foreign key must exist in referenced table
- Row count checks — did the table suddenly lose 50% of rows?
- Freshness checks — is the most recent record too old?
- Distribution tests — is revenue_usd within expected range?
- Cross-model consistency — do order counts match between tables?
Common Pitfalls
The #1 mistake. Untested pipelines silently produce wrong numbers. By the time you notice, six months of dashboards are wrong. Add dbt tests from day one — it takes 10 minutes and saves weeks.
In ELT, the extractor should be dumb — just move data. If you start adding business logic in your Airbyte custom transformations or Python extractor, you're rebuilding ETL and losing ELT's advantages.
Joining a table on a non-unique key multiplies rows. Always verify join cardinality. Add a unique test on your staging model's primary key before building on top of it.
Source APIs add, rename, or remove columns without warning. Your pipeline breaks silently. Configure your ingestion tool to alert on schema changes, and add on_schema_change logic in dbt incremental models.
If your load step uses raw INSERT without deduplication, re-running the pipeline duplicates data. Always design loads to be safe to re-run: MERGE, partitioned overwrites, or truncate-and-reload.
A 400-line SQL model that does everything is impossible to debug or extend. Break complex logic into intermediate models with clear names. Each model should do one thing. The three-layer pattern (staging → intermediate → marts) enforces this.
Learning Roadmap
ELT is a skill that compounds. The table below maps a realistic learning path — start at your current level and work forward. Most people go from beginner to job-ready in 3–4 months of consistent practice.
- SQL basics — SELECT, JOIN, GROUP BY, window functions
- Understand ETL vs ELT conceptually
- Set up BigQuery free tier + connect Airbyte
- Load a CSV and write your first query
- Learn what a data warehouse is vs a transactional DB
- Build dbt project: staging + one mart model
- Write dbt tests (not_null, unique, accepted_values)
- Understand incremental vs full-refresh models
- Connect a real SaaS source via Airbyte/Fivetran
- Schedule a pipeline with Airflow or Dagster
- Implement SCD Type 2 with dbt Snapshots
- Build a full three-layer dbt project (stg → int → mart)
- Understand fact vs dimension table patterns
- Add custom data quality tests + alerting
- Profile and optimize slow warehouse queries
- Design a multi-source analytical data platform
- Implement CDC with Debezium + Kafka
- Build a data contract between teams
- Deploy dbt in CI/CD with PR-based testing
- Data observability with Monte Carlo or Elementary
docs.getdbt.com (the best free data engineering education) · airbyte.com/tutorials (connector walkthroughs) · docs.snowflake.com · Fundamentals of Data Engineering by Joe Reis & Matt Housley (book) · The dbt Viewpoint (dbt Labs blog) · github.com/dbt-labs/jaffle-shop (canonical beginner dbt project to clone and explore).| Resource | Type | Best For |
|---|---|---|
| dbt Learn (learn.getdbt.com) | Free online course | Best structured intro to dbt and ELT patterns |
| Jaffle Shop (dbt Labs) | GitHub project | Reference dbt project to clone and learn from |
| BigQuery Sandbox | Free cloud service | Practice SQL and dbt with no cost for first 10GB/month |
| Airbyte Quickstart | Tutorial | Set up your first ELT pipeline in an afternoon |
| Fundamentals of Data Engineering | Book (O'Reilly) | Comprehensive overview of the full data engineering landscape |
| dbt Slack Community | Community | Best place to get unstuck, 50,000+ members |