a production-grade ELT pipeline requires deliberate architecture, not just reordered steps. The following implementation demonstrates a TypeScript-based extraction layer, cloud storage staging, and warehouse-native transformation orchestration.
Architecture Decisions and Rationale
- Raw Zone First: All extracted data lands in an immutable raw zone partitioned by ingestion timestamp. This preserves auditability and enables replay.
- Transformation in Warehouse: SQL-based transformations run inside the CDW using a framework like dbt. This leverages MPP parallelism, columnar pruning, and native optimization.
- Idempotent Loads: Extractors use upsert/merge semantics or partition overwrite to guarantee exactly-once delivery despite retries.
- Separation of Concerns: TypeScript handles extraction, authentication, and transport. SQL handles business logic, joins, and aggregation. Orchestration handles scheduling, retries, and lineage.
Step-by-Step Implementation
Step 1: Extract and Load to Cloud Storage
A TypeScript module extracts from a PostgreSQL source, batches rows, and writes to S3/GCS in partitioned Parquet files.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { createReadStream, createWriteStream } from "fs";
import { pipeline } from "stream/promises";
import { parse } from "pg-copy-streams";
import { compress } from "zlib";
const s3 = new S3Client({ region: process.env.AWS_REGION });
const BATCH_SIZE = 50_000;
async function extractAndLoad(schema: string, table: string, partitionDate: string) {
const copyStream = parse(`COPY ${schema}.${table} TO STDOUT WITH (FORMAT CSV, HEADER true)`);
const rawStream = createWriteStream(`/tmp/${table}_${partitionDate}.csv`);
// Execute copy command via pg pool (omitted for brevity)
// await pool.query(copyStream);
// Compress and upload to S3 raw zone
const compressedStream = compress({ level: 9 });
const uploadCommand = new PutObjectCommand({
Bucket: process.env.RAW_BUCKET,
Key: `raw/${schema}/${table}/${partitionDate}/data.parquet`,
Body: createReadStream(`/tmp/${table}_${partitionDate}.csv`).pipe(compressedStream),
ContentType: "application/parquet",
});
await s3.send(uploadCommand);
console.log(`Loaded ${table} partition ${partitionDate} to raw zone`);
}
export { extractAndLoad };
Step 2: Warehouse Schema and Staging
Define a staging view that reads directly from cloud storage using external tables or native file formats.
-- BigQuery / Snowflake syntax concept
CREATE OR REPLACE EXTERNAL TABLE raw.stg_orders
LOCATION = 'gs://your-bucket/raw/public/orders/'
FORMAT = PARQUET
OPTIONS (
hive_partition_uri_prefix = 'gs://your-bucket/raw/public/orders/',
require_partition_filter = true
);
Step 3: Transform in Warehouse
Use SQL to clean, join, and aggregate. This runs inside the CDW, not in application memory.
-- curate/orders_daily.sql
WITH raw_orders AS (
SELECT
order_id,
customer_id,
order_date::DATE AS order_date,
status,
amount
FROM raw.stg_orders
WHERE order_date = '{{ var("execution_date") }}'
),
customer_dims AS (
SELECT customer_id, tier, region
FROM curated.dim_customers
)
SELECT
r.order_id,
r.customer_id,
c.tier,
c.region,
r.order_date,
r.status,
r.amount,
r.amount * 0.2 AS tax_estimate
FROM raw_orders r
JOIN customer_dims c ON r.customer_id = c.customer_id
WHERE r.status NOT IN ('cancelled', 'refunded');
Step 4: Orchestration and Execution
Trigger extraction, then run transformations. The pipeline treats the warehouse as the compute engine.
import { extractAndLoad } from "./extractor";
import { execSync } from "child_process";
async function runELTPipeline(date: string) {
const tables = ["orders", "customers", "payments"];
for (const table of tables) {
await extractAndLoad("public", table, date);
}
// Execute warehouse transformations via dbt or native SQL runner
execSync(`dbt run --select curated.orders_daily --vars '{"execution_date": "${date}"}'`);
console.log(`ELT pipeline complete for ${date}`);
}
runELTPipeline("2024-06-15");
This architecture eliminates dedicated transformation clusters, reduces data movement, and aligns compute with the storage layer. The TypeScript layer remains lightweight, focused on extraction, retry logic, and transport. All business logic lives in version-controlled SQL, enabling peer review, testing, and lineage tracking.
Pitfall Guide
-
Loading Unvalidated Data into Production Tables
Raw data often contains malformed dates, null keys, or schema drift. Loading directly into curated tables breaks downstream dashboards. Fix: Enforce a strict raw β staging β curated boundary. Validate at the staging layer using data contracts or assertion frameworks.
-
Over-Engineering Transformations in TypeScript
Teams port ETL transformation logic to Node.js, defeating the purpose of ELT. JavaScript lacks set-based optimization, partition pruning, and parallel execution. Fix: Keep TS for extraction/transport. Push joins, aggregations, and window functions to SQL. Use dbt or warehouse-native stored procedures.
-
Ignoring CDC vs Full Refresh Trade-offs
Full table loads on high-volume sources saturate network bandwidth and warehouse ingestion queues. CDC reduces payload but requires log parsing, schema tracking, and idempotency handling. Fix: Use CDC for tables >1M rows or high-update frequency. Use incremental loads with updated_at filters for medium-volume tables. Reserve full refreshes for dimension tables or low-churn sources.
-
Mismanaging Compute Credits in CDWs
ELT shifts cost from ETL clusters to warehouse queries. Poorly written transformations, missing clustering keys, or unpartitioned joins cause credit spikes. Fix: Profile queries with EXPLAIN, enforce partition filters, materialize intermediate results only when necessary, and set warehouse size auto-suspend thresholds.
-
Schema Drift Without Contract Enforcement
Source systems add columns, change types, or drop fields. ELT pipelines that blindly ingest raw data propagate breakages downstream. Fix: Implement schema validation at load time. Use data contract tools (e.g., Great Expectations, Soda, or custom TypeScript validators) to reject payloads violating expected schemas.
-
Security and Compliance Bypass
Raw zones often contain PII, credentials, or regulated data. Storing unmasked raw data violates GDPR, HIPAA, or SOC2 requirements. Fix: Apply column-level masking or tokenization at ingestion. Use warehouse-native row-level security and column-level encryption. Never expose raw zones to business users.
-
Treating ELT as "Set and Forget"
ELT pipelines degrade silently. Schema changes, upstream API deprecations, and data quality drift accumulate. Fix: Implement data quality gates between raw and staging. Monitor row counts, null ratios, and freshness SLAs. Alert on contract violations before transformations run.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Real-time analytics (<5 min latency) | ELT with CDC + stream ingestion | Warehouse-native streaming reduces ETL middleware overhead | +15-25% warehouse credits, -40% infra maintenance |
| Compliance-heavy (PII/PHI masking required) | Hybrid (ETL for masking β ELT for analytics) | Pre-processing ensures regulatory compliance before warehouse load | +20% pipeline complexity, -60% audit risk |
| Legacy on-prem source with no CDC | ETL | Limited change data capture forces full extract; transformation offloads warehouse | +30% network egress, +10% compute cost |
| Startup MVP / Rapid prototyping | ELT | Minimizes infrastructure, accelerates iteration, leverages serverless DW | -50% upfront cost, +10% query optimization overhead |
| High-volume IoT/telemetry (>10B rows/day) | ELT with partitioned raw + incremental curated | Warehouse scales better than ETL clusters for append-heavy workloads | -35% total cost, requires partition strategy |
Configuration Template
# dbt_project.yml
name: 'codcompass_elt'
version: '1.0.0'
config-version: 2
profile: 'warehouse_profile'
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]
target-path: "target"
clean-targets:
- "target"
- "dbt_packages"
models:
codcompass_elt:
raw:
+materialized: view
+tags: ["raw", "immutable"]
staging:
+materialized: incremental
+unique_key: "ingestion_id"
+tags: ["staging", "validated"]
curated:
+materialized: table
+partition_by: "order_date"
+tags: ["curated", "business-logic"]
# orchestration/.env
RAW_BUCKET=gs://your-project/raw-zone
WAREHOUSE_SCHEMA=curated
EXECUTION_DATE={{ execution_date }}
TRANSFORM_TIMEOUT=300
DATA_CONTRACT_STRICT=true
Quick Start Guide
- Provision Warehouse & Storage: Create a cloud data warehouse instance (Snowflake/BigQuery/Redshift). Set up a raw storage bucket with lifecycle policies (30-day raw retention, 1-year curated).
- Deploy Extractor: Run the TypeScript extraction module against a source database. Verify partitioned files land in the raw zone with correct naming conventions.
- Initialize Transformation Layer: Run
dbt init in your project directory. Configure profiles.yml to connect to your warehouse. Execute dbt run to materialize staging and curated models.
- Validate & Schedule: Run
dbt test to enforce data contracts. Schedule the pipeline via cron, Airflow, or Dagster. Monitor query profiles and credit consumption for the first 72 hours. Adjust clustering keys and partition filters based on observed patterns.