o its volatility, schema stability, and consumer profile. Classify workloads into:
- Structured BI: Fixed schema, high concurrency, low latency, strict SLAs
- Exploratory/ML: Semi-structured, high throughput, schema evolution, batch/streaming
- Compliance/Archival: Immutable, low access frequency, retention policies
-
Select Storage Paradigm Based on Workload
- Use DW for structured, high-concurrency SQL workloads with strict governance.
- Use DL for raw ingestion, ML training data, and unstructured/semi-structured assets.
- Use Lakehouse for mixed workloads requiring ACID transactions, time travel, and schema evolution without vendor lock-in.
-
Implement Medallion Architecture with Open Formats
Bronze (raw ingestion) → Silver (cleaned, deduplicated, typed) → Gold (business-ready, aggregated). Enforce schema evolution at the Silver layer using Iceberg or Delta Lake. This prevents the "data swamp" while preserving DW-like reliability.
-
Decouple Compute from Storage
Provision separate compute clusters for ingestion (Spark/Flink), serving (Trino/Presto), and BI (Databricks SQL/Snowflake). Route queries through a unified catalog (AWS Glue, Hive Metastore, or Unity Catalog) to avoid data duplication.
-
Implement Tiered Storage & Lifecycle Policies
Apply S3 Intelligent-Tiering or equivalent. Set lifecycle rules: Hot (0-30 days), Warm (30-180 days), Cold (180-365 days), Archive (365+ days). Align compute engine caching with access patterns to avoid scanning cold data for latency-sensitive queries.
-
Establish Governance & Lineage
Deploy data quality checks at Silver ingestion (Great Expectations, Deequ, or native Delta/Iceberg constraints). Register all tables in a central catalog. Implement row/column-level security and audit logging before production rollout.
Code Examples
Spark SQL: Iceberg Table Creation with Schema Evolution
CREATE TABLE analytics.events_silver (
event_id STRING,
user_id STRING,
event_type STRING,
timestamp TIMESTAMP,
properties MAP<STRING, STRING>
)
USING ICEBERG
PARTITIONED BY (hours(timestamp), event_type)
TBLPROPERTIES (
'format-version' = '2',
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'zstd',
'write.metadata.previous-versions-max' = '5'
);
-- Schema evolution: add column without rewriting data
ALTER TABLE analytics.events_silver ADD COLUMNS (session_id STRING);
Terraform: Decoupled S3 Storage + IAM for Lakehouse
resource "aws_s3_bucket" "data_lake" {
bucket = "${var.env}-data-lake-${var.region}"
lifecycle_rule {
id = "tiering"
enabled = true
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 180
storage_class = "GLACIER"
}
expiration {
days = 2555
}
}
}
resource "aws_iam_role" "spark_compute" {
name = "${var.env}-spark-compute"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "ec2.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy_attachment" "s3_access" {
role = aws_iam_role.spark_compute.name
policy_arn = "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"
}
Architecture Decisions
- Table Format Selection: Iceberg excels in multi-engine environments (Spark, Trino, Flink, BigQuery). Delta Lake optimizes for Databricks ecosystems. Choose based on compute diversity, not vendor preference.
- Partitioning Strategy: Avoid over-partitioning. Use dynamic partition pruning and bucketing for high-cardinality columns. Partition by time (day/hour) and low-cardinality filters (region, tenant).
- Compaction & Optimization: Schedule Z-ordering (Delta) or sort-based compaction (Iceberg) weekly for frequently filtered columns. Monitor file size distribution; target 128MB-1GB per file.
- Query Routing: Route BI tools to serving layer (Trino/BigQuery). Route ML pipelines to raw/silver layers via Spark. Never allow ad-hoc queries against Bronze.
Pitfall Guide
-
Treating the Data Lake as a Dumping Ground
Ingesting raw files without validation, partitioning, or catalog registration creates a data swamp. Mitigation: Enforce Bronze→Silver pipeline contracts. Reject malformed records to a quarantine bucket with alerting.
-
Schema-on-Read as an Excuse for Poor Data Quality
Deferring schema enforcement until query time shifts cost to downstream consumers. Mitigation: Apply schema validation at ingestion using Avro/Protobuf contracts or Iceberg schema evolution rules.
-
Dual-Stack Duplication Without a Unified Catalog
Running DW and DL in parallel without cross-platform metadata leads to version drift. Mitigation: Use AWS Glue, Unity Catalog, or Hive Metastore as the source of truth. Query DL tables via Trino, push only aggregated Gold tables to DW.
-
Ignoring Data Lifecycle & Tiering
Storing all data in hot storage inflates costs by 300-500%. Mitigation: Implement automated lifecycle policies. Archive compliance data. Use materialized views for frequent aggregations instead of scanning raw partitions.
-
Over-Provisioning Compute for Ad-Hoc Queries
Dedicated clusters for exploratory analysis waste credits. Mitigation: Use serverless query engines (Athena, BigQuery, Databricks SQL) for ad-hoc workloads. Reserve provisioned compute for SLA-bound pipelines.
-
Treating Governance as an Afterthought
Role-based access, PII masking, and audit trails delayed until production cause security breaches and compliance failures. Mitigation: Implement column-level encryption, data masking policies, and lineage tracking before pipeline deployment.
-
Skipping Compaction & File Management
Thousands of small files degrade query performance and increase metadata overhead. Mitigation: Schedule nightly compaction jobs. Monitor files_written vs files_read ratios. Target <1000 files per partition.
Production Bundle
Action Checklist
Decision Matrix
| Workload Characteristic | Recommended Approach | Why |
|---|
| Fixed schema, high-concurrency BI, strict SLAs | Data Warehouse | Optimized query engine, built-in governance, predictable latency |
| Raw ingestion, ML training, streaming, schema evolution | Data Lake | Cost-efficient storage, flexible formats, decoupled compute |
| Mixed SQL/ML workloads, ACID requirements, multi-engine access | Lakehouse (Iceberg/Delta) | Transactional safety, time travel, unified catalog, open format |
| Compliance/Archival, low access, long retention | Cold Storage + DL | Minimal cost, immutable retention, catalog-only metadata |
| Real-time analytics, sub-second latency, high throughput | DW + Streaming Ingest | Optimized for low-latency joins, materialized views, caching |
Configuration Template
Terraform + Spark + Iceberg Lakehouse Setup
# main.tf
variable "env" { default = "prod" }
variable "region" { default = "us-east-1" }
provider "aws" { region = var.region }
resource "aws_s3_bucket" "lakehouse" {
bucket = "${var.env}-lakehouse-${var.region}"
acl = "private"
}
resource "aws_glue_catalog_database" "analytics" {
name = "${var.env}_analytics"
}
resource "aws_iam_role" "spark_role" {
name = "${var.env}-spark-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = { Service = "emr.amazonaws.com" }
}]
})
}
resource "aws_iam_role_policy" "s3_access" {
name = "${var.env}-s3-access"
role = aws_iam_role.spark_role.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
Resource = [
aws_s3_bucket.lakehouse.arn,
"${aws_s3_bucket.lakehouse.arn}/*"
]
}]
})
}
Spark Submit Command (Iceberg + S3)
spark-submit \
--packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.0 \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.spark_catalog.warehouse=s3a://${var.env}-lakehouse-${var.region}/ \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
--class com.codcompass.iceberg.IngestionJob \
/opt/app/ingestion-job.jar
Quick Start Guide
- Provision Storage & Catalog: Create S3/ADLS bucket with lifecycle rules. Register database in AWS Glue or Unity Catalog. Attach IAM roles with least-privilege S3 access.
- Initialize Table Format: Deploy Iceberg or Delta Lake runtime. Create Bronze table with partitioning by ingestion date. Configure schema enforcement and quarantine routing.
- Ingest & Validate: Run Spark/Flink job to land raw data. Apply Great Expectations/Deequ checks at Silver layer. Reject malformed records, log failures, and trigger alerts.
- Serve & Optimize: Register Silver/Gold tables in unified catalog. Route BI tools to serving engine. Schedule weekly compaction, statistics collection, and tiered storage transitions.
- Govern & Monitor: Implement column-level security, PII masking, and audit logging. Track query latency, storage costs, and pipeline success rates. Adjust partitioning and compute sizing based on telemetry.
The warehouse-lake dichotomy is a legacy framing. Modern data architecture succeeds when storage format, compute topology, and governance boundaries align with workload characteristics, not vendor roadmaps. Build for evolution, enforce contracts early, and let the catalog drive discovery.