are the only viable paths for systems targeting 99.99% availability at scale.
Core Solution
Implementing a robust archival strategy requires a hybrid architecture separating hot transactional data from cold archival storage. The solution below outlines a Stream-Based Archival pattern using TypeScript, suitable for high-throughput systems.
Architecture Decisions
- Hot/Cold Separation: The production database retains only data required for active business logic (e.g., last 90 days). Older data is moved to immutable object storage.
- Idempotent Movement: Archival jobs must be idempotent. If a job fails after copying but before deleting, re-running the job must not duplicate data or cause errors.
- Integrity Verification: Archival must include checksum validation to ensure data integrity during transfer.
- Keyset Pagination: Archival jobs must use keyset pagination to avoid performance degradation on large datasets, unlike offset-based pagination.
Step-by-Step Implementation
1. Schema Preparation
Add an archived_at timestamp to track lifecycle state. Ensure the archival key is indexed.
ALTER TABLE transactions ADD COLUMN archived_at TIMESTAMPTZ;
CREATE INDEX idx_transactions_archived ON transactions(archived_at, id);
2. Archival Service (TypeScript)
This service batches data, writes to object storage, verifies integrity, and safely deletes source rows.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { createHash } from "crypto";
import { Pool } from "pg";
interface ArchiveConfig {
batchSize: number;
retentionDays: number;
bucketName: string;
region: string;
}
interface ArchiveResult {
archivedCount: number;
checksum: string;
prefix: string;
}
export class DataArchivalService {
private s3: S3Client;
private db: Pool;
constructor(private config: ArchiveConfig) {
this.s3 = new S3Client({ region: config.region });
this.db = new Pool({ /* DB Config */ });
}
/**
* Executes archival using keyset pagination for O(1) performance.
* Processes data in batches to prevent transaction log explosion.
*/
async runArchival(): Promise<void> {
const cutoffDate = new Date();
cutoffDate.setDate(cutoffDate.getDate() - this.config.retentionDays);
let lastArchivedId = 0;
let totalArchived = 0;
while (true) {
// Keyset pagination: fetch rows older than cutoff, ordered by ID
const query = `
SELECT id, payload, created_at
FROM transactions
WHERE archived_at IS NULL
AND created_at < $1
AND id > $2
ORDER BY id ASC
LIMIT $3
FOR UPDATE SKIP LOCKED
`;
const res = await this.db.query(query, [cutoffDate, lastArchivedId, this.config.batchSize]);
const rows = res.rows;
if (rows.length === 0) break;
// Process batch
const batchData = JSON.stringify(rows);
const checksum = createHash('sha256').update(batchData).digest('hex');
const s3Key = `archive/transactions/${Date.now()}-${checksum}.jsonl`;
// Upload to S3
await this.s3.send(new PutObjectCommand({
Bucket: this.config.bucketName,
Key: s3Key,
Body: batchData,
Metadata: { checksum, 'archival-date': new Date().toISOString() }
}));
// Transactional delete
const ids = rows.map(r => r.id);
await this.db.query(
`UPDATE transactions SET archived_at = NOW() WHERE id = ANY($1)`,
[ids]
);
// Verify deletion count matches batch
const deleteCheck = await this.db.query(
`SELECT count(*) FROM transactions WHERE id = ANY($1) AND archived_at IS NULL`,
[ids]
);
if (parseInt(deleteCheck.rows[0].count) !== 0) {
throw new Error("Archival integrity check failed: Rows still present after update.");
}
lastArchivedId = rows[rows.length - 1].id;
totalArchived += rows.length;
}
console.log(`Archival complete. Total rows moved: ${totalArchived}`);
}
}
3. Retrieval Strategy
Archived data must be queryable for compliance and debugging. Implement a retrieval proxy that checks hot storage first, then falls back to cold storage.
async function retrieveTransaction(id: string): Promise<Transaction | null> {
// 1. Check hot storage
const hotResult = await db.query('SELECT * FROM transactions WHERE id = $1', [id]);
if (hotResult.rows.length > 0) return hotResult.rows[0];
// 2. Check metadata index for archive location
// (Requires a lightweight index table mapping ID -> S3 Key)
const archiveMeta = await archiveIndexDb.query('SELECT s3_key FROM archive_index WHERE id = $1', [id]);
if (!archiveMeta.rows.length) return null;
// 3. Fetch from S3 and parse
const s3Result = await s3.getObject({ Bucket: '...', Key: archiveMeta.rows[0].s3_key });
const content = await s3Result.Body.transformToString();
const transactions = JSON.parse(content);
return transactions.find(t => t.id === id) || null;
}
Pitfall Guide
- Soft Delete Index Bloat: Using
is_deleted flags keeps rows in indexes. Queries scanning for active data must filter these rows, and VACUUM must process them. This increases I/O and storage costs without benefit.
- Fix: Physically remove data or partition it out.
- Blocking Transactions: Archival jobs that lock large ranges of rows block production traffic.
- Fix: Use
SKIP LOCKED and small batches. Run archival during low-traffic windows if possible, or use CDC to offload archival to a separate pipeline.
- Orphaned Archive Data: Archiving parent records without children, or vice versa, breaks referential integrity in the archive.
- Fix: Archive in dependency order or use a denormalized archive format where relationships are flattened. Maintain a mapping table in the archive.
- No Retrieval Path: Archiving data to a "black hole" where retrieval is manual or impossible violates compliance requirements.
- Fix: Build an automated retrieval layer. Ensure the archive index is searchable by business keys, not just internal IDs.
- Ignoring Egress Costs: Retrieving data from cold storage tiers (e.g., Glacier Deep Archive) incurs high egress fees and latency (hours to days).
- Fix: Classify data by retrieval urgency. Use "Warm" tiers for data needed within minutes/hours. Reserve "Cold" for compliance-only data.
- Checksum Neglect: Data corruption during transfer or storage degradation is rare but catastrophic.
- Fix: Always compute and store checksums. Verify checksums during retrieval and periodic integrity audits.
- Schema Drift in Archives: If the production schema changes, old archive files may become unparseable.
- Fix: Embed schema version in archive metadata. Use flexible formats (JSON) or implement a migration layer for archive retrieval that handles versioning.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| High-Velocity Logs | Stream to Object Store (Parquet) | Decouples DB; Parquet enables columnar analytics. | Low storage; Low compute. |
| Financial Transactions | Partitioned DB + Immutable S3 | Retains queryability on hot data; compliance on cold. | Medium storage; Medium complexity. |
| User PII Cleanup | Soft Delete + Batch Purge | Simplicity for GDPR "Right to be Forgotten". | Low cost; Low risk if purged promptly. |
| Audit Trails | Write-Once-Read-Many (WORM) Storage | Legal requirement for immutability. | Low storage; High compliance value. |
| Legacy Data Migration | One-time Bulk Copy + Delete | Reduces active DB size immediately. | One-time compute cost; Long-term savings. |
Configuration Template
Use this YAML configuration to parameterize archival workers across environments.
# archive-config.yaml
archival:
retention:
transactions: 90d
user_events: 365d
audit_logs: 2555d # 7 years
storage:
hot:
provider: postgres
connection: ${DB_URL}
cold:
provider: s3
bucket: ${ARCHIVE_BUCKET}
region: us-east-1
tier: STANDARD_IA # For frequent access
lifecycle:
- transition_to: GLACIER
after_days: 180
pipeline:
batch_size: 5000
concurrency: 4
checksum_algorithm: sha256
idempotency_key: "archival-job-v1"
retrieval:
cache_ttl: 300s # Cache hot results
fallback_strategy: async_fetch
Quick Start Guide
- Schema Update: Add
archived_at column and index to target tables. Run ALTER TABLE ... ADD COLUMN ....
- Deploy Worker: Containerize the TypeScript archival service. Configure environment variables for DB and S3 access.
- Schedule Execution: Set up a cron job or Kubernetes CronJob to run the archival service every hour.
- Verify Metrics: Check CloudWatch/Datadog for
archival_rows_processed and archival_errors. Ensure batch processing is smooth.
- Test Retrieval: Manually query an archived record via the retrieval proxy to confirm the hot/cold fallback works.
Data archival is not a maintenance task; it is a core architectural requirement for scalable systems. By implementing stream-based or partitioned strategies, you decouple storage growth from performance degradation, reduce TCO by orders of magnitude, and ensure compliance readiness. Treat data lifecycle with the same rigor as application code.