es restructuring how concurrent steps are declared and awaited. The Durable Functions SDK provides two primitives: context.parallel() for fixed sets of independent operations, and context.map() for dynamic arrays of homogeneous tasks. Both enforce declaration-order checkpointing while executing work concurrently.
Step-by-Step Implementation
- Identify concurrent boundaries: Locate all Promise.all() calls that spawn independent I/O or compute tasks within a single durable step.
- Extract step handlers: Convert each inline promise into a named async function that the SDK can track individually.
- Replace with SDK primitives: Swap Promise.all() for context.parallel() or context.map() depending on whether the workload is fixed or dynamic.
- Align error handling: Durable parallel primitives aggregate errors differently than native promises. Implement structured error boundaries to handle partial failures.
- Validate replay behavior: Test execution under simulated cold starts and retry conditions to confirm checkpoint alignment.
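The last step, replay validation, can be rehearsed locally before anything is deployed. The sketch below is a toy in-memory model and not the SDK: `ToyContext`, its `step` method, and the `CheckpointLog` type are hypothetical names invented for illustration. It records each step result under an ID assigned at call time and, when a fresh context is run against the same log, returns the recorded value instead of executing the handler again.

```typescript
// Toy model of checkpoint replay. Not the SDK: every name here is hypothetical.
type CheckpointLog = Map<number, unknown>;

class ToyContext {
  private nextId = 0;
  private log: CheckpointLog;
  executed: number[] = []; // IDs whose handlers actually ran in this run

  constructor(log: CheckpointLog) {
    this.log = log;
  }

  // The ID is taken synchronously at call time, before any awaiting happens.
  async step<T>(handler: () => Promise<T>): Promise<T> {
    const id = this.nextId++;
    if (this.log.has(id)) {
      // Replay: return the recorded result without re-running the handler.
      return this.log.get(id) as T;
    }
    this.executed.push(id);
    const value = await handler();
    this.log.set(id, value);
    return value;
  }
}

// Workflow under test: two steps declared in a fixed order.
async function workflow(ctx: ToyContext): Promise<string[]> {
  const profile = await ctx.step(async () => "profile");
  const billing = await ctx.step(async () => "billing");
  return [profile, billing];
}

// Runs the workflow twice against one log, simulating a cold start in between.
async function runTwice() {
  const log: CheckpointLog = new Map();
  const first = new ToyContext(log);
  const firstResult = await workflow(first);
  const second = new ToyContext(log); // fresh context, same checkpoint log
  const secondResult = await workflow(second);
  return {
    firstResult,
    secondResult,
    firstRan: first.executed,
    secondRan: second.executed,
  };
}
```

In this model the first run executes both handlers, while the second run executes none and still returns the same values. A real replay test would assert the same property against the actual runtime.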
New Code Example
Consider a workflow that retrieves customer profile data, billing history, and support tickets. The original pattern uses Promise.all():
// Anti-pattern: Non-deterministic checkpoint assignment
const profilePromise = context.step(async () => fetchCustomerProfile(customerId));
const billingPromise = context.step(async () => fetchBillingHistory(customerId));
const ticketsPromise = context.step(async () => fetchSupportTickets(customerId));
const [profile, billing, tickets] = await Promise.all([
  profilePromise,
  billingPromise,
  ticketsPromise
]);
Refactored using the SDK's deterministic parallel primitive:
// Deterministic pattern: SDK-controlled checkpoint ordering
const loadProfile = context.step(async () => fetchCustomerProfile(customerId));
const loadBilling = context.step(async () => fetchBillingHistory(customerId));
const loadTickets = context.step(async () => fetchSupportTickets(customerId));
const [profile, billing, tickets] = await context.parallel([
  loadProfile,
  loadBilling,
  loadTickets
]);
For dynamic workloads, context.map() replaces array-based promise spawning:
// Dynamic parallel execution with deterministic ordering
const invoiceIds = await context.step(async () => fetchInvoiceIds(customerId));
const invoiceDetails = await context.map(invoiceIds, async (invoiceId) => {
  return context.step(async () => fetchInvoiceDetail(invoiceId));
});
Architecture Decisions and Rationale
Why declaration-order checkpointing matters: The SDK assigns checkpoint IDs when the step function is registered, not when it resolves. This guarantees that replay reconstructs the exact same execution graph regardless of network latency or OS scheduling. Promise.all() defers ID assignment to resolution time, breaking this contract.
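To make the contrast concrete, the following toy model numbers three tasks in two ways. It is not SDK code: `assignAtDeclaration` and `assignAtResolution` are hypothetical helpers, and the latencies are made up. Declaration-order IDs depend only on each task's position in the source, while resolution-order IDs depend on simulated latency and therefore change between runs.

```typescript
// Toy model only: illustrates why IDs tied to resolution order are unstable.
interface Task {
  name: string;
  latencyMs: number; // simulated time until the task resolves
}

// ID = position in the declared list; latency is ignored.
function assignAtDeclaration(tasks: Task[]): Record<string, number> {
  const ids: Record<string, number> = {};
  tasks.forEach((task, index) => {
    ids[task.name] = index;
  });
  return ids;
}

// ID = rank by completion time; ties keep declared order (sort is stable).
function assignAtResolution(tasks: Task[]): Record<string, number> {
  const ids: Record<string, number> = {};
  [...tasks]
    .sort((a, b) => a.latencyMs - b.latencyMs)
    .forEach((task, index) => {
      ids[task.name] = index;
    });
  return ids;
}

// The same three tasks under two different network conditions.
const firstRun: Task[] = [
  { name: "profile", latencyMs: 120 },
  { name: "billing", latencyMs: 40 },
  { name: "tickets", latencyMs: 80 },
];
const replayRun: Task[] = [
  { name: "profile", latencyMs: 30 },
  { name: "billing", latencyMs: 200 },
  { name: "tickets", latencyMs: 80 },
];

console.log(assignAtDeclaration(firstRun), assignAtDeclaration(replayRun));
console.log(assignAtResolution(firstRun), assignAtResolution(replayRun));
```

With declaration-order numbering both runs yield profile 0, billing 1, tickets 2. With resolution-order numbering, `billing` receives ID 0 in the first run and ID 2 in the replay, which is exactly the kind of mismatch a replay cannot tolerate.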
Why separate step wrappers are required: Each context.step() call registers a checkpoint boundary. Wrapping concurrent operations in individual step handlers allows the SDK to track state, serialize inputs/outputs, and apply retry policies per operation.
Why error aggregation differs: Native Promise.all() rejects as soon as the first promise rejects. The remaining promises keep running, but the caller no longer observes their outcomes. SDK parallel primitives wait for all operations to complete, then return a structured result array containing success values and error objects. This enables partial failure handling, which is critical for durable workflows where orphaned promises can leak resources or leave state inconsistent.
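The difference can be observed with native promises alone. The sketch below does not use the SDK; it contrasts `Promise.all()` with the standard `Promise.allSettled()`, the closest native analogue to the wait-for-everything behavior described above. The `fetchOk` and `fetchFail` functions are hypothetical stand-ins for real I/O.

```typescript
// Hypothetical stand-ins for real I/O calls.
const fetchOk = (label: string, ms: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(label), ms));
const fetchFail = (label: string, ms: number): Promise<string> =>
  new Promise((_, reject) => setTimeout(() => reject(new Error(label)), ms));

// Fail-fast: rejects as soon as the first task rejects. The caller never
// sees the other outcomes, although those tasks keep running.
async function failFast(): Promise<string> {
  try {
    await Promise.all([
      fetchOk("profile", 30),
      fetchFail("billing", 10),
      fetchOk("tickets", 20),
    ]);
    return "all succeeded";
  } catch (err) {
    return `rejected: ${(err as Error).message}`;
  }
}

// Settle-all: waits for every task and reports each outcome in input order.
async function settleAll(): Promise<string[]> {
  const outcomes = await Promise.allSettled([
    fetchOk("profile", 30),
    fetchFail("billing", 10),
    fetchOk("tickets", 20),
  ]);
  return outcomes.map((o) =>
    o.status === "fulfilled"
      ? `ok:${o.value}`
      : `error:${(o.reason as Error).message}`
  );
}
```

Here `failFast()` reports only the billing failure, while `settleAll()` reports two successes and one error, in declaration order. The SDK's own result shape may differ from `allSettled`, so treat this purely as a mental model.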
Why timeout alignment is necessary: Durable functions operate under explicit timeout boundaries. When using context.parallel(), the SDK enforces a single timeout for the entire batch. Individual operations must complete within this window, or the batch fails. This prevents runaway concurrent tasks from exhausting memory or triggering Lambda execution limits.
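A single batch-level deadline can be approximated in plain TypeScript by racing the whole batch against one timer. This is a sketch of the idea only, not a description of how the SDK implements it; `withBatchTimeout` and `delay` are hypothetical helpers.

```typescript
// Hypothetical helper: applies one deadline to an entire batch.
// Pass the combined batch promise, for example the result of Promise.all().
async function withBatchTimeout<T>(batch: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`batch exceeded ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    // Whichever settles first wins: the batch or the deadline.
    return await Promise.race([batch, deadline]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Simulated operation that resolves with `value` after `ms` milliseconds.
function delay<T>(value: T, ms: number): Promise<T> {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// Usage: the batch fails as a unit if any member outlives the deadline.
async function demo(): Promise<string> {
  try {
    const values = await withBatchTimeout(
      Promise.all([delay("profile", 5), delay("billing", 100)]),
      20
    );
    return values.join(",");
  } catch (err) {
    return (err as Error).message;
  }
}
```

Racing does not cancel the slow operation; it keeps running in the background after the deadline fires. Real cancellation needs cooperation from the operation itself, for example through an `AbortSignal`.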
Pitfall Guide
1. Mixing Native Promises with Durable Steps
Explanation: Developers often wrap only some operations in context.step() while leaving others as raw promises inside Promise.all(). This creates a hybrid execution model in which some work is checkpointed and some is not, so the SDK cannot track state, serialize results, or apply retry policies for the unwrapped operations.
Fix: Every concurrent operation must be wrapped in context.step() before being passed to context.parallel() or context.map(). The SDK requires explicit registration to maintain checkpoint integrity.
2. Assuming Network Latency Guarantees Order
Explanation: Teams assume that because operations are independent, resolution order doesn't matter. In durable execution, resolution order directly impacts checkpoint alignment during replay.
Fix: Never rely on resolution order for state reconstruction. Use SDK parallel primitives that enforce declaration-order checkpointing regardless of actual execution timing.
3. Ignoring Partial Failure Semantics
Explanation: Promise.all() fails fast, but context.parallel() waits for all operations to complete and returns an array of results and errors. Developers who expect native promise behavior often miss error objects in the result array.
Fix: Destructure results carefully and check for error properties. Implement explicit error boundaries that handle partial failures without aborting the entire workflow.
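One way to build such an error boundary is to split the result array into successes and failures before reading any values. The entry shape `{ value?, error? }` below is an assumption based on the configuration template in this article, and `partitionResults` is a hypothetical helper rather than an SDK function.

```typescript
// Assumed result-entry shape: an entry carries either `value` or `error`.
interface StepResult<T> {
  value?: T;
  error?: Error;
}

interface Partitioned<T> {
  successes: { index: number; value: T }[];
  failures: { index: number; error: Error }[];
}

// Splits results while remembering each entry's position in the batch,
// so a failure can be traced back to the step that produced it.
function partitionResults<T>(results: StepResult<T>[]): Partitioned<T> {
  const out: Partitioned<T> = { successes: [], failures: [] };
  results.forEach((result, index) => {
    if (result.error !== undefined) {
      out.failures.push({ index, error: result.error });
    } else {
      out.successes.push({ index, value: result.value as T });
    }
  });
  return out;
}

// Example batch: the second step failed, the others succeeded.
const sample: StepResult<string>[] = [
  { value: "profile" },
  { error: new Error("billing service unavailable") },
  { value: "tickets" },
];
const { successes, failures } = partitionResults(sample);
console.log(successes.map((s) => s.index), failures.map((f) => f.index));
```

For the sample batch the success indices are 0 and 2 and the single failure index is 1. Keeping the index makes it possible to apply a per-step fallback instead of aborting the whole workflow.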
4. Over-Parallelizing CPU-Bound Work
Explanation: Durable parallel primitives are optimized for I/O-bound operations. Spawning CPU-intensive tasks concurrently can exhaust Lambda memory, trigger throttling, or cause cold start degradation.
Fix: Profile execution characteristics. Reserve context.parallel() for network calls, database queries, and external API interactions. Offload CPU-heavy work to dedicated compute layers or batch processing pipelines.
5. Misaligning Timeout Boundaries
Explanation: The SDK applies a single timeout to the entire parallel batch. If one operation takes longer than expected, it delays the entire batch and may trigger Lambda execution limits.
Fix: Configure explicit timeouts per operation where possible, or set batch-level timeouts that account for the slowest expected dependency. Monitor execution duration metrics to adjust boundaries proactively.
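A batch-level timeout that accounts for the slowest dependency can be derived from per-dependency latency estimates. The sketch below is plain arithmetic: the `batchTimeoutMs` helper, the latency figures, and the 25% headroom are illustrative assumptions, not recommended values.

```typescript
// Hypothetical helper. Operations run concurrently, so the batch finishes
// when the slowest one does: the budget is max(latency) plus headroom.
function batchTimeoutMs(expectedLatenciesMs: number[], headroomRatio = 0.25): number {
  if (expectedLatenciesMs.length === 0) {
    throw new Error("at least one latency estimate is required");
  }
  const slowest = Math.max(...expectedLatenciesMs);
  return Math.ceil(slowest * (1 + headroomRatio));
}

// Illustrative worst-case estimates for three dependencies, in milliseconds.
const estimates = { profile: 300, billing: 1200, tickets: 450 };
const budget = batchTimeoutMs(Object.values(estimates));
console.log(budget);
```

With these numbers the slowest dependency is 1200 ms, so the budget comes out at 1500 ms. Summing the latencies instead would overestimate the budget, because the operations do not run one after another.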
6. Forgetting Checkpoint Serialization Limits
Explanation: Each context.step() serializes inputs and outputs to the checkpoint log. Passing large payloads or circular references causes serialization failures or checkpoint bloat.
Fix: Keep step inputs/outputs lightweight. Pass identifiers instead of full objects. Use external storage (S3, DynamoDB) for large payloads and reference them by key in durable steps.
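A guard along the following lines could run before a value is returned from a step. It is a sketch under stated assumptions: `MAX_CHECKPOINT_BYTES` is a made-up limit (check the actual quota for your runtime), and `checkSerializable` is a hypothetical helper that relies only on the standard `JSON.stringify` and `TextEncoder`.

```typescript
// Made-up limit for illustration; substitute the real quota for your runtime.
const MAX_CHECKPOINT_BYTES = 256 * 1024;

type CheckResult =
  | { ok: true; bytes: number }
  | { ok: false; reason: string };

// Verifies that a payload serializes to JSON and fits within the size budget.
function checkSerializable(payload: unknown, maxBytes = MAX_CHECKPOINT_BYTES): CheckResult {
  let json: string | undefined;
  try {
    json = JSON.stringify(payload); // throws a TypeError on circular references
  } catch (err) {
    return { ok: false, reason: `not serializable: ${(err as Error).message}` };
  }
  if (json === undefined) {
    // JSON.stringify returns undefined for functions, symbols, and undefined.
    return { ok: false, reason: "payload has no JSON representation" };
  }
  const bytes = new TextEncoder().encode(json).length; // UTF-8 byte count
  return bytes <= maxBytes
    ? { ok: true, bytes }
    : { ok: false, reason: `payload is ${bytes} bytes, limit is ${maxBytes}` };
}

// Prefer passing a small reference (hypothetical bucket and key) over the full object.
const reference = { bucket: "example-bucket", key: "invoices/123.json" };
console.log(checkSerializable(reference));
```

Running the check inside the step handler surfaces oversized or circular payloads as an ordinary, debuggable error rather than a failure deep in the checkpoint layer.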
7. Assuming context.map() Auto-Batches
Explanation: context.map() executes all items concurrently by default. Large arrays can trigger Lambda concurrency limits, memory exhaustion, or downstream rate limiting.
Fix: Implement explicit batching or chunking strategies. Use context.map() with controlled concurrency limits, or paginate large datasets before parallel execution.
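Chunking is one way to cap concurrency without depending on any SDK option. In the sketch below, `chunk` and `processInChunks` are hypothetical helpers: items inside a chunk run concurrently and chunks run one after another, so at most `size` operations are in flight at a time.

```typescript
// Splits an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Processes chunks sequentially; items inside a chunk run concurrently.
// Results come back in the same order as the input items.
async function processInChunks<T, R>(
  items: T[],
  size: number,
  worker: (item: T) => Promise<R>
) {
  const results: Awaited<R>[] = [];
  for (const group of chunk(items, size)) {
    // Promise.all keeps this sketch SDK-free. In a durable workflow the
    // per-chunk fan-out would presumably go through context.map() instead.
    const settled = await Promise.all(group.map(worker));
    results.push(...settled);
  }
  return results;
}
```

For example, `chunk([1, 2, 3, 4, 5], 2)` yields `[[1, 2], [3, 4], [5]]`. The trade-off is that one slow item holds up its whole chunk; a sliding-window limiter avoids that at the cost of more bookkeeping.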
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Fixed set of independent I/O calls | context.parallel() | Deterministic checkpoint ordering, aggregated error handling | Low (single execution window) |
| Dynamic array of homogeneous tasks | context.map() | Scales with input size, maintains declaration-order checkpoints | Medium (concurrency scales with array length) |
| State-dependent sequential steps | Standard await | Required when step B depends on step A's output | Low (no parallel overhead) |
| High-throughput event processing | External orchestration (Step Functions/SQS) | Lambda durable functions are not designed for high-volume parallel pipelines | High (infrastructure overhead, but necessary for scale) |
| CPU-intensive batch processing | Dedicated compute (Fargate/Batch) | Lambda memory/CPU limits make parallel CPU work inefficient | Medium-High (provisioned resources vs pay-per-use) |
Configuration Template
import { Context, Step } from '@aws-lambda-durable/functions';

// fetchCustomerProfile, fetchBillingHistory, and fetchSupportTickets are
// application functions assumed to be defined elsewhere.

interface WorkflowContext {
  customerId: string;
  maxRetries: number;
  timeoutMs: number;
}

export async function handler(context: Context<WorkflowContext>) {
  const { customerId, maxRetries, timeoutMs } = context.input;

  // Define deterministic step handlers
  const loadProfile: Step<any> = context.step(async () => {
    return fetchCustomerProfile(customerId);
  });
  const loadBilling: Step<any> = context.step(async () => {
    return fetchBillingHistory(customerId);
  });
  const loadTickets: Step<any> = context.step(async () => {
    return fetchSupportTickets(customerId);
  });

  // Execute with deterministic checkpoint alignment
  const results = await context.parallel([loadProfile, loadBilling, loadTickets], {
    timeout: timeoutMs,
    retryPolicy: { maxAttempts: maxRetries, backoffMs: 500 }
  });

  // Handle partial failures explicitly
  const errors = results.filter(r => r.error);
  if (errors.length > 0) {
    context.logger.warn('Partial failure in parallel batch', { errors });
    // Implement fallback or compensation logic
  }

  // Entries that failed carry no value, so these may be undefined
  const [profile, billing, tickets] = results.map(r => r.value);

  return {
    profile,
    billing,
    tickets,
    executionId: context.executionId,
    checkpointVersion: context.checkpointVersion
  };
}
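The template hands `retryPolicy` to the SDK as configuration, so its exact semantics are not visible here. As an illustration of what a policy with `maxAttempts` and `backoffMs` could mean, the sketch below implements a fixed-delay retry loop. `retryWithBackoff` is a hypothetical helper, and whether the SDK uses fixed or exponential delays is an open assumption.

```typescript
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first one
  backoffMs: number; // fixed delay between attempts (assumed; the SDK may differ)
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or the attempt budget is spent,
// then rethrows the last error.
async function retryWithBackoff<T>(
  operation: (attempt: number) => Promise<T>,
  policy: RetryPolicy
): Promise<T> {
  if (policy.maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      // Wait only if another attempt remains.
      if (attempt < policy.maxAttempts) await sleep(policy.backoffMs);
    }
  }
  throw lastError;
}

// Usage: an operation that fails twice before succeeding.
async function flakyDemo(): Promise<string> {
  return retryWithBackoff(
    async (attempt) => {
      if (attempt < 3) throw new Error(`attempt ${attempt} failed`);
      return `succeeded on attempt ${attempt}`;
    },
    { maxAttempts: 3, backoffMs: 1 }
  );
}
```

Because retries add to the wall-clock time of the batch, the batch timeout has to cover the worst case of attempts plus delays, not just a single call.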
Quick Start Guide
- Install the Durable Functions SDK: Run npm install @aws-lambda-durable/functions and configure your Lambda handler to use the SDK's context wrapper.
- Identify concurrent boundaries: Search your codebase for Promise.all() and isolate operations that run independently within a single execution context.
- Wrap operations in context.step(): Convert each inline promise into a named async function registered with the SDK scheduler.
- Replace with context.parallel(): Pass the registered steps to context.parallel(), configure timeout and retry policies, and handle the aggregated result array.
- Validate with replay testing: Trigger cold starts and retry conditions to confirm checkpoint alignment. Monitor CloudWatch logs for checkpoint serialization warnings and adjust payload sizes if necessary.