';
export interface PhaseMetrics {
  bootstrapDuration: Histogram;
  executionDuration: Histogram;
}

export class AgentLifecycleProfiler {
  private readonly metrics: PhaseMetrics;
  private readonly bootstrapStart: number;
  // Not readonly: assigned when bootstrap completes.
  private executionStart: number | null = null;

  constructor(meter: Meter, private readonly agentId: string) {
    // Create each histogram once and reuse it for every record() call.
    this.metrics = {
      bootstrapDuration: meter.createHistogram('agent.bootstrap_seconds', {
        description: 'Time from spawn to first tool dispatch',
        unit: 's'
      }),
      executionDuration: meter.createHistogram('agent.execution_seconds', {
        description: 'Time from first tool dispatch to completion',
        unit: 's'
      })
    };
    this.bootstrapStart = performance.now();
  }

  markBootstrapComplete(toolName: string): void {
    const durationMs = performance.now() - this.bootstrapStart;
    // Histograms are named in seconds, so convert from milliseconds.
    this.metrics.bootstrapDuration.record(durationMs / 1000, {
      agent_id: this.agentId,
      first_tool: toolName,
      env: process.env.DEPLOYMENT_ENV || 'unknown'
    });
    this.executionStart = performance.now();
  }

  markExecutionComplete(): void {
    // Nothing to record if bootstrap never completed.
    if (this.executionStart === null) return;
    const durationMs = performance.now() - this.executionStart;
    this.metrics.executionDuration.record(durationMs / 1000, {
      agent_id: this.agentId
    });
  }
}
```
**Rationale:** This implementation decouples timing logic from business flow. By using OpenTelemetry histograms, you enable downstream aggregation at specific percentiles (p95/p99). The `markBootstrapComplete` hook provides the authoritative data point for initialization cost, filtering out noise from warm starts or cached credentials.
#### 2. Calculate Timeouts Using Phase Budgeting
Timeouts must be derived dynamically from observed metrics rather than hardcoded guesses. The total timeout is the sum of the initialization budget, the execution budget, and a safety margin for network jitter and rate limits.
```typescript
export interface BudgetParameters {
  initP95: number;
  execP95: number;
  safetyBuffer: number;
}

export class TimeoutEngine {
  static deriveTotalTimeout(params: BudgetParameters): number {
    const rawTotal = params.initP95 + params.execP95 + params.safetyBuffer;
    return Math.ceil(rawTotal);
  }

  static deriveExecutionBudget(params: BudgetParameters): number {
    // Returns the window available for actual work after init
    return Math.max(0, params.execP95 + params.safetyBuffer);
  }
}

// Configuration derived from metrics pipeline
const currentBudget: BudgetParameters = {
  initP95: 95.4,    // From 7-day rolling p95 of bootstrap metric
  execP95: 420.0,   // From historical workload p95
  safetyBuffer: 180 // 3-minute buffer for delivery latency and retries
};

const schedulerTimeout = TimeoutEngine.deriveTotalTimeout(currentBudget);
// Result: 695.4 -> 696 seconds
```

**Rationale:** This formula forces explicit acknowledgment of the bootstrap tax. When the agent's workspace grows and `initP95` increases, the timeout automatically adjusts if integrated with a configuration management system. The safety buffer absorbs variance in external dependencies, preventing delivery failures caused by transient network issues.
#### 3. Reorder Pipeline Execution
Scheduler deadlines are immutable. When a timeout fires, the runtime terminates the process immediately. If human-facing output is queued behind internal housekeeping, it will be dropped. The execution pipeline must prioritize delivery over archival.
```typescript
export class ExecutionOrchestrator {
  constructor(
    private readonly worker: AgentWorker,
    private readonly notifier: DeliveryNotifier,
    private readonly archiver: ArtifactArchiver
  ) {}

  async run(runId: string): Promise<void> {
    // Phase 1: Core Workload
    const result = await this.worker.process(runId);

    // Phase 2: Critical Delivery (Must complete before deadline)
    // Placed before cleanup to guarantee user visibility
    await this.notifier.send({
      channel: 'slack',
      payload: result.summary,
      runId,
      priority: 'high'
    });

    // Phase 3: Internal Housekeeping (Safe to interrupt)
    // Use allSettled to prevent cleanup errors from masking delivery success
    await Promise.allSettled([
      this.archiver.compress(runId, result.artifacts),
      this.worker.updateLedger(runId, result.metadata)
    ]);
  }
}
```
**Rationale:** By inverting the dependency graph, you ensure that the user receives the output even if the scheduler terminates the process during archival. Internal state mutations like ledger updates can be reconciled later via idempotent backfilling, but a missed notification is often a permanent loss of trust.
#### 4. Decouple Delivery Idempotency from Work Idempotency
Work idempotency prevents duplicate processing. Delivery idempotency prevents duplicate announcements. These concerns must be separated. If a previous run completed the work but failed to deliver, a retry must recognize the missing announcement and re-publish it, regardless of whether backend artifacts already exist.
```typescript
export interface DeliveryRegistry {
  isAnnounced(key: string): Promise<boolean>;
  recordAnnouncement(key: string): Promise<void>;
}

export class DeliveryGuard {
  constructor(private readonly registry: DeliveryRegistry) {}

  async ensureAnnouncement(
    runKey: string,
    payload: string,
    publisher: (msg: string) => Promise<void>
  ): Promise<void> {
    const alreadySent = await this.registry.isAnnounced(runKey);
    if (alreadySent) {
      return;
    }
    // Record only after a successful publish, so a failed delivery is retried.
    await publisher(payload);
    await this.registry.recordAnnouncement(runKey);
  }
}

// Integration within the notifier
const guard = new DeliveryGuard(new RedisDeliveryRegistry());
await guard.ensureAnnouncement(
  `delivery:${runId}:slack`,
  summary,
  async (msg) => {
    // Await and discard the response so the callback resolves to void.
    await slackClient.postMessage({ channel: '#alerts', text: msg });
  }
);
```
**Rationale:** This pattern isolates delivery state from workload state. The `runKey` encodes the run and the channel (`delivery:${runId}:slack` above), so a retry re-announces only when the previous delivery for that run and channel was never recorded. The registry can be backed by Redis, DynamoDB, or SQL, depending on latency requirements.
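To make the retry behaviour concrete, here is a minimal sketch that exercises the same check, publish, record sequence against an in-memory registry. `InMemoryDeliveryRegistry`, `announceOnce`, and `demo` are hypothetical names introduced for illustration, and the publisher simulates one clipped delivery. The `DeliveryRegistry` interface is repeated so the block compiles on its own.

```typescript
interface DeliveryRegistry {
  isAnnounced(key: string): Promise<boolean>;
  recordAnnouncement(key: string): Promise<void>;
}

// Hypothetical in-memory registry, suitable for tests only (state is lost on exit).
class InMemoryDeliveryRegistry implements DeliveryRegistry {
  private readonly announced = new Set<string>();

  async isAnnounced(key: string): Promise<boolean> {
    return this.announced.has(key);
  }

  async recordAnnouncement(key: string): Promise<void> {
    this.announced.add(key);
  }
}

// Same sequence as DeliveryGuard.ensureAnnouncement: check, publish, then record.
async function announceOnce(
  registry: DeliveryRegistry,
  runKey: string,
  payload: string,
  publisher: (msg: string) => Promise<void>
): Promise<'sent' | 'skipped'> {
  if (await registry.isAnnounced(runKey)) return 'skipped';
  await publisher(payload); // if this throws, nothing is recorded
  await registry.recordAnnouncement(runKey);
  return 'sent';
}

async function demo(): Promise<{ log: string[]; delivered: string[] }> {
  const registry = new InMemoryDeliveryRegistry();
  const delivered: string[] = [];
  const log: string[] = [];
  let failNext = true; // the first publish simulates a clipped delivery

  const publisher = async (msg: string): Promise<void> => {
    if (failNext) {
      failNext = false;
      throw new Error('simulated delivery timeout');
    }
    delivered.push(msg);
  };

  const key = 'delivery:run-42:slack'; // made-up run id
  try {
    await announceOnce(registry, key, 'summary', publisher);
  } catch {
    log.push('failed');
  }
  log.push(await announceOnce(registry, key, 'summary', publisher)); // retry publishes
  log.push(await announceOnce(registry, key, 'summary', publisher)); // duplicate is skipped
  return { log, delivered };
}
```

Because the announcement is recorded only after the publish succeeds, this ordering gives at-least-once delivery: a crash between the two calls would cause one duplicate on retry, which is usually preferable to a silent miss.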
## Pitfall Guide
### 1. Conflating Cold and Warm Start Metrics
**Explanation:** Containerized agents often experience cold starts that inflate initialization time by 30–50%. Sizing timeouts based on warm-start benchmarks causes runs to hit the deadline consistently during scale-up events.

**Fix:** Instrument cold and warm paths separately. Use the cold-start p95 for timeout calculation, or implement a pre-warming strategy to maintain a baseline pool of initialized agents.
### 2. Merging Work and Delivery Idempotency
**Explanation:** Teams often assume that checking for existing artifacts implies the notification was sent. This fails when work completes but the delivery network call times out.

**Fix:** Maintain a dedicated delivery registry. Never infer announcement status from backend artifacts. The delivery guard must operate independently of the worker's state checks.
### 3. Budgeting Based on Median Latency
**Explanation:** Median runtime hides tail latency. Model inference, credential resolution, and network calls exhibit long-tail distributions. A timeout set at the median clips roughly half of all runs.

**Fix:** Always use p95 or p99 for budgeting. Track these percentiles in your metrics backend and update timeout configurations when percentiles drift.
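The gap between the median and the tail is easy to see on a small sample. The sketch below uses a nearest-rank percentile and made-up bootstrap durations; `percentile` is a hypothetical helper, and a real metrics backend will typically estimate percentiles from histogram buckets rather than raw samples.

```typescript
// Nearest-rank percentile: the smallest sample such that at least
// a fraction p of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 1) throw new RangeError('p must be in (0, 1]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil(p * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Made-up bootstrap durations in seconds: mostly fast, with a slow tail.
const bootstrapSeconds = [
  40, 42, 41, 43, 39, 44, 40, 42, 41, 45,
  43, 40, 41, 42, 44, 39, 43, 88, 95, 120
];

const median = percentile(bootstrapSeconds, 0.5); // 42
const p95 = percentile(bootstrapSeconds, 0.95);   // 95

// Number of runs that would exceed a budget set at each value.
const clippedAtMedian = bootstrapSeconds.filter((s) => s > median).length; // 9 of 20
const clippedAtP95 = bootstrapSeconds.filter((s) => s > p95).length;       // 1 of 20
```

In this sample the p95 is more than double the median, so a budget sized from the median would cut off nearly half the runs.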
### 4. Prioritizing Cleanup Over Delivery
**Explanation:** Developers naturally group operations: work → log → archive → notify. This ordering guarantees notifications are the first to be dropped when deadlines fire.

**Fix:** Reverse the dependency graph. Human-facing outputs must execute before any internal state mutations that can tolerate interruption.
### 5. Ignoring Delivery Network Variance
**Explanation:** The delivery step involves external APIs with variable latency, rate limits, or TLS handshakes. If the timeout budget does not account for this, the delivery call itself becomes the failure point.

**Fix:** Add explicit network buffers to the safety margin. Implement circuit breakers and retry policies with exponential backoff specifically for the delivery gateway.
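One way to check that the safety margin really covers delivery is to compute the worst-case time a retry policy can consume. The sketch below uses the retry values from the configuration template (3 attempts, 5-second initial delay, multiplier 2.0) and the 180-second safety buffer; the 10-second per-attempt timeout and the helper names are assumptions for illustration, and jitter is ignored.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelaySeconds: number;
  backoffMultiplier: number;
}

// Waits between attempts: there is one fewer wait than there are attempts.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  let delay = policy.initialDelaySeconds;
  for (let attempt = 1; attempt < policy.maxAttempts; attempt++) {
    delays.push(delay);
    delay *= policy.backoffMultiplier;
  }
  return delays;
}

// Worst case: every attempt runs to its timeout and every wait is taken.
function worstCaseDeliverySeconds(
  policy: RetryPolicy,
  perAttemptTimeoutSeconds: number
): number {
  const waiting = backoffDelays(policy).reduce((sum, d) => sum + d, 0);
  return waiting + policy.maxAttempts * perAttemptTimeoutSeconds;
}

const policy: RetryPolicy = {
  maxAttempts: 3,
  initialDelaySeconds: 5,
  backoffMultiplier: 2.0
};

// Assumed 10-second timeout per delivery attempt.
const worstCase = worstCaseDeliverySeconds(policy, 10); // 5 + 10 waiting, 3 * 10 in calls = 45
const fitsInBuffer = worstCase <= 180; // compare against the safety buffer
```

If the worst case exceeds the buffer, either the buffer grows or the retry policy shrinks; otherwise the last retry is the one the scheduler kills.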
### 6. Hardcoding Timeout Values
**Explanation:** Embedding timeout values in YAML or JSON files creates configuration drift. When initialization costs change, the timeout remains static until manually updated.

**Fix:** Compute timeouts dynamically based on observed metrics. Use configuration management tools that pull p95 values from your metrics pipeline at deployment time.
### 7. Benchmarking in Isolation
**Explanation:** Local testing runs agents in isolation. Production schedulers run them concurrently, competing for CPU, memory, and network bandwidth. This contention inflates initialization and execution times.

**Fix:** Benchmark under realistic concurrency. Use load testing tools to simulate production traffic patterns when measuring phase durations.
## Production Bundle
### Action Checklist
### Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Variance Init | Pre-warming + Cold Start Budget | Reduces tail latency and stabilizes delivery windows. | Moderate (resource overhead for pre-warming). |
| Strict Cost Constraints | Phase-Decoupled Timeout | Maximizes success rate without increasing compute spend. | Low (configuration change only). |
| Multi-Channel Delivery | Delivery Registry per Channel | Prevents duplicate notifications across Slack, Email, Webhook. | Low (storage cost for registry). |
| Rapid Workspace Growth | Dynamic Timeout Calculation | Automatically adjusts to increasing initialization costs. | Low (metrics pipeline dependency). |
### Configuration Template
```yaml
# agent-scheduler-config.yaml
# Example configuration for dynamic timeout management
scheduling:
  strategy: phase_decoupled
  metrics_source: opentelemetry_pipeline
  budget:
    # Percentiles derived from 7-day rolling window
    init_p95:
      source: metric:agent.bootstrap_seconds
      percentile: 0.95
    exec_p95:
      source: metric:agent.execution_seconds
      percentile: 0.95
    # Safety buffer in seconds
    safety_buffer: 180
    # Granularity for scheduler rounding
    rounding_granularity: 60 # Round up to nearest minute
delivery:
  idempotency_store: redis
  retry_policy:
    max_attempts: 3
    backoff_multiplier: 2.0
    initial_delay: 5s
```
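The template's `rounding_granularity: 60` implies a rounding step beyond the whole-second `Math.ceil` used in `TimeoutEngine`. A minimal sketch of how that could be applied, assuming the setting means "round the total up to the next multiple of this many seconds"; `roundUpToGranularity` is a hypothetical helper, not part of any scheduler API.

```typescript
// Round a duration up to the next multiple of the scheduler's granularity.
function roundUpToGranularity(seconds: number, granularitySeconds: number): number {
  if (granularitySeconds <= 0) {
    throw new RangeError('granularity must be positive');
  }
  return Math.ceil(seconds / granularitySeconds) * granularitySeconds;
}

// Budget values from the worked example: init p95 + exec p95 + safety buffer.
const rawTotal = 95.4 + 420.0 + 180; // about 695.4 seconds
const rounded = roundUpToGranularity(rawTotal, 60); // 720 seconds (12 minutes)
```

Rounding up rather than to the nearest multiple keeps the scheduler value from ever falling below the computed budget.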
### Quick Start Guide
- **Instrument:** Add the `AgentLifecycleProfiler` to your agent's entry point. Call `markBootstrapComplete` immediately after the first tool dispatch and `markExecutionComplete` at the end of the workload.
- **Collect:** Deploy the instrumentation and allow metrics to accumulate for 24–48 hours. Verify that `agent.bootstrap_seconds` and `agent.execution_seconds` are populating in your metrics backend.
- **Calculate:** Query the p95 values for both metrics. Use the `TimeoutEngine` to compute the new total timeout. Update your scheduler configuration with this value.
- **Reorder:** Refactor your agent's execution pipeline to call the delivery service before any archival or cleanup steps.
- **Validate:** Run a series of scheduled jobs and monitor the delivery success rate. Confirm that the bootstrap tax is now accounted for and deliveries are completing before deadlines.