targets, and assignment boundaries.
export type ExperimentVariant = 'control' | 'treatment_A' | 'treatment_B';

export interface ExperimentConfig {
  id: string;
  version: number;
  variants: Record<ExperimentVariant, number>; // percent weights summing to 100
  primaryMetric: string;
  secondaryMetrics: string[];
  assignmentLevel: 'user' | 'session' | 'device';
  holdoutPercent: number; // percent of traffic (0-100), same scale as the weights
  status: 'draft' | 'active' | 'completed' | 'archived';
}
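Because the weights are plain numbers, nothing in the type system stops a config whose weights do not sum to 100. A minimal validation sketch follows; `validateConfig` and `WeightedConfig` are hypothetical names introduced here for illustration, and the percent scale is taken from the comment on `variants`.

```typescript
// Hypothetical helper, not part of the original contract.
type Variant = 'control' | 'treatment_A' | 'treatment_B';

interface WeightedConfig {
  variants: Record<Variant, number>; // percent weights
  holdoutPercent: number;            // percent of traffic, 0-100
}

// Returns a list of human-readable problems; an empty list means the config is usable.
function validateConfig(config: WeightedConfig): string[] {
  const errors: string[] = [];
  const weights = Object.values(config.variants);

  if (weights.some((w) => !Number.isFinite(w) || w < 0)) {
    errors.push('variant weights must be non-negative finite numbers');
  }

  const total = weights.reduce((sum, w) => sum + w, 0);
  // Tolerate floating-point noise in case weights are fractional percents.
  if (Math.abs(total - 100) > 1e-9) {
    errors.push(`variant weights sum to ${total}, expected 100`);
  }

  if (!(config.holdoutPercent >= 0 && config.holdoutPercent < 100)) {
    errors.push('holdoutPercent must be in [0, 100)');
  }
  return errors;
}

const ok = validateConfig({
  variants: { control: 50, treatment_A: 25, treatment_B: 25 },
  holdoutPercent: 5,
});
const bad = validateConfig({
  // Fractions instead of percents: a common unit mix-up.
  variants: { control: 0.5, treatment_A: 0.25, treatment_B: 0.25 },
  holdoutPercent: 5,
});
console.log(ok, bad);
```

Running a check like this at config load time, before any traffic is routed, catches unit mix-ups that would otherwise silently send almost everyone to the fallback variant.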
Step 2: Build a Deterministic Assignment Engine
Assignment must be consistent across page loads, API calls, and client-server boundaries. Hash-based routing eliminates central state dependencies and guarantees idempotency.
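import { createHash } from 'crypto';

export class AssignmentEngine {
  private readonly salt: string;

  constructor(salt: string) {
    this.salt = salt;
  }

  assign(userId: string, config: ExperimentConfig): ExperimentVariant {
    const hash = createHash('sha256')
      .update(`${this.salt}:${userId}:${config.id}`)
      .digest('hex');

    // Draw the holdout from a different slice of the hash than the variant.
    // Reusing one bucket for both would place the holdout entirely inside
    // the first variant's range, so it would never change any assignment.
    const holdoutBucket = parseInt(hash.slice(8, 16), 16) % 1000;
    if (holdoutBucket < config.holdoutPercent * 10) {
      return 'control'; // holdout treated as control for safety
    }

    // Map the first 32 bits of the hash onto 1000 buckets (0.1% granularity).
    const bucket = parseInt(hash.slice(0, 8), 16) % 1000;
    let cumulative = 0;
    for (const [variant, weight] of Object.entries(config.variants)) {
      cumulative += weight * 10; // percent weight -> 1000-bucket scale
      if (bucket < cumulative) {
        return variant as ExperimentVariant;
      }
    }
    return 'control'; // fallback when weights sum to less than 100
  }
}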
Architecture decision: Hash-based assignment avoids centralized databases, reduces latency, and guarantees deterministic routing even under partial failures. Including the experiment ID in the hash input decorrelates assignments across experiments, and the salt namespaces the deployment so that bucket values cannot be predicted from user IDs alone. Client-side assignment is acceptable for UI experiments; server-side assignment is mandatory for backend algorithms, pricing, or security-sensitive paths.
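One way to sanity-check the determinism and distribution claims is a dry run over synthetic user IDs. The sketch below reduces the bucketing scheme to two standalone functions; `bucketFor`, `pickVariant`, the salt `demo_salt`, and the experiment ID `exp_demo` are made-up names for illustration, and the holdout step is left out to keep the example short.

```typescript
import { createHash } from 'crypto';

// Same SHA-256 bucketing idea as the engine: first 32 bits of the hash mod 1000.
function bucketFor(salt: string, userId: string, experimentId: string): number {
  const hash = createHash('sha256')
    .update(`${salt}:${userId}:${experimentId}`)
    .digest('hex');
  return parseInt(hash.slice(0, 8), 16) % 1000;
}

// Walk the cumulative percent weights until the bucket falls inside a range.
function pickVariant(bucket: number, weights: Record<string, number>): string {
  let cumulative = 0;
  for (const [variant, weight] of Object.entries(weights)) {
    cumulative += weight * 10;
    if (bucket < cumulative) return variant;
  }
  return 'control';
}

const weights: Record<string, number> = { control: 50, treatment_A: 25, treatment_B: 25 };
const counts: Record<string, number> = {};
const total = 20000;

for (let i = 0; i < total; i++) {
  const variant = pickVariant(bucketFor('demo_salt', `user_${i}`, 'exp_demo'), weights);
  counts[variant] = (counts[variant] ?? 0) + 1;
}

// Observed share per variant; these should sit close to 0.50 / 0.25 / 0.25.
const controlShare = (counts['control'] ?? 0) / total;
const treatmentAShare = (counts['treatment_A'] ?? 0) / total;
console.log({ controlShare, treatmentAShare });

// Determinism: the same inputs always land in the same bucket.
const firstCall = bucketFor('demo_salt', 'user_42', 'exp_demo');
const secondCall = bucketFor('demo_salt', 'user_42', 'exp_demo');
console.log(firstCall === secondCall);
```

The observed shares will not match the weights exactly; small deviations are expected from sampling noise, and a large deviation points to a routing bug rather than bad luck.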
Step 3: Instrument Context-Rich Event Emission
Events must carry assignment context, user identifiers, and timestamp precision. Idempotency keys prevent double-counting from retries or SDK misfires.
export interface ExperimentEvent {
  experimentId: string;
  variant: ExperimentVariant;
  userId: string;
  eventId: string; // UUID v4, used as the idempotency key
  timestamp: number; // Unix epoch milliseconds
  payload: Record<string, unknown>;
}

export class ExperimentTracker {
  private readonly endpoint: string;
  private readonly queue: ExperimentEvent[] = [];

  constructor(endpoint: string) {
    this.endpoint = endpoint;
  }

  track(event: ExperimentEvent): void {
    this.queue.push(event);
    if (this.queue.length >= 50) {
      void this.flush(); // fire and forget; flush handles its own errors
    }
  }

  private async flush(): Promise<void> {
    const batch = this.queue.splice(0, 50);
    try {
      const response = await fetch(this.endpoint, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(batch)
      });
      // fetch only rejects on network errors, so treat HTTP error statuses as failures too
      if (!response.ok) {
        throw new Error(`Event ingestion failed with status ${response.status}`);
      }
    } catch (err) {
      // Requeue on failure; implement exponential backoff in production
      this.queue.unshift(...batch);
    }
  }
}
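The requeue comment in the tracker mentions exponential backoff without showing it. The following is a minimal sketch of one common variant (capped exponential delay with full jitter); `backoffDelayMs`, `sendWithRetry`, and all default values are assumptions chosen for illustration, not part of the tracker above.

```typescript
// Hypothetical retry helper; names and defaults are illustrative assumptions.
type Send = () => Promise<void>;

// Upper bound for the wait before retry number `attempt` (0-based):
// baseMs, 2*baseMs, 4*baseMs, ... capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 200, capMs = 10000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Calls `send` until it succeeds or maxAttempts is exhausted.
// Resolves to true on success, false if every attempt failed.
async function sendWithRetry(
  send: Send,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send();
      return true;
    } catch {
      if (attempt === maxAttempts - 1) break;
      // Full jitter: wait a random time between 0 and the capped delay so
      // that many clients failing together do not retry in lockstep.
      await sleep(Math.random() * backoffDelayMs(attempt));
    }
  }
  return false;
}

console.log([0, 1, 2, 3].map((attempt) => backoffDelayMs(attempt)));
```

In a tracker like the one above, a helper of this shape could wrap the POST inside `flush`, with the batch requeued only after `sendWithRetry` resolves to `false`.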
Architecture decision: Batched, asynchronous emission prevents blocking the critical path. The idempotency key (eventId) enables downstream deduplication. Payloads remain flexible but must never mutate after emission to preserve auditability.
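Downstream deduplication on the idempotency key can be sketched as a single pass that keeps the first occurrence of each `eventId`. `dedupeByEventId` and `TrackedEvent` are hypothetical names; a real pipeline would more likely do this in the warehouse or stream processor, and this in-memory version only illustrates the rule.

```typescript
// Minimal shape needed for deduplication; real events carry more fields.
interface TrackedEvent {
  eventId: string;
  timestamp: number;
}

// Keeps the first event seen for each eventId and drops later repeats,
// which is what a retry or SDK double-fire would produce.
function dedupeByEventId<T extends TrackedEvent>(events: T[]): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const current of events) {
    if (seen.has(current.eventId)) continue;
    seen.add(current.eventId);
    unique.push(current);
  }
  return unique;
}

const received: TrackedEvent[] = [
  { eventId: 'a', timestamp: 1 },
  { eventId: 'b', timestamp: 2 },
  { eventId: 'a', timestamp: 3 }, // retry of the first event
];
const deduped = dedupeByEventId(received);
console.log(deduped.map((e) => e.eventId));
```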
Step 4: Decouple Analysis from Infrastructure
Analysis should never live inside the application runtime. Route events through a streaming platform (e.g., Kafka, Kinesis) or a batch pipeline (e.g., Airflow, dbt) into your warehouse, then apply statistical tests in a dedicated environment. Pre-register the analysis protocol: sample size calculation, primary metric definition, stopping rules, and correction method for multiple comparisons.
Architecture decision: Separating assignment, emission, and analysis eliminates coupling between product velocity and statistical rigor. It enables sequential testing, Bayesian updating, and retrospective validation without modifying application code.
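As an illustration of what the dedicated analysis step might compute for a conversion-style metric, here is a sketch of a pooled two-proportion z-test. The function names and the example counts are made up, the normal CDF uses the Abramowitz-Stegun polynomial approximation of erf, and a fixed-horizon test like this does not by itself implement sequential stopping rules.

```typescript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 approximation of erf
// (absolute error around 1e-7, which is ample for a p-value).
function normalCdf(z: number): number {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const poly =
    t * (0.254829592 +
    t * (-0.284496736 +
    t * (1.421413741 +
    t * (-1.453152027 +
    t * 1.061405429))));
  const erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-sided test of H0: both arms share the same conversion rate.
function twoProportionZTest(
  controlConversions: number,
  controlUsers: number,
  treatmentConversions: number,
  treatmentUsers: number,
): { z: number; pValue: number } {
  const p1 = controlConversions / controlUsers;
  const p2 = treatmentConversions / treatmentUsers;
  // Pooled rate under the null hypothesis.
  const pooled =
    (controlConversions + treatmentConversions) / (controlUsers + treatmentUsers);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / controlUsers + 1 / treatmentUsers),
  );
  const z = (p2 - p1) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue };
}

// Made-up counts: 10.0% vs 11.0% conversion with 10,000 users per arm.
const result = twoProportionZTest(1000, 10000, 1100, 10000);
console.log(result);
```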
Pitfall Guide
- Peeking Before Sample Size: Checking results mid-flight and stopping at the first significant reading inflates Type I error. A nominal 5% false positive rate can climb to 20-30% or higher with repeated checks. Use sequential testing boundaries or Bayesian credible intervals with pre-defined stopping rules.
- Ignoring Multiple Comparisons: Running 10 independent tests at alpha = 0.05 without correction gives roughly a 40% chance (1 - 0.95^10) of at least one false positive. Apply Bonferroni, Holm-Bonferroni, or false discovery rate (FDR) control depending on hypothesis independence.
- Inconsistent Traffic Allocation: Mixing user-level and session-level assignment creates overlapping groups and violates independence assumptions. Lock assignment level at experiment creation and enforce it in the routing layer.
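- Metric Contamination: Promoting secondary metrics to decision gates multiplies the number of comparisons and dilutes the statistical power available to each one. Define one primary metric per experiment and power the test for it. Secondary metrics are for diagnostic insight, not decision gates.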
- Novelty and Habituation Effects: Initial engagement spikes or drops distort early results. Exclude the first 48-72 hours from analysis or model time-decay explicitly.
- Infrastructure Latency Skew: Slow flag evaluation or event emission biases results toward faster clients or regions. Measure assignment latency and drop events that exceed P95 thresholds from analysis.
- Poor Stratification: Failing to account for geography, device type, or acquisition channel introduces confounding variables. Stratify randomization or include covariates in the analysis model.
Best practices from production: Pre-register hypotheses and analysis protocols before launch. Maintain a 5-10% holdout group to detect long-term drift. Cache assignment results locally to avoid repeated hash computation. Validate power calculations with realistic effect size estimates, not optimistic projections. Treat experiment configuration as immutable infrastructure; never update weights or variants mid-flight.
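The multiple-comparisons pitfall can be made concrete with two small functions: one computing the family-wise error rate for independent tests, and one applying the Holm-Bonferroni step-down procedure. Both function names and the example p-values are made up for illustration.

```typescript
// Probability of at least one false positive across `tests` independent
// tests, each run at significance level `alpha`.
function familyWiseErrorRate(alpha: number, tests: number): number {
  return 1 - Math.pow(1 - alpha, tests);
}

// Holm-Bonferroni: sort p-values ascending and compare the k-th smallest
// (0-based) against alpha / (m - k); stop at the first one that fails.
// Returns a reject/keep decision for each p-value in its original position.
function holmBonferroni(pValues: number[], alpha = 0.05): boolean[] {
  const m = pValues.length;
  const ordered = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);
  const rejected: boolean[] = new Array(m).fill(false);

  let rank = 0;
  for (const { p, index } of ordered) {
    if (p > alpha / (m - rank)) break; // all larger p-values are kept too
    rejected[index] = true;
    rank++;
  }
  return rejected;
}

const fwer = familyWiseErrorRate(0.05, 10);
const decisions = holmBonferroni([0.01, 0.04, 0.03, 0.005]);
console.log(fwer.toFixed(3), decisions);
// prints: 0.401 [ true, false, false, true ]
```

In the example, 0.03 and 0.04 would both pass an uncorrected 0.05 threshold, but Holm-Bonferroni keeps their null hypotheses because 0.03 exceeds its step threshold of 0.05 / 2 = 0.025.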
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| UI component redesign | Client-side hash assignment + event tracking | Low latency, easy A/B routing, minimal backend changes | Low: SDK integration only |
| Backend ranking algorithm | Server-side assignment + streaming event pipeline | Deterministic routing, prevents client manipulation, handles high throughput | Medium: Requires infra pipeline and cache layer |
| Pricing/packaging change | Server-side assignment + strict holdout + sequential testing | Revenue impact demands high statistical rigor; holdout prevents long-term leakage | High: Requires financial modeling and extended run time |
| Onboarding flow optimization | Session-level assignment + stratified randomization | Users may switch devices; session consistency preserves UX flow | Low-Medium: Requires session management and cookie fallback |
| Feature flag rollout | Canary deployment with automated rollback triggers | Gradual exposure reduces blast radius; automated metrics gate deployment | Medium: Requires CI/CD integration and alerting |
Configuration Template
{
  "experimentId": "exp_onboarding_v3",
  "version": 1,
  "status": "active",
  "assignmentLevel": "user",
  "holdoutPercent": 5,
  "variants": {
    "control": 50,
    "treatment_progress_bar": 25,
    "treatment_tooltips": 25
  },
  "primaryMetric": "completion_rate",
  "secondaryMetrics": ["time_to_complete", "drop_off_step_2"],
  "analysisProtocol": {
    "testType": "two_sample_proportion",
    "alpha": 0.05,
    "power": 0.80,
    "minimumDetectableEffect": 0.03,
    "stoppingRule": "sequential_alpha_spending",
    "multipleComparisonCorrection": "holm_bonferroni"
  },
  "routing": {
    "salt": "prod_exp_salt_2024_q3",
    "cacheTtlSeconds": 3600,
    "fallbackVariant": "control"
  }
}
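The `alpha`, `power`, and `minimumDetectableEffect` fields in the template are enough to estimate a per-arm sample size once a baseline rate is known. The sketch below uses the standard normal-approximation formula for two proportions. It assumes the minimum detectable effect is an absolute lift, hard-codes the z-values for a two-sided alpha of 0.05 and power of 0.80, and uses a made-up baseline completion rate of 40%. It does not account for sequential testing or multiple-comparison corrections, both of which increase the required sample.

```typescript
// Per-arm sample size for comparing two proportions (normal approximation):
//   n = (zAlpha + zBeta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
function sampleSizePerArm(
  baselineRate: number,
  minimumDetectableEffect: number, // assumed to be an absolute lift
  zAlpha = 1.959964, // two-sided alpha = 0.05
  zBeta = 0.841621,  // power = 0.80
): number {
  const p1 = baselineRate;
  const p2 = baselineRate + minimumDetectableEffect;
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const n = (Math.pow(zAlpha + zBeta, 2) * variance) / Math.pow(p2 - p1, 2);
  return Math.ceil(n);
}

// Hypothetical baseline: 40% of users complete onboarding today.
const perArm = sampleSizePerArm(0.4, 0.03);
console.log(perArm);
```

With these assumptions the result is a little over four thousand users per arm; a lower baseline or a smaller effect changes the figure substantially, which is why the estimate belongs in the pre-registered protocol.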
Quick Start Guide
- Define the experiment contract: Create a configuration object matching the template. Set assignment level, variant weights, primary metric, and analysis protocol. Commit it to version control.
- Deploy the assignment engine: Integrate the hash-based AssignmentEngine into your routing layer. Cache results per user/session with a TTL matching your traffic pattern.
- Instrument event emission: Wrap critical user actions with the ExperimentTracker. Attach experimentId, variant, and eventId to every payload. Route to your event pipeline.
- Validate and launch: Run a dry-run with synthetic traffic to verify assignment distribution. Confirm event pipeline ingestion. Start the experiment and monitor assignment latency and event drop rates.
- Analyze post-launch: Wait until pre-calculated sample size is reached. Apply the registered statistical test. Compare against holdout. Document results and archive the configuration.
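The "verify assignment distribution" step in the list above can be automated with a chi-square goodness-of-fit check of observed counts against the configured weights. This is a sketch under stated assumptions: `chiSquareStatistic` is a hypothetical helper, the counts are made up, every expected weight is assumed to be non-zero, and the critical value 13.816 is the chi-square threshold for 2 degrees of freedom (three variants) at p = 0.001.

```typescript
// Chi-square goodness-of-fit statistic: sum of (observed - expected)^2 / expected,
// where expected counts come from the configured percent weights.
function chiSquareStatistic(observed: number[], expectedPercents: number[]): number {
  const totalUsers = observed.reduce((sum, n) => sum + n, 0);
  let statistic = 0;
  observed.forEach((count, idx) => {
    const expected = (totalUsers * (expectedPercents[idx] ?? 0)) / 100;
    statistic += Math.pow(count - expected, 2) / expected;
  });
  return statistic;
}

// Critical value for 2 degrees of freedom at p = 0.001.
const CRITICAL_DF2_P001 = 13.816;

const balanced = chiSquareStatistic([5000, 2500, 2500], [50, 25, 25]);
const skewed = chiSquareStatistic([5200, 2400, 2400], [50, 25, 25]);
console.log(balanced > CRITICAL_DF2_P001, skewed > CRITICAL_DF2_P001);
// prints: false true
```

A statistic above the critical value suggests the observed split is unlikely under the configured weights, which usually means a routing or logging problem worth fixing before any metric is analyzed.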