mless identity merge upon signup.
Architecture Decision: Use an event-sourcing pattern for user actions. This allows replayability for debugging onboarding drop-offs and provides the raw data for calculating "Aha" moments.
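To make the replayability claim concrete, here is a minimal sketch of folding a stored event log into per-user activation state. The `StoredEvent` shape, the `replayActivation` helper, and the idea of counting a single target event are assumptions for illustration, not part of any specific analytics product.

```typescript
// Hypothetical stored-event shape; mirrors the fields a telemetry client emits.
interface StoredEvent {
  event: string;
  userId: string;
  timestamp: number;
}

interface ActivationState {
  count: number;
  activatedAt: number | null;
}

// Replay an append-only log in timestamp order and record when each user
// first reached `threshold` occurrences of `targetEvent`.
function replayActivation(
  log: StoredEvent[],
  targetEvent: string,
  threshold: number,
): Map<string, ActivationState> {
  const state = new Map<string, ActivationState>();
  // Copy before sorting so the stored log itself is never mutated.
  const ordered = log.slice().sort((a, b) => a.timestamp - b.timestamp);
  for (const e of ordered) {
    if (e.event !== targetEvent) continue;
    const s = state.get(e.userId) ?? { count: 0, activatedAt: null };
    s.count += 1;
    if (s.count === threshold) s.activatedAt = e.timestamp;
    state.set(e.userId, s);
  }
  return state;
}
```

Because the log is the source of truth, the same function can be re-run with a different event name or threshold when the definition of activation changes.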
TypeScript Implementation: Telemetry Client
This client handles batching, in-memory re-queuing of failed sends, and identity management.
```typescript
interface TelemetryEvent {
  event: string;
  properties: Record<string, unknown>;
  timestamp: number;
  userId?: string;
  sessionId: string;
}

class TelemetryClient {
  private queue: TelemetryEvent[] = [];
  private batchSize = 20;
  private flushInterval = 5000;
  // Set by identify() so later events are attributed without passing userId.
  private userId?: string;

  constructor(private endpoint: string, private apiKey: string) {
    this.startFlushLoop();
  }

  track(event: string, properties: Record<string, unknown> = {}, userId?: string): void {
    const eventObj: TelemetryEvent = {
      event,
      properties,
      timestamp: Date.now(),
      userId: userId ?? this.userId,
      sessionId: this.getSessionId(),
    };
    this.queue.push(eventObj);
    if (this.queue.length >= this.batchSize) {
      // Fire and forget; flush() handles its own errors.
      void this.flush();
    }
  }

  identify(userId: string): void {
    // Remember the user for future events
    this.userId = userId;
    // Backfill events queued while the user was still anonymous
    this.queue.forEach(e => { e.userId = userId; });
    // Send identify event to merge anonymous session with user
    this.track('$identify', { userId }, userId);
  }

  private async flush(): Promise<void> {
    if (this.queue.length === 0) return;
    const batch = this.queue.splice(0, this.batchSize);
    try {
      const response = await fetch(this.endpoint, {
        method: 'POST',
        headers: {
          'Authorization': `Bearer ${this.apiKey}`,
          'Content-Type': 'application/json',
        },
        body: JSON.stringify(batch),
      });
      // fetch only rejects on network errors, so treat HTTP errors as failures too.
      if (!response.ok) {
        throw new Error(`Telemetry endpoint returned ${response.status}`);
      }
    } catch (error) {
      // Re-queue at the front on failure so event order is preserved
      this.queue.unshift(...batch);
      console.error('Telemetry flush failed', error);
    }
  }

  private startFlushLoop(): void {
    setInterval(() => void this.flush(), this.flushInterval);
  }

  private getSessionId(): string {
    // Implementation for session persistence
    return 'session-id-placeholder';
  }
}
```
2. Dynamic Onboarding Engine
Static checklists kill conversion. Onboarding must be dynamic, driven by feature flags and user intent. The system should detect the user's goal (e.g., "Import Data" vs. "Invite Team") and surface relevant steps while hiding irrelevant ones.
Architecture Decision: Decouple onboarding state from the core domain model. Store onboarding progress in a separate service to allow A/B testing of flows without migrating domain tables.
TypeScript Implementation: Onboarding Controller
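Selects the next step from user state. Each step's condition also receives the telemetry client, so steps can be skipped based on observed user actions; feature-flag checks belong in the same condition functions.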
```typescript
// Minimal user shape assumed by the conditions below.
interface User {
  id: string;
  projects: unknown[];
  teamMembers: number;
  completedSteps: string[];
}

interface OnboardingStep {
  id: string;
  condition: (user: User, telemetry: TelemetryClient) => boolean;
  action: () => Promise<void>;
}

class OnboardingEngine {
  // Ordered by priority: the first matching step is the one recommended.
  private steps: OnboardingStep[] = [
    {
      id: 'create-project',
      condition: (u) => u.projects.length === 0,
      action: () => this.promptCreateProject(),
    },
    {
      id: 'invite-team',
      condition: (u) => u.projects.length > 0 && u.teamMembers === 0,
      action: () => this.promptInviteTeam(),
    },
  ];

  async getRecommendedStep(user: User, telemetry: TelemetryClient): Promise<OnboardingStep | null> {
    // Filter steps that are not yet completed and match conditions
    const activeSteps = this.steps.filter(step =>
      !user.completedSteps.includes(step.id) && step.condition(user, telemetry)
    );
    // Return highest priority step (simplified logic)
    return activeSteps[0] ?? null;
  }

  async markComplete(stepId: string, userId: string): Promise<void> {
    // Update user profile
    // Emit telemetry event for analysis
    // Trigger post-completion hook (e.g., unlock feature)
  }

  private async promptCreateProject(): Promise<void> {
    // Placeholder: open the "create project" prompt in the UI
  }

  private async promptInviteTeam(): Promise<void> {
    // Placeholder: open the "invite team" prompt in the UI
  }
}
```
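The three steps listed in the `markComplete` comments can be sketched as follows. This is an in-memory illustration only: the `InMemoryOnboardingProgress` class, the `onboarding_step_completed` event name, and the hook map are assumptions, and a real progress store would be an asynchronous service call.

```typescript
type EmitFn = (event: string, properties: Record<string, unknown>, userId: string) => void;
type HookFn = (userId: string) => void;

// Hypothetical in-memory stand-in for the separate onboarding progress service.
class InMemoryOnboardingProgress {
  private completed = new Map<string, Set<string>>();
  private emit: EmitFn;
  private hooks: Map<string, HookFn>;

  constructor(emit: EmitFn, hooks: Map<string, HookFn> = new Map()) {
    this.emit = emit;
    this.hooks = hooks;
  }

  // Returns true only the first time a step is completed for a user, so
  // repeated calls never emit duplicate events or re-run hooks.
  markComplete(stepId: string, userId: string): boolean {
    const steps = this.completed.get(userId) ?? new Set<string>();
    if (steps.has(stepId)) return false;
    // 1. Update stored progress
    steps.add(stepId);
    this.completed.set(userId, steps);
    // 2. Emit telemetry event for analysis
    this.emit('onboarding_step_completed', { stepId }, userId);
    // 3. Trigger the post-completion hook, if one is registered
    const hook = this.hooks.get(stepId);
    if (hook) hook(userId);
    return true;
  }

  completedSteps(userId: string): string[] {
    return Array.from(this.completed.get(userId) ?? []);
  }
}
```

Making completion idempotent matters because the UI may report the same step more than once, for example after a page reload.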
3. Real-Time Usage Meter
Usage-based pricing requires a metering system that is accurate, idempotent, and low-latency. The meter must interface with the billing provider (e.g., Stripe Metered Billing) to enforce limits and trigger upgrades.
Architecture Decision: Implement a "Metering Aggregator" service. Raw events are collected in a time-series database or stream, aggregated in near real-time, and pushed to the billing provider in batches. Batching keeps the meter under the billing API's rate limits, and a single aggregation point keeps reported totals consistent.
TypeScript Implementation: Usage Meter
Handles batching and idempotency keys for billing updates.
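```typescript
// Minimal billing interface assumed by the meter; adapt it to your provider's SDK.
interface BillingClient {
  reportUsage(
    customerId: string,
    featureId: string,
    quantity: number,
    options: { idempotencyKey: string; timestamp: number },
  ): Promise<void>;
}

interface UsageRecord {
  featureId: string;
  quantity: number;
  timestamp: number;
  customerId: string;
}

class UsageMeter {
  private buffer: Map<string, UsageRecord[]> = new Map();
  // Failed batches keep their original idempotency key so a retry cannot double bill.
  private deadLetter: { records: UsageRecord[]; idempotencyKey: string }[] = [];
  private batchSeq = 0;
  private billingClient: BillingClient;

  constructor(billingClient: BillingClient) {
    this.billingClient = billingClient;
  }

  async recordUsage(customerId: string, featureId: string, quantity: number): Promise<void> {
    const key = `${customerId}:${featureId}`;
    const records = this.buffer.get(key) ?? [];
    records.push({
      customerId,
      featureId,
      quantity,
      timestamp: Date.now(),
    });
    this.buffer.set(key, records);
    // Flush buffer if size threshold met
    if (records.length >= 10) {
      await this.flushUsage(key);
    }
  }

  // Flush partial batches too (e.g., on a timer or at shutdown) so
  // low-volume usage is still reported.
  async flushAll(): Promise<void> {
    await Promise.all(Array.from(this.buffer.keys()).map(key => this.flushUsage(key)));
  }

  private async flushUsage(key: string): Promise<void> {
    const records = this.buffer.get(key);
    if (!records || records.length === 0) return;
    // Detach the batch before awaiting, so records added while the request
    // is in flight start a new batch instead of being deleted with this one.
    this.buffer.delete(key);
    const totalQuantity = records.reduce((sum, r) => sum + r.quantity, 0);
    const latestTimestamp = records[records.length - 1].timestamp;
    // Idempotency key prevents double billing on retries
    const idempotencyKey = this.generateIdempotencyKey(key, latestTimestamp);
    try {
      await this.billingClient.reportUsage(
        records[0].customerId,
        records[0].featureId,
        totalQuantity,
        { idempotencyKey, timestamp: latestTimestamp }
      );
    } catch (error) {
      // Park the batch for retry or a dead-letter queue instead of dropping it
      this.deadLetter.push({ records, idempotencyKey });
      console.error('Usage reporting failed', error);
    }
  }

  private generateIdempotencyKey(key: string, timestamp: number): string {
    // The sequence number keeps two batches flushed in the same millisecond distinct.
    this.batchSeq += 1;
    return `${key}-${timestamp}-${this.batchSeq}`;
  }
}
```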
Architecture Rationale
- Decoupling: Telemetry, Onboarding, and Metering are separate services. This allows independent scaling and deployment. Onboarding changes do not risk billing accuracy.
- Idempotency: Billing operations are not idempotent by default: a retried request can be charged twice. The metering layer must attach idempotency keys so that retries neither double bill customers (causing disputes) nor drop usage (leaking revenue).
- Event-Driven: Usage events flow through a message queue (e.g., Kafka, SQS) to the metering aggregator, ensuring high throughput during traffic spikes without blocking the request path.
Pitfall Guide
1. Tracking Everything, Learning Nothing
Mistake: Instrumenting every click results in data noise. Teams drown in metrics and cannot identify the North Star behavior.
Best Practice: Define a single North Star Metric (NSM) before implementation. Track only events that correlate with the NSM. Use sampling for high-volume, low-value events.
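One way to sample high-volume events is to hash the session ID, so a given session is either fully kept or fully dropped for that event type. The sketch below uses the FNV-1a hash; the event names, the rates, and the `shouldTrack` helper are hypothetical.

```typescript
// FNV-1a 32-bit hash; used only to spread session IDs across [0, 2^32).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical per-event sample rates; events not listed are always kept.
const defaultSampleRates = new Map<string, number>([
  ['mouse_moved', 0.01],
  ['page_scrolled', 0.1],
]);

// Deterministic: the same event name and session always give the same answer.
function shouldTrack(
  event: string,
  sessionId: string,
  rates: Map<string, number> = defaultSampleRates,
): boolean {
  const rate = rates.get(event) ?? 1;
  if (rate >= 1) return true;
  if (rate <= 0) return false;
  return fnv1a(`${event}:${sessionId}`) / 0x100000000 < rate;
}
```

Deterministic sampling keeps funnels intact within a sampled session, which random per-event sampling would not. Sampled counts must be scaled by the inverse of the rate before comparing them with unsampled events.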
2. Hardcoding "Aha" Moments
Mistake: Engineering hardcodes thresholds (e.g., "User sent 5 messages") as the definition of activation. User behavior evolves, rendering these thresholds obsolete.
Best Practice: Store activation thresholds in a configuration service or experiment platform. Re-derive them on a regular cadence (for example, quarterly) by comparing the behavior of retained and churned users.
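As a rough illustration of comparing retained and churned users, the sketch below picks the candidate threshold with the largest retention gap between users at or above it and users below it. The `UserSample` shape and this gap heuristic are assumptions; a production analysis would also check sample sizes and statistical significance.

```typescript
// Hypothetical per-user sample: how often the user did the core action,
// and whether they were still active at the end of the window.
interface UserSample {
  actionCount: number;
  retained: boolean;
}

function retentionRate(users: UserSample[]): number {
  if (users.length === 0) return 0;
  return users.filter(u => u.retained).length / users.length;
}

// Returns the candidate with the largest retention gap, or null if no
// candidate splits the users into two non-empty groups.
function suggestThreshold(users: UserSample[], candidates: number[]): number | null {
  let best: number | null = null;
  let bestGap = -Infinity;
  for (const t of candidates) {
    const above = users.filter(u => u.actionCount >= t);
    const below = users.filter(u => u.actionCount < t);
    if (above.length === 0 || below.length === 0) continue;
    const gap = retentionRate(above) - retentionRate(below);
    if (gap > bestGap) {
      bestGap = gap;
      best = t;
    }
  }
  return best;
}
```

The suggested value would then be written to the configuration service rather than compiled into the product.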
3. Ignoring Free Tier Abuse
Mistake: PLG products with generous free tiers attract bad actors who exploit APIs or storage limits, inflating infrastructure costs.
Best Practice: Implement rate limiting and anomaly detection on the metering layer. Set hard caps on usage that trigger immediate suspension, not just warnings. Monitor cost-per-user in the free tier.
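A fixed-window counter is one of the simplest ways to rate limit on the metering layer. The sketch below is an in-memory, single-process illustration; the `FixedWindowLimiter` name and its parameters are assumptions, and a multi-instance deployment would need a shared store.

```typescript
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and counts it; false once the
  // customer has used up the current window. `now` is injectable for tests.
  allow(customerId: string, now: number = Date.now()): boolean {
    let w = this.windows.get(customerId);
    if (!w || now - w.start >= this.windowMs) {
      // Start a fresh window for this customer.
      w = { start: now, count: 0 };
      this.windows.set(customerId, w);
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}
```

Fixed windows allow short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.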
4. Asynchronous Billing Latency
Mistake: Usage is reported to the billing provider with a 24-hour delay. Users hit limits but can still use the product, or upgrades are delayed, causing frustration.
Best Practice: Implement a local usage cache for real-time limit enforcement. Report to the billing provider asynchronously, but block or warn users based on local state. Ensure the local cache is eventually consistent with the billing provider.
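The local cache described above might look like the following sketch. The `LocalUsageCache` class, the 80% warning ratio, and the rule that reconciliation never lowers the local count are assumptions chosen for illustration.

```typescript
type LimitDecision = 'allow' | 'warn' | 'block';

class LocalUsageCache {
  private used = new Map<string, number>();
  private readonly limit: number;
  private readonly warnRatio: number;

  constructor(limit: number, warnRatio: number = 0.8) {
    this.limit = limit;
    this.warnRatio = warnRatio;
  }

  // Check and count in one step, so enforcement never waits on the billing provider.
  consume(customerId: string, quantity: number = 1): LimitDecision {
    const current = this.used.get(customerId) ?? 0;
    const next = current + quantity;
    if (next > this.limit) return 'block'; // blocked usage is not counted
    this.used.set(customerId, next);
    return next >= this.limit * this.warnRatio ? 'warn' : 'allow';
  }

  // Periodically pull the provider's total. The local count may be ahead
  // because reports are asynchronous, so keep the larger of the two.
  reconcile(customerId: string, providerTotal: number): void {
    const local = this.used.get(customerId) ?? 0;
    this.used.set(customerId, Math.max(local, providerTotal));
  }

  usage(customerId: string): number {
    return this.used.get(customerId) ?? 0;
  }
}
```

The `warn` decision gives the frontend a chance to show an upgrade prompt before the user is actually blocked.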
5. Over-Engineering Onboarding
Mistake: Building complex, multi-modal onboarding flows that require database migrations or heavy client-side logic.
Best Practice: Keep onboarding stateless where possible. Use feature flags to control visibility. The onboarding engine should be lightweight; if the engine fails, the product must remain usable.
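"The product must remain usable" can be enforced with a small fail-open wrapper around every call into the onboarding engine. The `failOpen` helper and the `recommendStepId` stand-in below are hypothetical; an asynchronous version would follow the same pattern with `try`/`catch` around an `await`.

```typescript
// Run `compute`; on any error, report it and return `fallback` instead of
// letting the failure reach the rest of the product.
function failOpen<T>(
  compute: () => T,
  fallback: T,
  onError?: (error: unknown) => void,
): T {
  try {
    return compute();
  } catch (error) {
    if (onError) onError(error);
    return fallback;
  }
}

// Hypothetical engine call that may throw (bad config, network error, ...).
function recommendStepId(userId: string): string {
  if (userId === '') throw new Error('missing user');
  return 'create-project';
}

// A null step simply means "show no onboarding prompt".
const nextStepId = failOpen<string | null>(() => recommendStepId('u1'), null);
```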
6. Schema Drift in Telemetry
Mistake: Frontend teams change event names or property structures without updating the analytics pipeline, breaking dashboards and automated triggers.
Best Practice: Enforce a schema registry for telemetry events. Use code generation to create TypeScript interfaces for events, ensuring type safety across frontend and backend. CI checks should reject events that do not match the schema.
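A CI check against the registry can be as small as the sketch below. The registry here is a hand-written map from event name to allowed property names; the `validateEvent` helper is hypothetical, and a real registry would typically also validate property types.

```typescript
type EventRegistry = Map<string, readonly string[]>;

// Hypothetical registry contents, keyed by event name.
const eventRegistry: EventRegistry = new Map([
  ['signup_completed', ['plan', 'source', 'referral_code']],
  ['api_call_made', ['endpoint', 'latency_ms', 'status']],
]);

// Returns a list of problems; an empty list means the event matches the schema.
function validateEvent(
  name: string,
  properties: Record<string, unknown>,
  registry: EventRegistry = eventRegistry,
): string[] {
  const expected = registry.get(name);
  if (!expected) return [`unknown event: ${name}`];
  const errors: string[] = [];
  for (const key of expected) {
    if (!(key in properties)) errors.push(`missing property: ${key}`);
  }
  for (const key of Object.keys(properties)) {
    if (expected.indexOf(key) === -1) errors.push(`unexpected property: ${key}`);
  }
  return errors;
}
```

Running this over recorded fixtures in CI catches renamed events and properties before they reach the analytics pipeline.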
7. Missing Upgrade Path in Error States
Mistake: When a user hits a limit, the error message is generic ("Error 403"). The user does not know how to upgrade.
Best Practice: Every usage limit violation must return a structured error code that the frontend maps to a specific upgrade modal. The error payload should include the subscription_url or checkout_session_id.
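A structured error and its frontend mapping could be sketched like this. The `USAGE_LIMIT_EXCEEDED` code, the payload fields other than `subscription_url`, and the modal naming are assumptions for illustration.

```typescript
// Hypothetical payload returned by the API when a usage limit is hit.
interface LimitExceededError {
  code: 'USAGE_LIMIT_EXCEEDED';
  featureId: string;
  limit: number;
  used: number;
  subscription_url: string;
}

// Type guard: the frontend receives untyped JSON, so check before trusting it.
function isLimitExceeded(body: unknown): body is LimitExceededError {
  if (typeof body !== 'object' || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    b['code'] === 'USAGE_LIMIT_EXCEEDED' &&
    typeof b['featureId'] === 'string' &&
    typeof b['limit'] === 'number' &&
    typeof b['used'] === 'number' &&
    typeof b['subscription_url'] === 'string'
  );
}

// Map the structured error to a feature-specific upgrade modal;
// returns null for any other error so generic handling can take over.
function upgradeModalFor(body: unknown): { modal: string; href: string } | null {
  if (!isLimitExceeded(body)) return null;
  return { modal: `upgrade:${body.featureId}`, href: body.subscription_url };
}
```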
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early Stage / MVP | Managed Telemetry + Stripe Metered Billing | Speed to market; reduces dev overhead. | Low CapEx, higher variable SaaS costs. |
| High Volume / Scale | Self-hosted ClickHouse + Custom Metering Aggregator | Cost control at scale; full data ownership. | High Dev cost, lower marginal cost per event. |
| Enterprise Hybrid | PLG Core + Sales Assist API | Allows self-serve but captures lead data for sales outreach. | Medium Dev cost; enables larger deal sizes. |
| Regulated Industry | On-prem Metering + Private Telemetry | Compliance requirements for data residency. | High infrastructure cost; limits growth velocity. |
Configuration Template
Use this JSON structure to define your event schema and usage features. This serves as the source of truth for code generation and analytics configuration.
```json
{
  "schemaVersion": "1.0.0",
  "northStarMetric": {
    "event": "core_action_completed",
    "threshold": 10,
    "window": "30d"
  },
  "events": [
    {
      "name": "signup_completed",
      "properties": ["plan", "source", "referral_code"],
      "triggers": ["onboarding_start"]
    },
    {
      "name": "api_call_made",
      "properties": ["endpoint", "latency_ms", "status"],
      "metering": {
        "featureId": "api_requests",
        "aggregation": "sum",
        "unit": "request"
      }
    }
  ],
  "usageFeatures": [
    {
      "id": "api_requests",
      "limits": {
        "free": 1000,
        "pro": 50000
      },
      "pricing": {
        "overage": 0.001
      }
    }
  ]
}
```
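Since this file is the source of truth, it is worth checking its internal references before generating code from it. The sketch below assumes a hypothetical `findConfigErrors` helper and types only the fields it inspects; it verifies that event names are unique and that every `metering.featureId` points at a declared usage feature.

```typescript
// Partial typing of the configuration: only the fields this check reads.
interface SchemaConfig {
  events: { name: string; metering?: { featureId: string } }[];
  usageFeatures: { id: string; limits: Record<string, number> }[];
}

function findConfigErrors(config: SchemaConfig): string[] {
  const errors: string[] = [];
  const featureIds = new Set(config.usageFeatures.map(f => f.id));
  const seenEvents = new Set<string>();
  for (const e of config.events) {
    if (seenEvents.has(e.name)) errors.push(`duplicate event: ${e.name}`);
    seenEvents.add(e.name);
    // A metered event must reference a feature declared in usageFeatures.
    if (e.metering && !featureIds.has(e.metering.featureId)) {
      errors.push(`event ${e.name} references unknown feature ${e.metering.featureId}`);
    }
  }
  return errors;
}
```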
Quick Start Guide
- Initialize Telemetry: Install the telemetry SDK and configure the `track` method in your main application entry point. Map anonymous sessions immediately.
- Define First Event: Instrument the `signup_completed` event. Ensure it captures `source` and `referral_code` for attribution.
- Enable Metering: Add the `UsageMeter` middleware to your API gateway. Instrument the `api_call_made` event to report usage for billing.
- Test Upgrade Flow: Create a test user, exceed the free tier limit, and verify the UI displays the upgrade modal. Confirm the checkout session creates successfully and access updates post-payment.
- Monitor Dashboard: Verify events appear in your analytics dashboard within 2 minutes. Check the usage metering logs for successful billing reports.