Avoid vanity metrics like "uptime" in favor of availability, latency, and correctness as experienced by the user.
Technical Implementation: Define SLIs as code to ensure they are versioned and reviewed alongside application logic.
// sli-definitions.ts
export interface SLI {
  name: string;
  description: string;
  query: string; // PromQL or equivalent
  unit: 'ratio' | 'count' | 'duration' | 'bytes';
}
export const USER_FACING_SLI: SLI[] = [
  {
    name: 'http_request_success_rate',
    description: 'Fraction of HTTP requests that succeed (2xx/3xx), measured over a 5-minute window.',
    query: `sum(rate(http_requests_total{status=~"2..|3.."}[5m])) / sum(rate(http_requests_total[5m]))`,
    unit: 'ratio' // the query returns a value between 0 and 1, not a count
  },
  {
    name: 'p99_latency',
    description: '99th percentile latency of API requests.',
    query: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))`,
    unit: 'duration'
  }
];
// slo-manager.ts
export interface SLO {
  service: string;
  sli: string; // name of the SLI this objective applies to
  target: number; // e.g., 0.999 for 99.9%
  window: string; // e.g., '30d'
}
// Assumes a "higher is better" SLI such as a success ratio
export const calculateSLOCompliance = (currentValue: number, target: number): boolean => {
  return currentValue >= target;
};
Architecture Decision: Store SLOs in a centralized configuration service. This allows dynamic updates and integration with CI/CD pipelines. The SLO target should reflect the cost of reliability; a 99.99% SLO costs significantly more than 99.9% and should only be applied to critical paths.
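If SLO definitions are loaded from a central configuration service, a CI step could validate them before they take effect. The following is a minimal sketch under that assumption; `validateSLOs` and the `SLOInput` shape are hypothetical names used for illustration, not part of any library.

```typescript
// slo-validation.ts (hypothetical helper for illustration)
export interface SLOInput {
  service: string;
  sli: string;
  target: number; // fraction, e.g. 0.999
  window: string; // e.g. '30d'
}

// Returns human-readable problems; an empty list means the config passed.
export function validateSLOs(slos: SLOInput[], knownSliNames: string[]): string[] {
  const problems: string[] = [];
  for (const slo of slos) {
    const label = `${slo.service}/${slo.sli}`;
    if (knownSliNames.indexOf(slo.sli) === -1) {
      problems.push(`${label}: unknown SLI`);
    }
    // A target of exactly 1 leaves no error budget, so it is rejected too.
    if (!(slo.target > 0 && slo.target < 1)) {
      problems.push(`${label}: target must be between 0 and 1 (exclusive)`);
    }
    if (!/^\d+d$/.test(slo.window)) {
      problems.push(`${label}: window must look like '30d'`);
    }
  }
  return problems;
}
```

A pipeline step could fail the build whenever the returned list is non-empty, so that a malformed SLO never reaches the configuration service.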
Step 2: Implement Error Budgets
An Error Budget is the maximum allowable deviation from the SLO. If an SLO is 99.9%, the error budget is 0.1%. When the budget is exhausted, the organization shifts from feature development to reliability work. This gives developers and operators a shared, measurable basis for trading release velocity against stability, which aligns their incentives.
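To make the percentage concrete, the budget can be converted into minutes of full downtime per window. This is a small arithmetic sketch; the function name `allowedDowntimeMinutes` is an assumption chosen for illustration.

```typescript
// Converts an SLO target into the minutes of full downtime allowed per window.
export function allowedDowntimeMinutes(target: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - target) * windowMinutes;
}

// Over a 30-day window:
//   99.9%  allows roughly 43.2 minutes
//   99.99% allows roughly 4.32 minutes
const threeNines = allowedDowntimeMinutes(0.999, 30);
const fourNines = allowedDowntimeMinutes(0.9999, 30);
```

Each additional nine shrinks the budget by a factor of ten, which is why stricter targets are so much more expensive to hold.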
Technical Implementation: Automate error budget tracking and policy enforcement.
// error-budget.ts
export class ErrorBudgetManager {
  private budgetConsumed: number = 0;
  private totalBudget: number;
  constructor(sloTarget: number, periodMs: number) {
    // Budget is (1 - target) * period, expressed in milliseconds of allowed failure
    this.totalBudget = (1 - sloTarget) * periodMs;
  }
  recordError(durationMs: number): void {
    this.budgetConsumed += durationMs;
  }
  getRemainingBudget(): number {
    return Math.max(0, this.totalBudget - this.budgetConsumed);
  }
  isBudgetExhausted(): boolean {
    return this.budgetConsumed >= this.totalBudget;
  }
  getBurnRate(): number {
    // Fraction of the total budget consumed so far (0 = untouched, 1 = exhausted).
    // Simplified for example; a true burn rate compares consumption against
    // elapsed time and requires a rolling window calculation in production.
    return this.budgetConsumed / this.totalBudget;
  }
}
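The `getBurnRate` method above returns the share of the budget already consumed. A burn rate in the stricter sense also accounts for how much of the window has elapsed. The following is a minimal sketch of that calculation; the function and parameter names are assumptions for illustration.

```typescript
// burn-rate.ts (illustrative sketch)
// Burn rate = fraction of budget consumed / fraction of window elapsed.
// A value of 1 means the budget lasts exactly until the window ends;
// a value above 1 means it will run out early at the current pace.
export function burnRate(
  budgetConsumedMs: number,
  totalBudgetMs: number,
  elapsedMs: number,
  windowMs: number
): number {
  if (totalBudgetMs <= 0 || windowMs <= 0 || elapsedMs <= 0) {
    throw new Error('totalBudgetMs, windowMs and elapsedMs must be positive');
  }
  const consumedFraction = budgetConsumedMs / totalBudgetMs;
  const elapsedFraction = Math.min(elapsedMs / windowMs, 1);
  return consumedFraction / elapsedFraction;
}
```

For example, consuming half of the budget in the first tenth of the window gives a burn rate of about 5, meaning the budget would be gone roughly five times faster than the SLO allows.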
// ci-pipeline-gate.ts
import { ErrorBudgetManager } from './error-budget';
export const checkDeploymentEligibility = (budgetManager: ErrorBudgetManager): boolean => {
  if (budgetManager.isBudgetExhausted()) {
    console.warn('Error budget exhausted. Blocking deployment. Focus on stability.');
    return false;
  }
  // Allow deployment with a warning once more than 80% of the budget is consumed
  if (budgetManager.getBurnRate() > 0.8) {
    console.warn('Over 80% of the error budget is consumed. Review changes carefully.');
  }
  return true;
};
Rationale: Integrating the error budget check into the CI/CD pipeline ensures that reliability decisions are automated. Developers receive immediate feedback. If the budget is exhausted, the pipeline blocks non-critical deployments, forcing the team to address technical debt and stability issues.
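The gate above blocks every deployment once the budget is exhausted, while the rationale speaks of blocking only non-critical ones. One possible way to express that distinction is sketched below; the `ChangeKind` categories and the `evaluateDeployment` function are hypothetical and would need to match your own release policy.

```typescript
// deployment-policy.ts (hypothetical sketch)
export type ChangeKind = 'feature' | 'reliability-fix' | 'security-fix';

export interface GateDecision {
  allowed: boolean;
  reason: string;
}

// budgetRemainingFraction: 1 = full budget left, 0 or less = exhausted.
export function evaluateDeployment(kind: ChangeKind, budgetRemainingFraction: number): GateDecision {
  if (budgetRemainingFraction > 0) {
    return { allowed: true, reason: 'error budget available' };
  }
  // Budget exhausted: only changes meant to restore reliability or security pass.
  if (kind === 'feature') {
    return { allowed: false, reason: 'error budget exhausted; feature deployments are paused' };
  }
  return { allowed: true, reason: `error budget exhausted; ${kind} permitted` };
}
```

Returning a reason alongside the decision lets the pipeline print why a deployment was stopped, which keeps the feedback to developers immediate.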
Step 3: Automate Toil
Toil is operational work that is manual, repetitive, automatable, tactical, and lacks enduring value. SRE practice caps toil at 50% of each engineer's time. When toil exceeds that cap, automation projects should be prioritized to bring it back down.
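Enforcing the 50% cap requires measuring toil first. The sketch below assumes engineers log hours with a toil flag; the `WorkEntry` shape and both function names are hypothetical, and the data would come from whatever time-tracking or ticketing source you already have.

```typescript
// toil-tracking.ts (hypothetical sketch)
export interface WorkEntry {
  engineer: string;
  hours: number;
  isToil: boolean;
}

// Returns each engineer's toil share (0..1) of their logged hours.
export function toilShareByEngineer(entries: WorkEntry[]): Record<string, number> {
  const totals: Record<string, { toil: number; all: number }> = {};
  for (const entry of entries) {
    const t = totals[entry.engineer] || (totals[entry.engineer] = { toil: 0, all: 0 });
    t.all += entry.hours;
    if (entry.isToil) t.toil += entry.hours;
  }
  const shares: Record<string, number> = {};
  for (const engineer of Object.keys(totals)) {
    const t = totals[engineer];
    shares[engineer] = t.all > 0 ? t.toil / t.all : 0;
  }
  return shares;
}

// Lists engineers whose toil share exceeds the cap (50% by default).
export function engineersOverToilCap(entries: WorkEntry[], cap = 0.5): string[] {
  const shares = toilShareByEngineer(entries);
  return Object.keys(shares).filter(engineer => shares[engineer] > cap);
}
```

Anyone returned by `engineersOverToilCap` is a signal that the tasks consuming their time are candidates for the automation work described next.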
Technical Implementation: Use runbooks and automation scripts to eliminate repetitive tasks.
// toil-automation.ts
import { exec } from 'child_process';
import { promisify } from 'util';
const execAsync = promisify(exec);
export interface ToilTask {
  id: string;
  description: string;
  frequency: 'hourly' | 'daily' | 'weekly';
  automationScript: string; // runs through a shell; register only trusted, reviewed scripts
}
export class ToilReducer {
  private tasks: ToilTask[] = [];
  registerTask(task: ToilTask): void {
    this.tasks.push(task);
  }
  async executeAutomation(taskId: string): Promise<void> {
    const task = this.tasks.find(t => t.id === taskId);
    if (!task) throw new Error(`Task not found: ${taskId}`);
    console.log(`Executing automation for: ${task.description}`);
    try {
      const { stdout, stderr } = await execAsync(task.automationScript);
      // Output on stderr is not a failure by itself; a non-zero exit code
      // rejects the promise and is handled in the catch block below
      if (stderr) console.error(`Automation warning: ${stderr}`);
      console.log(`Automation completed: ${stdout}`);
    } catch (error) {
      // Fallback to alerting human if automation fails
      console.error(`Automation failed for ${taskId}. Alerting on-call.`);
      await this.alertOnCall(taskId, error);
    }
  }
  private async alertOnCall(taskId: string, error: unknown): Promise<void> {
    // Integration with PagerDuty/OpsGenie
    // payload: { task_id: taskId, error: error }
  }
}
Architecture Decision: Automation scripts should be stored in the same repository as the service code. This ensures that automation evolves with the system and is subject to code review. Failures in automation should trigger alerts, not silent degradation.
Pitfall Guide
- Treating SRE as a Separate Silo: Creating an "SRE Team" that acts as a gatekeeper between developers and production recreates the Dev vs. Ops conflict. SRE is a discipline, not a role. Developers must own reliability. Best practice: Embed SRE principles into development teams; SRE engineers act as coaches and tool builders.
- Setting 100% SLOs: Aiming for 100% reliability is impossible and economically unviable. It leads to paralysis where no changes can be deployed. Best practice: Define SLOs based on user tolerance. 99.9% is sufficient for most services; reserve 99.99% for payment processing or core authentication.
- Alerting on Causes Instead of Symptoms: Alerting on high CPU usage or memory consumption leads to alert fatigue. Users care about service degradation, not resource metrics. Best practice: Alert on SLO violations. If latency is high but resources look fine, the alert still fires; if CPU is high but users are unaffected, nobody is paged. This ensures every alert requires action.
- Ignoring Error Budget Exhaustion: Continuing to deploy features after the error budget is exhausted defeats the purpose of the model. It signals that reliability is optional. Best practice: Enforce budget policies in CI/CD. If the budget is gone, the organization must pause feature work until reliability is restored.
- Confusing SLAs with SLOs: Service Level Agreements (SLAs) are contractual commitments with penalties. SLOs are internal targets. Basing SLOs on SLAs leaves no margin for error. Best practice: Set SLOs stricter than SLAs. If the SLA is 99.5%, the SLO should be 99.9% to provide a safety buffer.
- Blameful Post-Mortems: Focusing on "who broke it" discourages transparency and hides systemic issues. Best practice: Conduct blameless post-mortems. Focus on process failures and system design flaws. Ask "why" five times to uncover root causes without assigning personal blame.
- Tooling Obsession Over Culture: Investing in expensive observability platforms without changing the culture yields no results. Best practice: Prioritize cultural shifts. Implement blameless post-mortems, error budget policies, and toil reduction mandates before scaling tooling.
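The advice to alert on SLO violations rather than resource metrics can be sketched as a paging rule driven by error budget burn rate over two windows. This is an illustration under assumptions: the `BurnRateSample` shape and `shouldPage` are hypothetical names, and the default threshold of 14.4 is a commonly cited starting point (it corresponds to spending 2% of a 30-day budget in one hour), not a value taken from this article.

```typescript
// slo-alerting.ts (hypothetical sketch)
export interface BurnRateSample {
  longWindow: number;  // burn rate measured over a long window, e.g. 1h
  shortWindow: number; // burn rate measured over a short window, e.g. 5m
}

// Pages only when both windows exceed the threshold: the long window shows the
// problem is significant, the short window shows it is still happening.
export function shouldPage(sample: BurnRateSample, threshold = 14.4): boolean {
  return sample.longWindow > threshold && sample.shortWindow > threshold;
}
```

No CPU or memory metric appears in the condition. The rule reacts only to what users experience, and the short window keeps it from paging for an incident that has already recovered.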
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Early-Stage Startup | Light SRE: Focus on SLOs and basic monitoring. | Speed is critical; heavy process slows iteration. | Low |
| Enterprise Legacy Systems | SRE-as-Service with Strangler Fig Pattern. | Gradual migration reduces risk; centralized expertise needed. | Medium |
| Customer-Facing Critical Service | Full SRE: Error Budgets, Chaos Engineering, Automated Remediation. | Downtime directly impacts revenue and trust. | High |
| Internal Tooling | SLOs with relaxed targets; minimal automation. | Internal users tolerate higher latency; ROI on automation is low. | Low |
| High-Traffic Microservices | Automated SLO enforcement in CI/CD. | High complexity requires programmatic governance to prevent cascading failures. | Medium |
Configuration Template
Copy this TypeScript configuration to define SLOs and error budget policies for a service.
// sre-config.ts
import { SLO } from './slo-manager';
import { ErrorBudgetManager } from './error-budget';
export const SERVICE_SLOS: SLO[] = [
  {
    service: 'api-gateway',
    sli: 'http_request_success_rate',
    target: 0.999, // 99.9%
    window: '30d'
  },
  {
    service: 'api-gateway',
    sli: 'p99_latency',
    target: 0.95, // fraction of measurement intervals in which p99 latency must meet its threshold
    window: '30d'
  }
];
// Initialize budget managers for each SLO
export const budgetManagers = SERVICE_SLOS.map(slo => {
  const windowMs = parseWindowToMs(slo.window);
  return new ErrorBudgetManager(slo.target, windowMs);
});
// Parses a day-based window such as '30d' into milliseconds
function parseWindowToMs(window: string): number {
  const match = /^(\d+)d$/.exec(window);
  if (!match) throw new Error(`Unsupported SLO window: ${window}`);
  return parseInt(match[1], 10) * 24 * 60 * 60 * 1000;
}
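The template above only understands day-based windows. If your SLOs use other units, a slightly more general parser might look like the sketch below. The supported suffixes (`m`, `h`, `d`, `w`) and the name `parseDurationToMs` are assumptions for illustration.

```typescript
// duration.ts (hypothetical sketch)
const UNIT_MS: Record<string, number> = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
  w: 7 * 24 * 60 * 60 * 1000
};

// Parses strings such as '90m', '6h', '30d' or '4w' into milliseconds.
// Throws on anything else instead of silently returning NaN.
export function parseDurationToMs(spec: string): number {
  const match = /^(\d+)([mhdw])$/.exec(spec.trim());
  if (!match) {
    throw new Error(`Unsupported duration: "${spec}"`);
  }
  return parseInt(match[1], 10) * UNIT_MS[match[2]];
}
```

Failing loudly on an unrecognized window matters here, because a `NaN` budget would make every exhaustion check return false.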
Quick Start Guide
- Select a Pilot Service: Choose a non-critical service with existing metrics to pilot SRE practices.
- Define One SLO: Set a single availability SLO (e.g., 99.9%) based on user impact.
- Configure Metrics: Ensure Prometheus or equivalent collects the SLI data. Verify query accuracy.
- Deploy Error Budget Dashboard: Create a Grafana dashboard showing budget consumption and burn rate.
- Enforce Policy: Add a script to your CI pipeline that checks budget status and warns on high burn rates. Review results in the weekly engineering sync.
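For the dashboard step, the remaining budget can be derived from a ratio-style SLI directly in the query language. The sketch below builds such an expression as a string. It assumes the SLI query returns a success ratio between 0 and 1 evaluated over the full SLO window; `budgetRemainingQuery` is a hypothetical helper name.

```typescript
// budget-query.ts (hypothetical sketch)
// Builds a PromQL expression for the fraction of error budget remaining:
//   1 - (observed error ratio / allowed error ratio)
// A result of 1 means the budget is untouched; 0 or below means it is exhausted.
export function budgetRemainingQuery(sliQuery: string, target: number): string {
  if (!(target > 0 && target < 1)) {
    throw new Error('target must be between 0 and 1 (exclusive)');
  }
  return `1 - ((1 - (${sliQuery})) / (1 - ${target}))`;
}
```

The resulting string can be pasted into a dashboard panel. Verify it against known incidents before relying on it, since the outcome depends entirely on the accuracy of the underlying SLI query.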
Site Reliability Engineering transforms reliability from a reactive burden into a proactive, measurable asset. By implementing SLOs, error budgets, and automation, organizations achieve the dual goals of high velocity and high stability. The discipline requires cultural commitment, but the technical implementation provides immediate feedback loops that drive continuous improvement.