mal, reproducible container image. Use multi-stage builds to strip build dependencies. Enforce non-root execution, read-only filesystems where applicable, and explicit entrypoints. Tag images with commit SHA, never latest.
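The tagging rule above can be enforced mechanically, for example in a CI step that rejects mutable image references before they reach a manifest. The following is a minimal sketch under stated assumptions: the helper name `isImmutableImageRef` and the accepted tag shape (an optional `sha-` prefix followed by 7 to 40 hex characters) are illustrative choices, not part of any registry or Kubernetes API.

```typescript
// Hypothetical helper: the name and the accepted tag shape are assumptions for illustration.
const SHA_TAG = /^(sha-)?[0-9a-f]{7,40}$/;
const DIGEST_SUFFIX = /@sha256:[0-9a-f]{64}$/;

function isImmutableImageRef(imageRef: string): boolean {
  // Digest-pinned references (name@sha256:...) always resolve to the same content.
  if (DIGEST_SUFFIX.test(imageRef)) return true;

  // A tag is whatever follows a ':' that appears after the final '/',
  // so a registry port such as registry.internal:5000 is not mistaken for a tag.
  const lastSlash = imageRef.lastIndexOf('/');
  const lastColon = imageRef.lastIndexOf(':');
  if (lastColon <= lastSlash) return false; // no tag means an implicit "latest"

  return SHA_TAG.test(imageRef.slice(lastColon + 1));
}
```

A pipeline could call this for every `image:` field it renders and fail the build when it returns `false`.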
Step 3: Declarative Orchestration
Deploy to Kubernetes or an equivalent orchestrator. Define desired state in YAML manifests. Configure resource requests/limits, readiness/liveness probes, and pod disruption budgets. Use namespaces for environment isolation and RBAC for least-privilege access.
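One way to keep these settings from being forgotten is a small pre-merge check over the container specs in a manifest. The sketch below assumes the manifest has already been parsed into a plain object; `ContainerSpec` is a hand-written subset of the Kubernetes container fields (`resources.requests`, `resources.limits`, `livenessProbe`, `readinessProbe`), and `lintContainer` is a hypothetical helper, not an existing tool.

```typescript
// Hand-written subset of the Kubernetes container schema, for illustration only.
interface ContainerSpec {
  name: string;
  resources?: {
    requests?: Record<string, string>;
    limits?: Record<string, string>;
  };
  livenessProbe?: object;
  readinessProbe?: object;
}

// Returns a list of human-readable problems; an empty list means the spec passed.
function lintContainer(c: ContainerSpec): string[] {
  const problems: string[] = [];
  for (const key of ['cpu', 'memory']) {
    if (!c.resources?.requests?.[key]) {
      problems.push(`${c.name}: missing resources.requests.${key}`);
    }
  }
  // A memory limit bounds the blast radius of a leak; a CPU limit is left optional here.
  if (!c.resources?.limits?.memory) {
    problems.push(`${c.name}: missing resources.limits.memory`);
  }
  if (!c.readinessProbe) problems.push(`${c.name}: missing readinessProbe`);
  if (!c.livenessProbe) problems.push(`${c.name}: missing livenessProbe`);
  return problems;
}
```

Which fields are mandatory is a policy decision; treating the CPU limit as optional is an assumption of this sketch.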
Step 4: GitOps Delivery Pipeline
Treat infrastructure and application manifests as code. Store them in a version-controlled repository. Use a GitOps controller (ArgoCD, Flux) to reconcile cluster state against the repository. Implement progressive delivery (canary, blue/green) with automated rollback on SLO violation.
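The reconciliation idea can be illustrated with a toy diff between the state declared in Git and the state observed in the cluster. This is a conceptual sketch only and does not reflect how ArgoCD or Flux are implemented; `ResourceMap`, `ReconcilePlan`, and `planReconcile` are hypothetical names, and each resource is reduced to a key plus a content hash.

```typescript
// Resource key (for example "production/Deployment/order-processor") -> content hash.
type ResourceMap = Map<string, string>;

interface ReconcilePlan {
  create: string[]; // declared in Git, absent from the cluster
  update: string[]; // present in both, but the live hash has drifted
  prune: string[];  // live in the cluster, no longer declared in Git
}

function planReconcile(desired: ResourceMap, live: ResourceMap): ReconcilePlan {
  const plan: ReconcilePlan = { create: [], update: [], prune: [] };

  desired.forEach((hash, key) => {
    const current = live.get(key);
    if (current === undefined) plan.create.push(key);
    else if (current !== hash) plan.update.push(key);
  });

  live.forEach((_hash, key) => {
    if (!desired.has(key)) plan.prune.push(key);
  });

  return plan;
}
```

A controller would run a loop like this on a timer or on repository change and then apply the plan; a non-empty `update` or `prune` list is what drift detection reports.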
Step 5: Observability and Self-Healing
Instrument services at the code level. Emit structured logs, OpenTelemetry traces, and Prometheus metrics. Correlate telemetry using trace IDs. Configure horizontal pod autoscaling based on custom metrics. Define SLOs and alert on error budgets, not raw thresholds.
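To make the trace ID correlation concrete, the sketch below parses a W3C Trace Context `traceparent` header and merges the IDs into a log record. It assumes version `00` headers with exactly four fields; `parseTraceparent` and `withTrace` are hypothetical helpers, and in practice the OpenTelemetry SDK's propagators would normally do this work.

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex), all lowercase.
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT.exec(header.trim());
  if (!match) return null;

  const [, version, traceId, spanId, flags] = match;
  if (version === 'ff') return null; // reserved, invalid version
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero IDs are invalid

  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Attach trace identifiers to a structured log record so logs can be joined with traces.
function withTrace(
  fields: Record<string, unknown>,
  ctx: TraceContext | null
): Record<string, unknown> {
  return ctx ? { ...fields, trace_id: ctx.traceId, span_id: ctx.spanId } : fields;
}
```

The field names `trace_id` and `span_id` are an assumption; use whatever your log backend expects for correlation.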
Code Example: Cloud-Native Service Bootstrap (TypeScript)
import { createServer } from 'http';
import { performance } from 'perf_hooks';
import { createLogger, format, transports } from 'winston';
import { metrics } from '@opentelemetry/api';

// Structured logging aligned with cloud-native standards
const logger = createLogger({
  level: 'info',
  format: format.combine(format.timestamp(), format.errors({ stack: true }), format.json()),
  defaultMeta: { service: 'order-processor', version: process.env.APP_VERSION || '0.0.0' },
  transports: [new transports.Console()]
});

// OpenTelemetry metrics for SLO tracking. Instruments are created from a Meter;
// until an SDK MeterProvider is registered at startup they are no-ops.
const meter = metrics.getMeter('order-processor');
const requestCounter = meter.createCounter('http_requests_total', { description: 'Total HTTP requests' });
const requestDuration = meter.createHistogram('http_request_duration_seconds', { description: 'Request latency', unit: 's' });

let isShuttingDown = false;

const server = createServer((req, res) => {
  const start = performance.now();
  // Drop the query string so the path label stays low-cardinality
  const path = (req.url || '/').split('?')[0];
  const labels = { method: req.method || 'UNKNOWN', path };
  requestCounter.add(1, labels);

  // Record latency once the response is flushed, for every route including probes
  res.on('finish', () => {
    requestDuration.record((performance.now() - start) / 1000, labels);
  });

  // Liveness: the process is up and able to answer
  if (path === '/healthz' && req.method === 'GET') {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify({ status: 'healthy', uptime: process.uptime() }));
  }

  // Readiness: stop receiving traffic while draining or under heap pressure
  if (path === '/readyz' && req.method === 'GET') {
    const ready = !isShuttingDown && process.memoryUsage().heapUsed < 500 * 1024 * 1024;
    res.writeHead(ready ? 200 : 503, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify({ status: ready ? 'ready' : 'not_ready' }));
  }

  // Business logic placeholder
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ message: 'processed' }));
});

// Graceful shutdown aligned with K8s terminationGracePeriodSeconds
const shutdown = (signal: string) => {
  if (isShuttingDown) return; // ignore repeated signals
  logger.info(`Received ${signal}. Initiating graceful shutdown.`);
  isShuttingDown = true;

  // Stop accepting new connections and wait for in-flight requests to finish
  server.close(() => {
    logger.info('HTTP server closed. Exiting process.');
    process.exit(0);
  });

  // Force exit before the pod's grace period expires; unref() keeps this
  // timer from holding the event loop open after a clean close
  setTimeout(() => {
    logger.error('Forced shutdown after timeout.');
    process.exit(1);
  }, 15000).unref();
};

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

const PORT = parseInt(process.env.PORT || '3000', 10);
server.listen(PORT, () => {
  logger.info(`Service listening on port ${PORT}`);
});
Architecture Decisions and Rationale
- Kubernetes over proprietary orchestrators: Standardized API, vendor-neutral runtime, mature GitOps ecosystem, and widespread operator pattern support. Proprietary platforms lock teams into specific scaling models and observability stacks.
- Sidecar pattern for observability: Decouples instrumentation from business logic. Enables language-agnostic telemetry collection, centralized log routing, and independent versioning of monitoring agents.
- GitOps over push-based CI/CD: Pull-based reconciliation ensures drift detection, audit trails, and declarative state management. Push pipelines lack cluster-state verification and encourage configuration sprawl.
- SLO-driven alerting over threshold-based: Alerts on error budget consumption reduce alert fatigue and align operations with user experience. Raw CPU/memory thresholds trigger on noise, not impact.
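The difference between budget-based and threshold-based alerting is easiest to see as arithmetic. With an SLO target of 99.9%, the error budget is 0.1% of requests; the burn rate is the observed error rate divided by that budget, so a burn rate of 1 exhausts the budget exactly at the end of the SLO period. The sketch below is illustrative: `burnRate` and `shouldPage` are hypothetical helpers, and the default threshold of 14.4 (2% of a 30-day budget consumed in one hour) is an assumed value that should be tuned per service.

```typescript
interface WindowCounts {
  total: number;  // requests observed in the window
  errors: number; // requests that violated the SLI in the window
}

// Burn rate = observed error rate / allowed error rate (the error budget).
function burnRate(w: WindowCounts, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  if (errorBudget <= 0) throw new Error('sloTarget must be below 1');
  if (w.total === 0) return 0; // no traffic, nothing burned
  return w.errors / w.total / errorBudget;
}

// Page only when both a long and a short window burn fast: the long window
// shows the problem is significant, the short window shows it is still happening.
function shouldPage(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  sloTarget: number,
  threshold = 14.4
): boolean {
  return (
    burnRate(longWindow, sloTarget) >= threshold &&
    burnRate(shortWindow, sloTarget) >= threshold
  );
}
```

For example, 50 errors in 10,000 requests against a 99.9% target is an error rate of 0.5%, which is five times the budget: sustained, it would use up a 30-day budget in about six days, yet it stays below the paging threshold assumed here.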
Pitfall Guide
- Treating Pods as VMs: Pods are ephemeral by design. Assuming stable IPs, persistent local storage, or long-running state inside containers breaks horizontal scaling and automated recovery. Best practice: externalize state to managed databases or object storage. Use emptyDir for temporary scratch space only.
- Hardcoding Configuration: Embedding environment-specific values in application code violates the 12-factor principle and prevents immutable deployments. Best practice: inject configuration via environment variables, ConfigMaps, or Secrets. Validate schemas at startup and fail fast on misconfiguration.
- Omitting Resource Requests and Limits: Unbounded containers consume cluster resources, trigger OOMKilled events, and cause noisy-neighbor degradation. Best practice: define requests for baseline scheduling and limits for protection. Profile workloads under load to set accurate values. Use Vertical Pod Autoscaler for initial tuning.
- Deploying Service Mesh Prematurely: Adding Istio or Linkerd before services communicate over HTTP/gRPC with clear retry/circuit-breaker patterns introduces latency, complexity, and debugging overhead. Best practice: implement resilience at the application layer first. Introduce a sidecar mesh only when cross-service traffic management, mTLS, or advanced routing is required.
- Ignoring Stateful Workload Patterns: Deploying databases, message brokers, or caches as standard Deployments causes data loss during pod rescheduling. Best practice: use StatefulSets with stable network identities, persistent volume claims, and headless services. Implement backup/restore procedures outside the orchestrator.
- Monolithic CI/CD Pipelines: Single pipelines that build, test, and deploy all services together create bottlenecks and prevent independent scaling. Best practice: decompose pipelines per service. Use artifact registries for versioned binaries. Promote through environment-specific GitOps repositories.
- Bolt-On Observability: Adding logging agents or metric scrapers after deployment results in missing trace context, inconsistent labels, and blind spots in async flows. Best practice: instrument at the code level. Propagate trace IDs across HTTP/gRPC and message queues. Correlate logs, metrics, and traces using a unified backend.
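The "validate at startup and fail fast" advice from the configuration pitfall can be sketched without any schema library. The example assumes three settings, `PORT`, `DATABASE_URL`, and `LOG_LEVEL`; the last two names, the `AppConfig` shape, and `loadConfig` itself are hypothetical and exist only for illustration.

```typescript
const LOG_LEVELS = ['debug', 'info', 'warn', 'error'] as const;
type LogLevel = (typeof LOG_LEVELS)[number];

interface AppConfig {
  port: number;
  databaseUrl: string;
  logLevel: LogLevel;
}

// Reads every setting, collects all problems, and throws once so a
// misconfigured pod reports everything that is wrong in a single crash log.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const errors: string[] = [];

  const port = Number(env.PORT ?? '3000');
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    errors.push(`PORT must be an integer between 1 and 65535, got "${env.PORT}"`);
  }

  const databaseUrl = env.DATABASE_URL ?? '';
  if (databaseUrl === '') errors.push('DATABASE_URL is required');

  const rawLevel = env.LOG_LEVEL ?? 'info';
  const logLevel = LOG_LEVELS.find((level) => level === rawLevel);
  if (logLevel === undefined) {
    errors.push(`LOG_LEVEL must be one of ${LOG_LEVELS.join(', ')}, got "${rawLevel}"`);
  }

  if (errors.length > 0 || logLevel === undefined) {
    throw new Error(`Invalid configuration: ${errors.join('; ')}`);
  }
  return { port, databaseUrl, logLevel };
}
```

Calling `loadConfig(process.env)` before `server.listen` means a bad ConfigMap surfaces as a crash during rollout, where the rolling update halts, rather than as a runtime failure after traffic has shifted.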
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Startup MVP | Serverless + managed DB + simple CI/CD | Minimizes operational overhead, accelerates time-to-market | Low initial, scales linearly with usage |
| Regulated Enterprise | Kubernetes + GitOps + private registry + audit logging | Enforces compliance, drift control, and reproducible deployments | High upfront, predictable long-term |
| High-Scale SaaS | K8s + service mesh + autoscaling + SLO-driven ops | Handles traffic volatility, enables granular scaling, reduces MTTR | Moderate infrastructure, high engineering investment |
| Legacy Modernization | Strangler pattern + containerized wrappers + gradual decommission | Reduces risk, maintains uptime, allows incremental refactoring | Medium, offset by decommissioned licensing costs |
Configuration Template
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
  namespace: production
  labels:
    app: order-processor
    version: v1.4.2
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-processor
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: order-processor
        version: v1.4.2
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: order-processor
          image: registry.internal/order-processor:sha-abc1234
          ports:
            - containerPort: 3000
          envFrom:
            - configMapRef:
                name: order-processor-config
            - secretRef:
                name: order-processor-secrets
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 3000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 2
---
apiVersion: v1
kind: Service
metadata:
  name: order-processor-svc
  namespace: production
spec:
  selector:
    app: order-processor
  ports:
    - port: 80
      targetPort: 3000
      protocol: TCP
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-processor-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-processor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
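To reason about how the HorizontalPodAutoscaler in this template will behave, it helps to work through the scaling formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric ÷ target metric), clamped to the configured bounds. The sketch below is a simplified model; `desiredReplicas` is a hypothetical helper, the 10% tolerance is the documented default, and the real controller additionally handles unready pods, missing metrics, and stabilization windows, which are omitted here.

```typescript
function desiredReplicas(
  currentReplicas: number,
  currentUtilization: number, // average CPU utilization across pods, in percent of requests
  targetUtilization: number,  // averageUtilization from the HPA spec
  minReplicas: number,
  maxReplicas: number,
  tolerance = 0.1
): number {
  const clamp = (n: number) => Math.min(maxReplicas, Math.max(minReplicas, n));
  const ratio = currentUtilization / targetUtilization;

  // Within the tolerance band the controller leaves the replica count alone.
  if (Math.abs(ratio - 1) <= tolerance) return clamp(currentReplicas);

  return clamp(Math.ceil(currentReplicas * ratio));
}
```

With the values above (target 70%, 3 to 10 replicas), three pods averaging 140% of their CPU request scale to six, while three pods at 73% stay at three because the ratio falls inside the tolerance band.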
Quick Start Guide
- Initialize a local cluster: Run `kind create cluster --name cloud-native-lab` to provision a lightweight Kubernetes environment on your workstation.
- Build and load the image: Execute `docker build -t order-processor:local . && kind load docker-image order-processor:local --name cloud-native-lab` to compile the TypeScript service and copy the image onto the cluster's nodes. kind has no built-in registry, so the image is loaded directly into each node's container runtime.
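- Deploy manifests: Apply the configuration with `kubectl apply -f deployment.yaml`. The template assumes that the production namespace, the order-processor-config ConfigMap, and the order-processor-secrets Secret already exist, so create them first (for example, `kubectl create namespace production`) and change the image field to order-processor:local for a local run. Verify pod status using `kubectl get pods -n production -w` until all reach Running.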
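- Validate readiness and autoscaling: Forward the service port with `kubectl port-forward svc/order-processor-svc 8080:80 -n production`, then in a second terminal run `curl -i http://localhost:8080/readyz` and verify it returns 200. Generate sustained load with `kubectl run load-test -n production --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://order-processor-svc/; done"` and observe HPA scaling via `kubectl get hpa -n production -w`. The HPA needs CPU metrics from metrics-server, which kind does not install by default.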