Observability
Three Pillars of Observability
Logs: Timestamped records of discrete events in the system, either structured (JSON with fields) or unstructured (plain text). Centralized aggregation (ELK, Loki, CloudWatch Logs) for search and analysis. Used to debug issues, build audit trails, and support security investigations. High volume requires efficient storage and querying. Log levels (DEBUG, INFO, WARN, ERROR) control verbosity.
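A minimal sketch of structured JSON logging using Python's standard library; the logger name and fields are illustrative:

    import json, logging, sys

    class JsonFormatter(logging.Formatter):
        # Emit each record as one JSON object so log platforms can index the fields.
        def format(self, record):
            return json.dumps({
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.getLogger("checkout").info("payment accepted")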
Metrics: Numerical measurements over time, stored as time-series data. Examples: CPU utilization, request rate, error rate, response time. Aggregated and sampled for efficiency. Prometheus, CloudWatch, and Datadog for collection and storage. Dashboards for visualization and alerting on thresholds. Lower storage cost than logs.
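A minimal sketch of exposing metrics with the Prometheus Python client (prometheus_client); the metric names, labels, and buckets are illustrative:

    import random, time
    from prometheus_client import Counter, Histogram, start_http_server

    # Counter covers request/error rate; histogram supports latency percentiles.
    REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
    LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                        buckets=(0.05, 0.1, 0.2, 0.5, 1, 2))

    start_http_server(8000)  # serves /metrics for Prometheus to scrape

    while True:
        duration = random.uniform(0.01, 0.5)  # stand-in for real request handling
        LATENCY.observe(duration)
        REQUESTS.labels(status="200").inc()
        time.sleep(duration)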
Traces: End-to-end view of a request's flow through a distributed system. Spans represent operations with parent-child relationships; correlation IDs link related spans. Identifies bottlenecks and failures across microservices. Distributed tracing tools: Jaeger, X-Ray, Zipkin. OpenTelemetry for standardization. Critical for understanding microservice interactions.
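A minimal OpenTelemetry sketch showing parent-child spans exported to the console; the service and span names are illustrative:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("order-service")

    # The child span is linked to the parent via the active context,
    # forming the parent-child relationships described above.
    with tracer.start_as_current_span("handle_order") as parent:
        parent.set_attribute("order.id", "1234")
        with tracer.start_as_current_span("charge_payment"):
            pass  # call to the payment service would go here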
Golden Signals (Google SRE)
Latency: Time to service a request - how long does it take? Measure P50, P95, P99 percentiles, not just the average. Distinguish successful from failed request latency. High latency indicates performance issues or resource constraints. Set SLO targets (e.g., P95 < 200ms). Critical user experience metric.
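A quick sketch of why percentiles matter more than the average (the latency values are made up):

    # One slow outlier barely moves P50 but dominates the average and P95.
    latencies_ms = sorted([12, 15, 18, 20, 22, 25, 30, 35, 40, 950])

    def percentile(sorted_values, pct):
        # Nearest-rank percentile; adequate for a rough illustration.
        index = int(round(pct / 100 * (len(sorted_values) - 1)))
        return sorted_values[index]

    print("avg:", sum(latencies_ms) / len(latencies_ms))  # ~116.7 ms, skewed by the outlier
    print("P50:", percentile(latencies_ms, 50))           # 22 ms
    print("P95:", percentile(latencies_ms, 95))           # 950 ms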
Traffic: Demand on the system - how much activity? Requests per second, transactions per second, concurrent users, bandwidth usage. Understand capacity and growth trends. Correlate with other signals for context. Traffic spikes may be legitimate or attacks. Used for capacity planning and scaling decisions.
Errors: Rate of failed requests - what percentage fails? HTTP 5xx responses, exceptions, failed transactions. Distinguish user errors (4xx) from system errors (5xx). Error budget consumption for SLO tracking. Alert on error-rate increases and investigate error patterns and root causes. Critical for reliability measurement.
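A sketch of an error-rate calculation that counts only server-side (5xx) failures; the sample status codes are made up:

    status_codes = [200, 200, 404, 200, 500, 200, 503, 200, 200, 400]

    total = len(status_codes)
    server_errors = sum(1 for code in status_codes if 500 <= code < 600)  # system errors
    client_errors = sum(1 for code in status_codes if 400 <= code < 500)  # user errors, excluded

    error_rate = server_errors / total
    print(f"error rate: {error_rate:.1%}")  # 20.0% here; compare against the SLO threshold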
Saturation: System resource utilization - how full is the system? CPU, memory, disk, network usage; queue depth and wait times. Constrained resources cause performance degradation, so scale proactively before saturation. Different services have different constraints. Monitor I/O, connection pools, thread pools.
Monitoring Types
Infrastructure monitoring: Monitor hosts, VMs, containers, and networks. CPU, memory, disk, and network metrics; OS-level health. Agent-based (CloudWatch Agent, Datadog Agent) or agentless collection. Infrastructure is the foundation for application performance. Integrates with auto-scaling and remediation. Tools: CloudWatch, Azure Monitor, Prometheus.
Application performance monitoring (APM): Deep visibility into application behavior. Transaction tracing, code-level profiling, error tracking. Distributed tracing across microservices, database query performance, external service calls. Root cause analysis. Language-specific agents (Java, .NET, Node, Python). Tools: New Relic, Datadog APM, AppDynamics, Dynatrace, X-Ray.
Synthetic monitoring: Proactive testing that simulates user actions. Scripted transactions run continuously from multiple locations to detect issues before users do. API endpoint monitoring and multi-step user flows. Measures availability and performance. Complements real user monitoring. Tools: Datadog Synthetics, New Relic Synthetics, Pingdom, uptime monitors.
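A minimal synthetic check sketch using the requests library; the URL, timeout, and latency budget are placeholders:

    import time
    import requests

    def synthetic_check(url, timeout_s=5, latency_budget_s=1.0):
        # Simulate a user hitting the endpoint; verify both availability and speed.
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout_s)
            elapsed = time.monotonic() - start
            healthy = response.status_code == 200 and elapsed < latency_budget_s
            return healthy, response.status_code, elapsed
        except requests.RequestException:
            return False, None, time.monotonic() - start

    # Run this on a schedule from multiple regions and alert when it fails.
    print(synthetic_check("https://example.com/health"))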
Real user monitoring (RUM): Captures actual user experience data from browsers or mobile apps. Page load times, JavaScript errors, user interactions. Geographic performance variations; device and browser breakdowns. Core Web Vitals (LCP, FID, CLS). Session replays for debugging. Tools: Datadog RUM, New Relic Browser, Google Analytics, LogRocket.
Database monitoring: Monitors database performance, queries, and connections. Slow query identification and optimization, connection pool utilization, replication lag, lock contention and deadlocks, storage and IOPS usage, query execution plans. Tools: CloudWatch RDS metrics, Azure SQL Insights, Query Performance Insight, Datadog Database Monitoring, SolarWinds.
Log management: Centralized log aggregation and analysis. Real-time log search and filtering, pattern detection, and anomaly identification. Log-based metrics and alerts; correlation with other telemetry. Compliance and audit trails. Structured logging for machine parsing. Tools: ELK Stack, Loki, Splunk, Sumo Logic, CloudWatch Logs Insights.
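A small sketch of deriving a log-based metric from structured (JSON) log lines; the sample lines are made up:

    import json
    from collections import Counter

    raw_lines = [
        '{"level": "INFO", "message": "request ok"}',
        '{"level": "ERROR", "message": "db timeout"}',
        '{"level": "ERROR", "message": "db timeout"}',
    ]

    # Counting occurrences per level is the same idea log platforms use
    # for log-based metrics and alerts.
    counts = Counter(json.loads(line)["level"] for line in raw_lines)
    print(counts)  # Counter({'ERROR': 2, 'INFO': 1})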
SLI, SLO, SLA
SLI (Service Level Indicator): A quantitative measure of service level. Examples: availability percentage, P99 latency, error rate, throughput. Must be measurable and meaningful to users. Multiple SLIs per service cover different aspects. Foundation for SLOs. Good SLIs align with user happiness; avoid vanity metrics and focus on user impact.
SLO (Service Level Objective): A target value or range for an SLI. Examples: 99.9% availability, P95 latency < 200ms, < 0.1% error rate. An internal goal, more stringent than the SLA, and the basis for the error budget. Set based on user expectations and business requirements. Multiple SLOs are possible; review and adjust periodically, balancing ambition with achievability.
SLA (Service Level Agreement): A contract with users/customers specifying service commitments, with legal or financial consequences for breaches. Should be less aggressive than internal SLOs to provide a buffer. Includes the measurement method, time period, and consequences. An external promise, difficult to change once published. SLA violations result in credits or penalties.
Error budget: The acceptable amount of unreliability - the inverse of the SLO. A 99.9% SLO means a 0.1% error budget, or roughly 43.8 minutes of downtime allowed per month. Consumed by outages, degraded performance, and errors. When exhausted, freeze feature releases and focus on reliability. Balances innovation (new features) with reliability. Shared ownership between dev and ops.
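The downtime arithmetic behind the error budget, as a quick sketch:

    # Error budget = 1 - SLO target, converted to allowed downtime per window.
    slo = 0.999                          # 99.9% availability objective
    minutes_per_month = 30.44 * 24 * 60  # average month length in minutes
    budget_minutes = (1 - slo) * minutes_per_month
    print(f"allowed downtime: {budget_minutes:.1f} min/month")  # ~43.8 min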
Error budget policy: Defines the consequences when the error budget is exhausted - for example, freeze releases, redirect the team to reliability work, defer new features. Review during incident retrospectives. Automated alerts on budget burn rate. Encourages proactive reliability work and enables data-driven discussions between product and engineering. Part of SRE practice.
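A sketch of the burn-rate idea behind such alerts; the window and paging threshold are illustrative:

    def burn_rate(observed_error_rate, slo=0.999):
        # Burn rate 1.0 means consuming the budget exactly as fast as the SLO allows.
        allowed_error_rate = 1 - slo
        return observed_error_rate / allowed_error_rate

    # e.g. 2% errors over the last hour against a 99.9% SLO burns the budget 20x
    # faster than sustainable; a fast-burn threshold around 14x is commonly cited.
    if burn_rate(observed_error_rate=0.02) > 14:
        print("page on-call: error budget burning too fast")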
Observability Best Practices
Avoid alert fatigue: Too many alerts lead to ignored or missed critical alerts. Set alert thresholds based on actionability - only alert when human action is needed. Use severity levels (P1 critical, P2 high, P3 medium). Group related alerts. Alert on symptoms (user impact), not causes. Auto-remediate known issues. Practice regular alert hygiene - disable noisy alerts.
On-call: Structured on-call rotations with primary and secondary responders. Runbooks for common issues; escalation procedures and clear ownership. Keep the on-call burden reasonable (not every night) and provide compensation or time off for on-call work. Post-incident reviews for learning; on-call training and shadowing. Tools: PagerDuty, Opsgenie, VictorOps.
Runbooks: Step-by-step procedures for common operational tasks and incident response. Include troubleshooting steps, commands to run, and escalation paths. Living documents updated after incidents. Reduce MTTR with clear procedures and enable less experienced engineers to resolve issues. Automate runbook steps where possible. Store with code in version control.
Dashboards: Multiple dashboard types - high-level overviews for executives, detailed drill-downs for engineers, service-specific views for teams. Use meaningful visualizations (time series for trends, gauges for current state). Avoid dashboard proliferation - consolidate where possible. Link dashboards to alerts. Include annotations for deployments and incidents. Tools: Grafana, Kibana, Datadog.
Correlation: Connect the three pillars for a complete picture. Put the trace ID in logs to jump from metrics to a specific request trace; link metrics on a dashboard to log searches. Unify telemetry with consistent labels/tags. OpenTelemetry for correlated collection. Faster troubleshooting when all context is visible; a single pane of glass reduces tool switching.
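A sketch of stamping the active OpenTelemetry trace ID onto log lines so a log entry can be joined to its trace (assumes a tracer provider is already configured; OpenTelemetry's logging instrumentation can automate this injection):

    import logging
    from opentelemetry import trace

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s trace_id=%(trace_id)s %(message)s")
    logger = logging.getLogger("orders")

    def log_with_trace(message):
        # Attach the current trace ID so the log line is searchable by trace.
        ctx = trace.get_current_span().get_span_context()
        logger.info(message, extra={"trace_id": format(ctx.trace_id, "032x")})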
Distributed tracing: Essential for understanding request flow across microservices. Propagate trace context (trace ID, span ID) through calls. Sample traces to manage volume (e.g., 100% for errors, 1% for successes). A service mesh can auto-instrument. Identify slow services and bottlenecks. Tools: Jaeger, Zipkin, X-Ray, Datadog APM, OpenTelemetry.
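A sketch of probabilistic, parent-based sampling with the OpenTelemetry SDK; the 1% ratio mirrors the guidance above, while always keeping error traces typically needs tail-based sampling in a collector (not shown):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Sample ~1% of new traces; child spans follow the parent's decision,
    # so a trace is kept or dropped as a whole across services.
    sampler = ParentBased(TraceIdRatioBased(0.01))
    trace.set_tracer_provider(TracerProvider(sampler=sampler))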
