Observability

Three Pillars of Observability

Logs

Timestamped records of discrete events in a system. May be structured (JSON with fields) or unstructured (plain text). Centralized aggregation (ELK, Loki, CloudWatch Logs) enables search and analysis. Used to debug issues, build audit trails, and support security investigations. High volume requires efficient storage and querying. Log levels (DEBUG, INFO, WARN, ERROR) control verbosity.

Similar Technologies
Event Streams, Audit Trails, System Logs, Application Logs, Structured Logging
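As a sketch of structured logging using only the Python standard library (the `JsonFormatter` class and the `checkout` logger name are illustrative, not from any particular framework):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for machine parsing."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")  # emits one JSON object per event
```

Each event becomes one JSON line, so an aggregator such as Loki or CloudWatch Logs can filter on `level` or any other field without regex parsing.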
Metrics

Numerical measurements over time stored as time-series data. Examples: CPU utilization, request rate, error rate, response time. Aggregated and sampled for efficiency. Prometheus, CloudWatch, Datadog for collection and storage. Dashboards for visualization. Alerting on thresholds. Lower storage cost than logs.

Similar Technologies
Gauges, Counters, Histograms, Time-Series Data, Performance Metrics
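A minimal sketch of two common metric types in plain Python rather than a real client library (Prometheus clients use the same bucket idea, but these class names are invented here):

```python
import bisect

class Counter:
    """Monotonic counter, e.g. total requests served."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

class Histogram:
    """Latency histogram with less-than-or-equal bucket bounds in seconds."""
    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0)):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # final slot catches overflow (+Inf)
        self.sum = 0.0

    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.sum += value

requests_total = Counter()
request_latency = Histogram()
requests_total.inc()
request_latency.observe(0.12)  # 120 ms falls in the <= 0.25 s bucket
```

Only the bucket counts and a running sum are stored per series, which is why metrics cost far less to retain than raw logs.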
Traces

End-to-end view of request flow through distributed system. Spans representing operations with parent-child relationships. Correlation IDs linking related spans. Identifies bottlenecks and failures in microservices. Distributed tracing (Jaeger, X-Ray, Zipkin). Critical for understanding microservice interactions. OpenTelemetry for standardization.

Similar Technologies
Distributed Tracing, APM Traces, Request Correlation, Call Graphs, Service Maps
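The span and parent-child structure can be sketched in a few lines (a toy model; real tracers such as OpenTelemetry add context propagation across process boundaries and export to a backend):

```python
import time
import uuid

class Span:
    """One traced operation; children share the trace_id and point at their parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.trace_id = parent.trace_id if parent else uuid.uuid4().hex
        self.span_id = uuid.uuid4().hex[:16]
        self.parent_id = parent.span_id if parent else None
        self.start = time.time()
        self.end = None

    def finish(self):
        self.end = time.time()

root = Span("GET /checkout")             # entry point of the request
db = Span("SELECT orders", parent=root)  # child span, same trace_id
db.finish()
root.finish()
```

Because `db` inherits `root`'s `trace_id`, a tracing backend can reassemble the whole request tree from spans emitted by different services.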
Golden Signals (Google SRE)

Latency

Time to service a request - how long does it take? Measure P50, P95, P99 percentiles (not just the average). Distinguish successful from failed request latency. High latency indicates performance issues or resource constraints. Set SLO targets (e.g., P95 < 200ms). Critical user-experience metric.

Similar Technologies
Response Time, Request Duration, Processing Time, Time to First Byte, End-to-End Latency
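Why percentiles beat averages shows up in a nearest-rank sketch (the latency samples are invented):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample at or above p percent of the data."""
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[max(rank - 1, 0)]

# Invented latency samples (ms); one slow outlier dominates the tail.
latencies_ms = [12, 15, 20, 22, 25, 30, 45, 60, 120, 800]
p50 = percentile(latencies_ms, 50)  # 25 ms: the typical request
p95 = percentile(latencies_ms, 95)  # 800 ms: the tail the 114.9 ms mean hides
```

The mean of these samples is 114.9 ms, which describes no actual request; P50 and P95 describe the typical case and the tail directly.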
Traffic

Demand on the system - how much activity? Requests per second, transactions per second, concurrent users, bandwidth usage. Understand capacity and growth trends. Correlate with other signals for context. Traffic spikes may be legitimate or attacks. Used for capacity planning and scaling decisions.

Similar Technologies
Request Rate, Throughput, Volume, Load, Concurrent Users
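Requests per second over a sliding window can be sketched as follows (`RateCounter` is a hypothetical helper, not a real library API):

```python
import collections
import time

class RateCounter:
    """Events per second over a sliding time window."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = collections.deque()

    def record(self, now=None):
        self.events.append(time.time() if now is None else now)

    def rate(self, now=None):
        now = time.time() if now is None else now
        while self.events and self.events[0] < now - self.window:
            self.events.popleft()  # drop events older than the window
        return len(self.events) / self.window

rps = RateCounter(window_seconds=10)
```

Real metric systems avoid keeping every timestamp by using fixed-interval counters, but the window semantics are the same.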
Errors

Rate of failed requests - what percentage fails? HTTP 5xx, exceptions, failed transactions. Distinguish user errors (4xx) from system errors (5xx). Error budget consumption for SLO tracking. Alert on error rate increase. Investigate error patterns and root causes. Critical for reliability measurement.

Similar Technologies
Failure Rate, Exception Rate, HTTP Errors, Transaction Failures, Error Count
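Separating 4xx from 5xx is a small computation worth getting right, since only server errors usually count against reliability targets (the status codes below are illustrative):

```python
def error_rates(status_codes):
    """Split failures into client (4xx) and server (5xx) rates."""
    total = len(status_codes)
    client = sum(1 for s in status_codes if 400 <= s < 500)
    server = sum(1 for s in status_codes if s >= 500)
    return {"client_4xx": client / total, "server_5xx": server / total}

codes = [200, 200, 404, 200, 500, 200, 200, 503, 200, 200]
rates = error_rates(codes)  # 4xx: 0.1, 5xx: 0.2
```

A user mistyping a URL (404) is not a reliability failure; a 503 from an overloaded backend is.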
Saturation

System resource utilization - how full is the system? CPU, memory, disk, network usage. Queue depth and wait times. Constrained resources cause performance degradation. Preventive scaling before saturation. Different services have different constraints. Monitor I/O, connection pools, thread pools.

Similar Technologies
Resource Utilization, Capacity, Queue Depth, Bottleneck Analysis, Resource Pressure
Monitoring Types

Infrastructure Monitoring

Monitor hosts, VMs, containers, networks. CPU, memory, disk, network metrics. OS-level metrics and health. Agent-based (CloudWatch Agent, Datadog Agent) or agentless collection. Infrastructure as foundation for application performance. Integration with auto-scaling and remediation. Tools: CloudWatch, Azure Monitor, Prometheus.

Similar Technologies
Host Monitoring, Server Monitoring, Cloud Monitoring, System Monitoring, Resource Monitoring
Application Performance Monitoring (APM)

Deep visibility into application behavior. Transaction tracing, code-level profiling, error tracking. Distributed tracing across microservices. Database query performance. External service calls. Root cause analysis. Language-specific agents (Java, .NET, Node, Python). Tools: New Relic, Datadog APM, AppDynamics, Dynatrace, X-Ray.

Similar Technologies
Distributed Tracing, Profiling, Error Tracking, Performance Analytics, Application Insights
Synthetic Monitoring

Proactive testing simulating user actions. Scripted transactions running continuously from multiple locations. Detect issues before users do. API endpoint monitoring. Multi-step user flows. Measures availability and performance. Complements real user monitoring. Tools: Datadog Synthetics, New Relic Synthetics, Pingdom, Uptime monitoring.

Similar Technologies
Active Monitoring, Simulated Users, Health Checks, Uptime Monitoring, API Monitoring
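A synthetic probe reduces to "hit the endpoint, judge the result". A stdlib-only sketch (the `probe`/`evaluate` split is our choice: keeping the pass/fail rule a pure function makes it testable without a network):

```python
import time
import urllib.error
import urllib.request

def evaluate(status, elapsed_ms, max_latency_ms=500):
    """A probe passes only if the endpoint is healthy AND fast enough."""
    return status == 200 and elapsed_ms <= max_latency_ms

def probe(url, max_latency_ms=500):
    """Run one synthetic check; production systems schedule these from many regions."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered, but with an error status
    except OSError:
        status = 0         # DNS/connect/timeout failure counts as down
    elapsed_ms = (time.monotonic() - start) * 1000
    return evaluate(status, elapsed_ms, max_latency_ms)
```

Running the same probe from several regions distinguishes a global outage from a single-region network problem.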
Real User Monitoring (RUM)

Capture actual user experience data from browsers or mobile apps. Page load times, JavaScript errors, user interactions. Geographic performance variations. Device and browser breakdown. Core Web Vitals (LCP, FID, CLS). Session replays for debugging. Tools: Datadog RUM, New Relic Browser, Google Analytics, LogRocket.

Similar Technologies
End User Experience Monitoring, Browser Monitoring, Mobile Monitoring, Frontend Monitoring, Client-side Monitoring
Database Monitoring

Monitor database performance, queries, connections. Slow query identification and optimization. Connection pool utilization. Replication lag. Lock contention and deadlocks. Storage and IOPS usage. Query execution plans. Tools: CloudWatch RDS metrics, Azure SQL Insights, Query Performance Insight, Datadog Database Monitoring, SolarWinds.

Similar Technologies
Query Performance Monitoring, Database APM, SQL Monitoring, NoSQL Monitoring, Query Analysis
Log Monitoring

Centralized log aggregation and analysis. Real-time log search and filtering. Pattern detection and anomaly identification. Log-based metrics and alerts. Correlation with other telemetry. Compliance and audit trails. Structured logging for machine parsing. Tools: ELK Stack, Loki, Splunk, Sumo Logic, CloudWatch Logs Insights.

Similar Technologies
Log Analysis, SIEM, Log Management, Centralized Logging, Log Analytics
SLI, SLO, SLA

SLI (Service Level Indicator)

Quantitative measure of service level. Examples: availability percentage, latency P99, error rate, throughput. Must be measurable and meaningful to users. Multiple SLIs per service covering different aspects. Foundation for SLOs. Good SLIs align with user happiness. Avoid vanity metrics - focus on user impact.

Similar Technologies
KPIs, Metrics, Performance Indicators, Health Metrics, Quality Metrics
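An availability SLI is simply good events over total events in a measurement window; a sketch (the counts are invented):

```python
def availability_sli(success_count, total_count):
    """Availability SLI: fraction of valid requests served successfully."""
    return success_count / total_count if total_count else 1.0

# Invented window: 999,250 good responses out of 1,000,000 requests.
sli = availability_sli(999_250, 1_000_000)  # 0.99925, above a 99.9% target
```

The same shape works for any ratio SLI: requests faster than a threshold over total requests gives a latency SLI.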
SLO (Service Level Objective)

Target value or range for SLI. Example: 99.9% availability, P95 latency < 200ms, < 0.1% error rate. Internal goal, more stringent than SLA. Basis for error budget. Set based on user expectations and business requirements. Multiple SLOs possible. Review and adjust periodically. Balance ambition with achievability.

Similar Technologies
Service Targets, Performance Goals, Reliability Targets, Quality Objectives, Internal SLAs
SLA (Service Level Agreement)

Contract with users/customers specifying service commitments. Legal/financial consequences for breaches. Should be less aggressive than internal SLOs to provide buffer. Includes measurement method, time period, consequences. External promise. Difficult to change once published. SLA violations result in credits or penalties.

Similar Technologies
Service Contract, Customer Agreement, Uptime Guarantee, Performance Contract, Service Commitment
Error Budget

Acceptable amount of unreliability - the complement of the SLO. A 99.9% SLO = 0.1% error budget = 43.8 min/month of allowed downtime. Consumed by outages, degraded performance, errors. When exhausted, freeze feature releases and focus on reliability. Balances innovation (new features) with reliability. Shared ownership between dev and ops.

Similar Technologies
Downtime Budget, Failure Budget, Reliability Budget, Availability Target, Tolerance Budget
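The 43.8 min/month figure falls out of the arithmetic directly (30.44 days approximates the average Gregorian month, 365.25 / 12):

```python
def error_budget_minutes(slo, days=30.44):
    """Allowed downtime per period, in minutes, for a given SLO fraction."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)  # ~43.8 minutes of downtime per month
```

Each extra nine shrinks the budget tenfold: 99.99% leaves about 4.4 minutes a month, which rules out manual incident response alone.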
Error Budget Policy

Define consequences when error budget is exhausted. Example: freeze releases, redirect team to reliability work, defer new features. Review during incident retrospectives. Automated alerts on budget burn rate. Encourages proactive reliability work. Enables data-driven discussions between product and engineering. Part of SRE practices.

Similar Technologies
Incident Response Policy, Change Freeze, Reliability Policy, SLO Policy, Release Guidelines
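The burn-rate alerts mentioned above compare the observed error rate to the rate the budget allows (a common SRE formulation; the numbers below are illustrative):

```python
def burn_rate(errors, total, slo):
    """Budget burn rate: 1.0 means on pace to spend exactly the budget by period end."""
    observed = errors / total
    allowed = 1 - slo
    return observed / allowed

# Invented numbers: 0.5% errors against a 99.9% SLO burns budget ~5x too fast.
rate = burn_rate(errors=50, total=10_000, slo=0.999)
```

Policies often page on a high burn rate over a short window (fast burn) and ticket on a lower rate over a long window (slow burn).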
Observability Best Practices

Alert Fatigue Prevention

Too many alerts lead to ignored or missed critical alerts. Set alert thresholds based on actionability - only alert when human action needed. Use severity levels (P1 critical, P2 high, P3 medium). Group related alerts. Alert on symptoms (user impact) not causes. Auto-remediation for known issues. Regular alert hygiene - disable noisy alerts.

Similar Technologies
Alert Tuning, Alert Management, Intelligent Alerting, Event Correlation, Notification Management
On-Call Best Practices

Structured on-call rotations with primary and secondary. Runbooks for common issues. Escalation procedures and clear ownership. Reasonable on-call burden (not every night). Compensation or time off for on-call work. Post-incident reviews for learning. On-call training and shadowing. Tools: PagerDuty, Opsgenie, VictorOps.

Similar Technologies
Follow-the-Sun, Shared On-Call, Dedicated Operations Team, 24/7 NOC, Incident Management
Runbooks & Playbooks

Step-by-step procedures for common operational tasks and incident response. Include troubleshooting steps, commands to run, escalation paths. Living documents updated after incidents. Reduce MTTR with clear procedures. Enable less experienced engineers to resolve issues. Automate runbook steps when possible. Store with code in version control.

Similar Technologies
SOPs, Incident Response Plans, Troubleshooting Guides, Knowledge Base, Operational Procedures
Dashboard Design

Multiple dashboard types: high-level overview for executives, detailed drill-down for engineers, service-specific for teams. Use meaningful visualizations (time series for trends, gauges for current state). Avoid dashboard proliferation - consolidate where possible. Link dashboards to alerts. Include annotations for deployments and incidents. Tools: Grafana, Kibana, Datadog.

Similar Technologies
Monitoring Screens, Operational Views, Service Dashboards, NOC Displays, Executive Reports
Correlation (Logs + Metrics + Traces)

Connect the three pillars for a complete picture. A trace ID in log lines lets you jump from a log entry to the specific request trace. Metrics on a dashboard link to a log search. Unified telemetry with consistent labels/tags. OpenTelemetry for correlated collection. Faster troubleshooting with all context in view. Single pane of glass reduces tool switching.

Similar Technologies
Separate Tools, Manual Correlation, Unified Observability, Integrated Monitoring, Correlated Events
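One concrete way to stamp every log line with a trace ID, using a stdlib `logging.Filter` (the trace ID value and the `payments` logger name are made up; in practice the ID would come from the active span context):

```python
import logging

class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every log record so logs and traces join up."""
    def __init__(self, trace_id):
        super().__init__()
        self.trace_id = trace_id

    def filter(self, record):
        record.trace_id = self.trace_id
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter('{"trace_id": "%(trace_id)s", "msg": "%(message)s"}'))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("4bf92f3577b34da6"))  # hypothetical trace ID
logger.warning("card declined")  # now searchable by trace_id in the log store
```

With the same ID in spans and logs, a dashboard click on a slow trace can open exactly the log lines that request produced.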
Distributed Tracing

Essential for microservices understanding request flow across services. Propagate trace context (trace ID, span ID) through calls. Sample traces to manage volume (100% for errors, 1% for success). Service mesh can auto-instrument. Identify slow services and bottlenecks. Tools: Jaeger, Zipkin, X-Ray, Datadog APM, OpenTelemetry.

Similar Technologies
APM, Request Correlation, Service Mesh Observability, Span Analytics, End-to-End Tracing
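The "100% for errors, 1% for success" sampling rule is only a couple of lines (a sketch; real tracers implement this as configurable head- or tail-based sampling policies):

```python
import random

def keep_trace(has_error, success_sample_rate=0.01):
    """Keep every error trace; keep successful traces at the given sample rate."""
    if has_error:
        return True
    return random.random() < success_sample_rate
```

Biasing toward errors keeps trace volume manageable while preserving exactly the requests you will want to debug.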