Observability
Three Pillars of Observability
Logs: Timestamped records of discrete events in the system, either structured (JSON with fields) or unstructured (plain text). Centralized aggregation (ELK, Loki, CloudWatch Logs) for search and analysis. Used to debug issues, build audit trails, and support security investigations. High volume requires efficient storage and querying. Log levels (DEBUG, INFO, WARN, ERROR) control verbosity.
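A minimal sketch of structured JSON logging using Python's standard library; the logger name and fields are illustrative:

    import json, logging, sys

    class JsonFormatter(logging.Formatter):
        # Emit each record as one JSON object so log platforms can index the fields.
        def format(self, record):
            return json.dumps({
                "timestamp": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.getLogger("checkout").info("payment accepted")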
Metrics: Numerical measurements over time, stored as time-series data. Examples: CPU utilization, request rate, error rate, response time. Aggregated and sampled for efficiency. Prometheus, CloudWatch, and Datadog for collection and storage. Dashboards for visualization and alerting on thresholds. Lower storage cost than logs.
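A minimal sketch of exposing metrics with the Prometheus Python client (prometheus_client); the metric names, labels, and buckets are illustrative:

    import random, time
    from prometheus_client import Counter, Histogram, start_http_server

    # Counter covers request/error rate; histogram supports latency percentiles.
    REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
    LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                        buckets=(0.05, 0.1, 0.2, 0.5, 1, 2))

    start_http_server(8000)  # serves /metrics for Prometheus to scrape

    while True:
        duration = random.uniform(0.01, 0.5)  # stand-in for real request handling
        LATENCY.observe(duration)
        REQUESTS.labels(status="200").inc()
        time.sleep(duration)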
Traces: End-to-end view of a request's flow through a distributed system. Spans represent operations with parent-child relationships; correlation IDs link related spans. Identifies bottlenecks and failures across microservices. Distributed tracing tools: Jaeger, X-Ray, Zipkin. OpenTelemetry for standardization. Critical for understanding microservice interactions.
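A minimal OpenTelemetry sketch showing parent-child spans exported to the console; the service and span names are illustrative:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    tracer = trace.get_tracer("order-service")

    # The child span is linked to the parent via the active context,
    # forming the parent-child relationships described above.
    with tracer.start_as_current_span("handle_order") as parent:
        parent.set_attribute("order.id", "1234")
        with tracer.start_as_current_span("charge_payment"):
            pass  # call to the payment service would go here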
Golden Signals (Google SRE)
Latency: Time to service a request - how long does it take? Measure P50, P95, P99 percentiles, not just the average. Distinguish successful from failed request latency. High latency indicates performance issues or resource constraints. Set SLO targets (e.g., P95 < 200ms). Critical user experience metric.
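A quick sketch of why percentiles matter more than the average (the latency values are made up):

    # One slow outlier barely moves P50 but dominates the average and P95.
    latencies_ms = sorted([12, 15, 18, 20, 22, 25, 30, 35, 40, 950])

    def percentile(sorted_values, pct):
        # Nearest-rank percentile; adequate for a rough illustration.
        index = int(round(pct / 100 * (len(sorted_values) - 1)))
        return sorted_values[index]

    print("avg:", sum(latencies_ms) / len(latencies_ms))  # ~116.7 ms, skewed by the outlier
    print("P50:", percentile(latencies_ms, 50))           # 22 ms
    print("P95:", percentile(latencies_ms, 95))           # 950 ms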
Traffic: Demand on the system - how much activity? Requests per second, transactions per second, concurrent users, bandwidth usage. Understand capacity and growth trends. Correlate with other signals for context. Traffic spikes may be legitimate or attacks. Used for capacity planning and scaling decisions.
Errors: Rate of failed requests - what percentage fails? HTTP 5xx responses, exceptions, failed transactions. Distinguish user errors (4xx) from system errors (5xx). Error budget consumption for SLO tracking. Alert on error-rate increases and investigate error patterns and root causes. Critical for reliability measurement.
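A sketch of an error-rate calculation that counts only server-side (5xx) failures; the sample status codes are made up:

    status_codes = [200, 200, 404, 200, 500, 200, 503, 200, 200, 400]

    total = len(status_codes)
    server_errors = sum(1 for code in status_codes if 500 <= code < 600)  # system errors
    client_errors = sum(1 for code in status_codes if 400 <= code < 500)  # user errors, excluded

    error_rate = server_errors / total
    print(f"error rate: {error_rate:.1%}")  # 20.0% here; compare against the SLO threshold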
Saturation: System resource utilization - how full is the system? CPU, memory, disk, network usage; queue depth and wait times. Constrained resources cause performance degradation, so scale proactively before saturation. Different services have different constraints. Monitor I/O, connection pools, thread pools.
Monitoring Types
Infrastructure monitoring: Monitor hosts, VMs, containers, and networks. CPU, memory, disk, and network metrics; OS-level health. Agent-based (CloudWatch Agent, Datadog Agent) or agentless collection. Infrastructure is the foundation for application performance. Integrates with auto-scaling and remediation. Tools: CloudWatch, Azure Monitor, Prometheus.
Application performance monitoring (APM): Deep visibility into application behavior. Transaction tracing, code-level profiling, error tracking. Distributed tracing across microservices, database query performance, external service calls. Root cause analysis. Language-specific agents (Java, .NET, Node, Python). Tools: New Relic, Datadog APM, AppDynamics, Dynatrace, X-Ray.
Synthetic monitoring: Proactive testing that simulates user actions. Scripted transactions run continuously from multiple locations to detect issues before users do. API endpoint monitoring and multi-step user flows. Measures availability and performance. Complements real user monitoring. Tools: Datadog Synthetics, New Relic Synthetics, Pingdom, uptime monitors.
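A minimal synthetic check sketch using the requests library; the URL, timeout, and latency budget are placeholders:

    import time
    import requests

    def synthetic_check(url, timeout_s=5, latency_budget_s=1.0):
        # Simulate a user hitting the endpoint; verify both availability and speed.
        start = time.monotonic()
        try:
            response = requests.get(url, timeout=timeout_s)
            elapsed = time.monotonic() - start
            healthy = response.status_code == 200 and elapsed < latency_budget_s
            return healthy, response.status_code, elapsed
        except requests.RequestException:
            return False, None, time.monotonic() - start

    # Run this on a schedule from multiple regions and alert when it fails.
    print(synthetic_check("https://example.com/health"))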
Real user monitoring (RUM): Captures actual user experience data from browsers or mobile apps. Page load times, JavaScript errors, user interactions. Geographic performance variations; device and browser breakdowns. Core Web Vitals (LCP, FID, CLS). Session replays for debugging. Tools: Datadog RUM, New Relic Browser, Google Analytics, LogRocket.
Database monitoring: Monitors database performance, queries, and connections. Slow query identification and optimization, connection pool utilization, replication lag, lock contention and deadlocks, storage and IOPS usage, query execution plans. Tools: CloudWatch RDS metrics, Azure SQL Insights, Query Performance Insight, Datadog Database Monitoring, SolarWinds.
Log management: Centralized log aggregation and analysis. Real-time log search and filtering, pattern detection, and anomaly identification. Log-based metrics and alerts; correlation with other telemetry. Compliance and audit trails. Structured logging for machine parsing. Tools: ELK Stack, Loki, Splunk, Sumo Logic, CloudWatch Logs Insights.
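A small sketch of deriving a log-based metric from structured (JSON) log lines; the sample lines are made up:

    import json
    from collections import Counter

    raw_lines = [
        '{"level": "INFO", "message": "request ok"}',
        '{"level": "ERROR", "message": "db timeout"}',
        '{"level": "ERROR", "message": "db timeout"}',
    ]

    # Counting occurrences per level is the same idea log platforms use
    # for log-based metrics and alerts.
    counts = Counter(json.loads(line)["level"] for line in raw_lines)
    print(counts)  # Counter({'ERROR': 2, 'INFO': 1})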
SLI, SLO, SLA
SLI (Service Level Indicator): A quantitative measure of service level. Examples: availability percentage, P99 latency, error rate, throughput. Must be measurable and meaningful to users. Multiple SLIs per service cover different aspects. Foundation for SLOs. Good SLIs align with user happiness; avoid vanity metrics and focus on user impact.
SLO (Service Level Objective): A target value or range for an SLI. Examples: 99.9% availability, P95 latency < 200ms, < 0.1% error rate. An internal goal, more stringent than the SLA, and the basis for the error budget. Set based on user expectations and business requirements. Multiple SLOs are possible; review and adjust periodically, balancing ambition with achievability.
SLA (Service Level Agreement): A contract with users/customers specifying service commitments, with legal or financial consequences for breaches. Should be less aggressive than internal SLOs to provide a buffer. Includes the measurement method, time period, and consequences. An external promise, difficult to change once published. SLA violations result in credits or penalties.
Error budget: The acceptable amount of unreliability - the inverse of the SLO. A 99.9% SLO means a 0.1% error budget, or roughly 43.8 minutes of downtime allowed per month. Consumed by outages, degraded performance, and errors. When exhausted, freeze feature releases and focus on reliability. Balances innovation (new features) with reliability. Shared ownership between dev and ops.
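The downtime arithmetic behind the error budget, as a quick sketch:

    # Error budget = 1 - SLO target, converted to allowed downtime per window.
    slo = 0.999                          # 99.9% availability objective
    minutes_per_month = 30.44 * 24 * 60  # average month length in minutes
    budget_minutes = (1 - slo) * minutes_per_month
    print(f"allowed downtime: {budget_minutes:.1f} min/month")  # ~43.8 min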
Error budget policy: Defines the consequences when the error budget is exhausted - for example, freeze releases, redirect the team to reliability work, defer new features. Review during incident retrospectives. Automated alerts on budget burn rate. Encourages proactive reliability work and enables data-driven discussions between product and engineering. Part of SRE practice.
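A sketch of the burn-rate idea behind such alerts; the window and paging threshold are illustrative:

    def burn_rate(observed_error_rate, slo=0.999):
        # Burn rate 1.0 means consuming the budget exactly as fast as the SLO allows.
        allowed_error_rate = 1 - slo
        return observed_error_rate / allowed_error_rate

    # e.g. 2% errors over the last hour against a 99.9% SLO burns the budget 20x
    # faster than sustainable; a fast-burn threshold around 14x is commonly cited.
    if burn_rate(observed_error_rate=0.02) > 14:
        print("page on-call: error budget burning too fast")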
Observability Best Practices
Avoid alert fatigue: Too many alerts lead to ignored or missed critical alerts. Set alert thresholds based on actionability - only alert when human action is needed. Use severity levels (P1 critical, P2 high, P3 medium). Group related alerts. Alert on symptoms (user impact), not causes. Auto-remediate known issues. Practice regular alert hygiene - disable noisy alerts.
On-call: Structured on-call rotations with primary and secondary responders. Runbooks for common issues; escalation procedures and clear ownership. Keep the on-call burden reasonable (not every night) and provide compensation or time off for on-call work. Post-incident reviews for learning; on-call training and shadowing. Tools: PagerDuty, Opsgenie, VictorOps.
Runbooks: Step-by-step procedures for common operational tasks and incident response. Include troubleshooting steps, commands to run, and escalation paths. Living documents updated after incidents. Reduce MTTR with clear procedures and enable less experienced engineers to resolve issues. Automate runbook steps where possible. Store with code in version control.
Dashboards: Multiple dashboard types - high-level overviews for executives, detailed drill-downs for engineers, service-specific views for teams. Use meaningful visualizations (time series for trends, gauges for current state). Avoid dashboard proliferation - consolidate where possible. Link dashboards to alerts. Include annotations for deployments and incidents. Tools: Grafana, Kibana, Datadog.
Correlation: Connect the three pillars for a complete picture. Put the trace ID in logs to jump from metrics to a specific request trace; link metrics on a dashboard to log searches. Unify telemetry with consistent labels/tags. OpenTelemetry for correlated collection. Faster troubleshooting when all context is visible; a single pane of glass reduces tool switching.
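A sketch of stamping the active OpenTelemetry trace ID onto log lines so a log entry can be joined to its trace (assumes a tracer provider is already configured; OpenTelemetry's logging instrumentation can automate this injection):

    import logging
    from opentelemetry import trace

    logging.basicConfig(level=logging.INFO,
                        format="%(levelname)s trace_id=%(trace_id)s %(message)s")
    logger = logging.getLogger("orders")

    def log_with_trace(message):
        # Attach the current trace ID so the log line is searchable by trace.
        ctx = trace.get_current_span().get_span_context()
        logger.info(message, extra={"trace_id": format(ctx.trace_id, "032x")})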
Distributed tracing: Essential for understanding request flow across microservices. Propagate trace context (trace ID, span ID) through calls. Sample traces to manage volume (e.g., 100% for errors, 1% for successes). A service mesh can auto-instrument. Identify slow services and bottlenecks. Tools: Jaeger, Zipkin, X-Ray, Datadog APM, OpenTelemetry.
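A sketch of probabilistic, parent-based sampling with the OpenTelemetry SDK; the 1% ratio mirrors the guidance above, while always keeping error traces typically needs tail-based sampling in a collector (not shown):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

    # Sample ~1% of new traces; child spans follow the parent's decision,
    # so a trace is kept or dropped as a whole across services.
    sampler = ParentBased(TraceIdRatioBased(0.01))
    trace.set_tracer_provider(TracerProvider(sampler=sampler))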
