High Availability
Architecture Types
Active-Active: All nodes actively serve traffic simultaneously, with load distributed across them. Provides maximum utilization and performance. If one node fails, the others continue serving traffic at reduced capacity. Requires session management and data synchronization between nodes.
Active-Passive: One node serves traffic (active) while others remain on standby (passive). Passive nodes take over only when the active node fails. Simple to implement with clear failover logic. Standby resources are underutilized but provide a clear recovery path.
Stateless Applications: Applications don't store session data locally, enabling any instance to serve any request. Facilitates horizontal scaling and simple failover. Session state is stored in an external cache (e.g., Redis) or database. A critical pattern for cloud-native HA.
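The stateless pattern can be sketched as follows. This is a minimal illustration, not production code: a plain dict stands in for the external store (Redis in practice), and all class and function names are invented for the example.

```python
# Sketch of stateless request handling: session state lives in an external
# store (Redis in production; a plain dict stands in here), so any app
# instance can serve any request. All names below are illustrative.

class ExternalSessionStore:
    """Stand-in for a shared key-value store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, state):
        self._data[session_id] = state

def handle_request(store, session_id, item):
    """Any instance can run this: it reads and writes state externally,
    keeping the application process itself stateless."""
    state = store.get(session_id)
    cart = state.get("cart", [])
    cart.append(item)
    store.put(session_id, {"cart": cart})
    return cart

store = ExternalSessionStore()
handle_request(store, "s1", "book")        # served by "instance A"
cart = handle_request(store, "s1", "pen")  # "instance B" sees the same state
```

Because no instance holds the cart in memory, any instance can fail between the two requests without losing the session.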
Stateful Applications: Applications maintain session state, requiring session replication or sticky sessions. More complex HA, with session affinity to specific nodes. Uses session clustering or distributed caches. Common with legacy applications or applications requiring in-memory state.
Redundancy Models
N+1 Redundancy: The system has the N components needed to handle load, plus 1 spare. If one fails, the spare takes over with no impact. A cost-effective balance between resilience and cost. Example: 3 servers handling load that 2 could handle, so 1 failure is tolerated.
N+M Redundancy: The system has the N components needed plus M spares, tolerating up to M failures. More resilient than N+1 for critical systems. Example: 5 servers handling load that 3 could manage, tolerating 2 failures. Higher cost but better fault tolerance.
2N Redundancy: Complete duplication of all components - twice what's needed for normal operation. Highest availability, supporting full active-active operation. Expensive but eliminates single points of failure. All components are actively used, and full capacity remains even after losing an entire set.
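The spare-capacity arithmetic behind N+1, N+M, and 2N can be checked with a one-line helper (an illustrative function, not a standard formula):

```python
# Failures tolerated with no load impact, given N components needed to
# carry the load and the number actually deployed. Illustrative helper.

def failures_tolerated(needed, deployed):
    """Spare capacity = components that can fail without losing capacity."""
    return max(deployed - needed, 0)

failures_tolerated(2, 3)  # N+1: need 2, deploy 3 -> tolerates 1 failure
failures_tolerated(3, 5)  # N+M: need 3, deploy 5 -> tolerates 2 failures
failures_tolerated(3, 6)  # 2N:  need 3, deploy 6 -> tolerates 3 failures
```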
Geographic Redundancy: Resources distributed across multiple geographic locations (availability zones or regions). Protects against datacenter or regional failures. Critical for disaster recovery and global services. Adds complexity through data synchronization and latency considerations.
Database HA Patterns
Master-Slave (Primary-Replica) Replication: One master database handles writes; slaves replicate its data and handle reads. A simple pattern for scaling read capacity. Asynchronous replication may lag behind the master. If the master fails, a slave is promoted to master. Common with MySQL, PostgreSQL, and MongoDB.
Multi-Master Replication: Multiple master databases accept writes and replicate changes bidirectionally. Provides write scalability and HA, but complex conflict resolution is needed and careful application design is required. Used in multi-region active-active architectures.
Database Clustering: Multiple database nodes act as a single system using shared storage or distributed consensus. Automatic failover, typically with no data loss. Examples: Oracle RAC, SQL Server AlwaysOn, PostgreSQL with Patroni, MySQL Group Replication. Higher availability than simple replication.
Read Replicas: Read-only copies of the database that scale read queries. Asynchronous replication from the master gives eventual consistency. Reduces master load for reporting and analytics. Available in RDS, Aurora, and Cloud SQL. Can be cross-region for DR.
Distributed Databases: The database is spread across nodes with built-in replication and partitioning. Cassandra, DynamoDB, and CockroachDB offer tunable consistency and availability. No single master; a peer-to-peer architecture. Scales horizontally with automatic data distribution.
Synchronous vs Asynchronous Replication: Synchronous replication waits for replicas to acknowledge before commit (zero data loss but higher latency). Asynchronous doesn't wait (lower latency but potential data loss on failover). Choose based on RPO: synchronous for zero RPO, asynchronous for performance. Configurable in most databases.
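As a toy illustration of the trade-off (not real database code - replicas are modelled as in-memory lists and all names are invented), the two commit paths differ only in whether the primary waits for replica acknowledgement:

```python
# Toy sketch: a synchronous commit confirms replication before
# acknowledging the client (zero RPO, higher latency); an asynchronous
# commit acknowledges immediately (lower latency, possible data loss if
# the primary crashes before replication catches up). Illustrative only.

def commit(primary_log, replicas, record, synchronous):
    primary_log.append(record)
    if synchronous:
        # Wait for every replica to confirm before acknowledging.
        for replica in replicas:
            replica.append(record)
        return "committed-durably"
    # Async: replication happens later; a crash now loses the record.
    return "committed-locally"
```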
Cluster Configurations
Quorum: The cluster requires a majority (quorum) of nodes to be operational. Prevents split-brain scenarios where the cluster splits into multiple independent groups. Common in etcd, Consul, and ZooKeeper. An odd number of nodes (3, 5, 7) is recommended for a clear majority.
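The majority rule is simple arithmetic - a strict majority of n nodes is floor(n/2) + 1 - and it shows why odd cluster sizes are preferred: during any network split, at most one side can hold a majority.

```python
# Quorum size for an n-node cluster: a strict majority, floor(n/2) + 1.

def quorum_size(n):
    return n // 2 + 1

def has_quorum(alive, total):
    return alive >= quorum_size(total)

# 5-node cluster: quorum is 3. In a 2/3 network split, only the
# 3-node side can make progress, so split-brain is impossible.
quorum_size(3)    # 2
quorum_size(5)    # 3
has_quorum(2, 5)  # False - the minority side stops serving writes
```

Note that going from 3 to 4 nodes raises quorum from 2 to 3 without tolerating any extra failures, which is why 4-node clusters are rarely worth the cost.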
Split-Brain Prevention: Mechanisms preventing the cluster from splitting into independent segments that each think they're primary. Uses quorum, fencing, STONITH (Shoot The Other Node In The Head), or witness nodes. Critical for data integrity in HA clusters.
Heartbeats: Nodes send periodic heartbeat messages proving they're alive. Missed heartbeats trigger failover. Configure the timeout carefully - too short causes false positives, too long delays failover. Use multiple network paths for heartbeat redundancy.
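A minimal heartbeat-based failure detector looks like this (a sketch with invented names; timestamps are passed in explicitly rather than read from a clock so the timeout logic is easy to test):

```python
# A node is suspected failed after `timeout` seconds without a heartbeat.
# Real systems add redundancy (multiple network paths) and often use
# adaptive timeouts; this shows only the core check. Illustrative names.

class FailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        last = self.last_seen.get(node)
        return last is not None and (now - last) <= self.timeout

fd = FailureDetector(timeout=3.0)
fd.heartbeat("node-a", now=10.0)
fd.is_alive("node-a", now=12.0)  # True: within the timeout window
fd.is_alive("node-a", now=14.0)  # False: missed heartbeats -> failover
```

The single `timeout` parameter is exactly the tuning knob described above: lowering it speeds up failover but makes a transient network blip look like a dead node.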
Automatic Failover: The system automatically detects failures and promotes a standby to active without human intervention. Reduces downtime but requires careful testing to avoid false positives. Includes health checks, decision logic, and a promotion process. Common in RDS, load balancers, and Kubernetes.
Shared Storage Clustering: Multiple nodes access shared storage (SAN, NAS). The active node has exclusive access; a standby takes over the storage during failure. Simpler than distributed systems, but the shared storage is a potential SPOF. The traditional enterprise HA approach.
Shared-Nothing Architecture: Each node has independent storage and resources with no shared components. Data is replicated between nodes. More scalable than shared storage, with no single point of failure. The modern cloud-native approach used by distributed databases.
SLA & Availability Calculations
99.9% ("Three Nines"): 43.8 minutes of downtime per month (8.76 hours per year). The standard SLA for many cloud services. Achievable with a single region, basic redundancy, and well-tested failover. Suitable for non-critical applications that can tolerate brief outages.
99.99% ("Four Nines"): 4.38 minutes of downtime per month (52.6 minutes per year). Requires multi-AZ deployment, automated failover, and robust monitoring. A common target for business-critical applications. Significantly more expensive than 99.9% due to redundancy requirements.
99.999% ("Five Nines"): 26.3 seconds of downtime per month (5.26 minutes per year). Extremely high availability requiring multi-region active-active, automatic failover, and extensive testing. Very expensive to achieve and maintain. Reserved for mission-critical systems.
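The downtime figures above all come from one conversion - the fraction of a 525,600-minute year that the SLA allows to be unavailable:

```python
# Allowed downtime per year for a given availability target.
# 365 days * 24 h * 60 min = 525,600 minutes per (non-leap) year.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct):
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

downtime_minutes_per_year(99.9)    # ~525.6 min  (~8.76 hours)
downtime_minutes_per_year(99.99)   # ~52.6 min
downtime_minutes_per_year(99.999)  # ~5.26 min
```

Divide the yearly figure by 12 for the per-month numbers quoted above (e.g., 525.6 / 12 = 43.8 minutes for 99.9%).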
Composite Availability: Overall system availability calculated from component availabilities. Serial components multiply (A × B); parallel components use 1 - (1-A)(1-B). Example: two 99.9% components in series give 99.8%; in parallel, 99.9999%. Design for redundancy to improve composite availability.
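The serial and parallel formulas can be written directly as code and checked against the example figures:

```python
# Composite availability: serial components multiply; parallel
# (redundant) components fail only when every copy fails, so
# availability is 1 minus the product of the failure probabilities.

def serial(*availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    failure = 1.0
    for a in availabilities:
        failure *= (1 - a)
    return 1 - failure

serial(0.999, 0.999)    # ~0.998    - two 99.9% components in series
parallel(0.999, 0.999)  # ~0.999999 - the same pair in parallel
```

The asymmetry is the whole argument for redundancy: chaining components erodes availability, while duplicating them multiplies the nines.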
Availability vs Reliability: Availability is the uptime percentage (the system is operational). Reliability is the probability of failure-free operation over time (measured by MTBF). A highly available system can still be unreliable if it fails frequently but recovers quickly. Both metrics are needed for a complete picture.
Planned vs Unplanned Downtime: Planned downtime is scheduled maintenance (deployments, patches); unplanned downtime covers failures and disasters. SLAs may exclude planned downtime. Use blue-green deployments and rolling updates to eliminate planned downtime and improve availability.
Load Balancer HA
Multi-AZ Deployment: The load balancer is deployed across multiple availability zones and automatically routes traffic away from failed AZs. Cloud load balancers (ALB, NLB, Azure Load Balancer, Cloud Load Balancing) are multi-AZ by default. Critical for application-level high availability.
Health Checks: The load balancer probes backend targets to verify they're healthy; unhealthy targets are removed from rotation. Configure appropriate check intervals, thresholds, and endpoints. Essential for automatic failover and for preventing traffic from reaching failed instances.
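The threshold behaviour can be sketched as a small state machine. The parameter names mirror common cloud load balancer settings, but the class and values here are illustrative, not any vendor's actual defaults:

```python
# Health-check state machine: a target is marked unhealthy after
# `unhealthy_threshold` consecutive failed probes, and healthy again
# after `healthy_threshold` consecutive successes. Thresholds exist so
# a single dropped probe doesn't pull a healthy target from rotation.

class TargetHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive probes contradicting current state

    def record(self, probe_ok):
        """Feed in one probe result; returns the current health state."""
        if probe_ok == self.healthy:
            self.streak = 0  # probe agrees with current state
            return self.healthy
        self.streak += 1
        threshold = (self.unhealthy_threshold if self.healthy
                     else self.healthy_threshold)
        if self.streak >= threshold:
            self.healthy = not self.healthy  # flip state
            self.streak = 0
        return self.healthy
```

With these values, one failed probe is ignored, three in a row remove the target, and two consecutive successes restore it.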
Connection Draining: When removing an instance from the load balancer, allow existing connections to complete before termination. Prevents disruption to active requests during deployments or scaling. Configurable timeout (e.g., 300 seconds). Also called deregistration delay.
Cross-Zone Load Balancing: Distributes traffic evenly across instances in all enabled AZs, not just within each AZ. Improves distribution but increases cross-AZ data transfer costs. Enabled by default in ALB, optional in NLB. Consider it when instances are unevenly distributed across zones.
Global Load Balancing: Routes traffic based on geography, latency, or health across multiple regions. Uses DNS-based services (Route 53, Traffic Manager, Cloud DNS) or anycast routing. Provides DR failover and performance optimization. Critical for global applications.
Sticky Sessions vs Stateless: Sticky sessions route a user to the same backend (session affinity); required for stateful apps but reduces routing flexibility. Stateless apps allow any backend, enabling better distribution and simpler failover. Prefer stateless designs with an external session store.
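The two routing strategies can be contrasted in a few lines. This is a sketch with invented backend names; real load balancers implement affinity via cookies or connection tuples rather than a bare hash:

```python
# Sticky routing pins a session to one backend via a stable hash of its
# id; stateless routing can pick any backend (round-robin here).

import hashlib

BACKENDS = ["app-1", "app-2", "app-3"]

def sticky_backend(session_id, backends=BACKENDS):
    """Same session id always maps to the same backend (affinity)."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

def stateless_backend(request_counter, backends=BACKENDS):
    """Any backend works, so plain round-robin spreads load evenly."""
    return backends[request_counter % len(backends)]
```

The failover difference follows directly: if a sticky backend dies, its sessions are lost or must be rebuilt, whereas under stateless routing the next request simply lands elsewhere.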
