High Availability

Architecture Types

Active-Active

All nodes actively serve traffic simultaneously with load distributed across them. Provides maximum utilization and performance. If one node fails, others continue serving traffic with reduced capacity. Requires session management and data synchronization between nodes.

Similar Technologies
Active-Passive, Multi-Master, Distributed Active, Load-Balanced Cluster, Horizontal Scaling
Active-Passive

One node serves traffic (active) while others remain on standby (passive). Passive nodes take over only during active node failure. Simple to implement with clear failover logic. Standby resources are underutilized but provide clear recovery path.

Similar Technologies
Active-Active, Hot Standby, Master-Slave, Primary-Secondary, Failover Cluster
Stateless Application Design

Applications don't store session data locally, enabling any instance to serve any request. Facilitates horizontal scaling and simple failover. Session state stored in external cache (Redis) or database. Critical pattern for cloud-native HA.

Similar Technologies
Sticky Sessions, Session Replication, Client-side State, Token-based Auth, Stateful Design
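The stateless pattern above can be sketched in a few lines. This is a minimal illustration, not a production design: the in-memory dict stands in for Redis or a database, and all class and variable names are hypothetical.

```python
# Externalized session state: any app instance can serve any request
# because no instance holds session data of its own. The dict here is a
# stand-in for Redis so the example is self-contained.
import uuid

class SessionStore:
    """External session store shared by all app instances."""
    def __init__(self):
        self._sessions = {}  # stand-in for Redis

    def create(self, user):
        sid = str(uuid.uuid4())
        self._sessions[sid] = {"user": user}
        return sid

    def get(self, sid):
        return self._sessions.get(sid)

class AppInstance:
    """Stateless app node: keeps no local session data."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, sid):
        session = self.store.get(sid)
        if session is None:
            return "401 no session"
        return f"200 hello {session['user']} (served by {self.name})"

store = SessionStore()
node_a = AppInstance("node-a", store)
node_b = AppInstance("node-b", store)

sid = store.create("alice")
# Either instance can serve the same session; failover needs no handoff.
print(node_a.handle(sid))
print(node_b.handle(sid))
```

Because the session lives outside the instances, a load balancer can route the next request anywhere, which is exactly what makes failover and horizontal scaling simple.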
Stateful Application HA

Applications maintain session state requiring session replication or sticky sessions. More complex HA with session affinity to specific nodes. Uses session clustering or distributed caches. Common with legacy applications or applications requiring in-memory state.

Similar Technologies
Stateless Design, Session Store, Session Affinity, Session Clustering, Distributed Cache

Redundancy Models

N+1 Redundancy

System has N components needed to handle load, plus 1 spare. If one fails, spare takes over with no impact. Cost-effective balance between redundancy and cost. Example: 3 servers handling load that 2 could handle, so 1 failure is tolerated.

Similar Technologies
N+M, 2N, Active-Active, Full Redundancy, Minimal Redundancy
N+M Redundancy

System has N components needed plus M spares, tolerating up to M failures. More resilient than N+1 for critical systems. Example: 5 servers handling load that 3 could manage, tolerating 2 failures. Higher cost but better fault tolerance.

Similar Technologies
N+1, 2N, Over-provisioning, Multiple Spares, High Redundancy
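The N+1 and N+M arithmetic above reduces to two small functions. A sketch using the examples from the text:

```python
def spares(total, needed):
    """M in an N+M layout: how many simultaneous failures are tolerated."""
    return total - needed

def survives(total, needed, failed):
    """True if the remaining nodes still cover the required load."""
    return total - failed >= needed

# Examples from the text:
print(spares(3, 2))       # N+1: 3 servers, 2 needed -> tolerates 1 failure
print(spares(5, 3))       # N+M: 5 servers, 3 needed -> tolerates 2 failures
print(survives(5, 3, 2))  # True:  2 failures leave 3 nodes, enough for load
print(survives(5, 3, 3))  # False: 3 failures leave only 2 nodes
```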
2N Redundancy

Complete duplication of all components: twice what's needed for normal operation. Highest availability with full active-active operation. Expensive, but eliminates single points of failure. All components are actively used, so ample capacity remains after a single failure.

Similar Technologies
N+1, N+M, Full Duplication, Active-Active, Mirror Configuration
Geographic Redundancy

Resources distributed across multiple geographic locations (availability zones or regions). Protects against datacenter or regional failures. Critical for disaster recovery and global services. Adds complexity with data synchronization and latency considerations.

Similar Technologies
Single Location HA, Multi-Zone, Multi-Region, Global Distribution, Geo-Replication

Database HA Patterns

Master-Slave Replication

One master database handles writes while slaves replicate its data and serve reads. A simple pattern for scaling read capacity. Asynchronous replication may lag behind the master. If the master fails, a slave is promoted to master. Common with MySQL, PostgreSQL, and MongoDB.

Similar Technologies
Master-Master, Multi-Master, Sharding, Read Replicas, Clustering
Master-Master Replication

Multiple master databases accepting writes, bidirectionally replicating changes. Provides write scalability and HA. Complex conflict resolution needed. Requires careful application design. Used in multi-region active-active architectures.

Similar Technologies
Master-Slave, Sharding, Distributed Database, Multi-Region Master, Conflict-free Replicated Data
Database Clustering

Multiple database nodes acting as single system with shared storage or distributed consensus. Automatic failover with no data loss. Examples: Oracle RAC, SQL Server AlwaysOn, PostgreSQL Patroni, MySQL Group Replication. Higher availability than replication.

Similar Technologies
Replication, Sharding, Distributed Database, High-Availability Groups, Shared-Nothing Architecture
Read Replicas

Read-only copies of a database that scale out read queries. Replicated asynchronously from the master, so reads are eventually consistent. Offloads reporting and analytics traffic from the master. Available in RDS, Aurora, and Cloud SQL. Can be placed cross-region for DR.

Similar Technologies
Caching, Database Sharding, CQRS, Read-Through Cache, Query Offloading
Distributed Databases

Database distributed across nodes with built-in replication and partitioning. Cassandra, DynamoDB, CockroachDB offer tunable consistency and availability. No single master, peer-to-peer architecture. Scales horizontally with automatic data distribution.

Similar Technologies
Traditional Clustering, Master-Slave, Sharding, NewSQL Databases, Multi-Master
Synchronous vs Asynchronous Replication

Synchronous waits for replicas before commit (zero data loss but higher latency). Asynchronous doesn't wait (lower latency but potential data loss). Choose based on RPO: synchronous for zero RPO, asynchronous for performance. Configurable in most databases.

Similar Technologies
Semi-Synchronous, Group Replication, Logical Replication, Physical Replication, Snapshot Replication
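The RPO implication of the choice above can be shown with a toy model. This is a deliberately simplified sketch (no networking, no background shipper); all names are illustrative.

```python
# Synchronous commit is acknowledged only after every replica has the
# write; asynchronous commit acknowledges immediately, so a primary
# crash can lose writes that were not yet shipped.

class Replica:
    def __init__(self):
        self.log = []

class Primary:
    def __init__(self, replicas):
        self.log = []
        self.replicas = replicas
        self.pending = []  # writes not yet shipped (async mode)

    def commit_sync(self, record):
        self.log.append(record)
        for r in self.replicas:  # wait for every replica before acking
            r.log.append(record)
        return "ack"

    def commit_async(self, record):
        self.log.append(record)
        self.pending.append(record)  # a background task would ship this later
        return "ack"

replicas = [Replica(), Replica()]
p = Primary(replicas)
p.commit_sync("order-1")
p.commit_async("order-2")

# If the primary crashes now, "order-1" is on every replica (RPO = 0),
# but "order-2" exists only on the primary and would be lost.
print(all("order-1" in r.log for r in replicas))  # True
print(any("order-2" in r.log for r in replicas))  # False
```

The latency cost of synchronous mode is the loop over replicas sitting inside the commit path; asynchronous mode moves that work out of the caller's critical path at the price of a nonzero RPO.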

Cluster Configurations

Quorum-Based Clustering

Cluster requires majority (quorum) of nodes to be operational. Prevents split-brain scenarios where cluster splits into multiple independent groups. Common in etcd, Consul, ZooKeeper. Odd number of nodes recommended (3, 5, 7) for clear majority.

Similar Technologies
Witness Node, Fencing, STONITH, Tiebreaker, Consensus Protocol
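The quorum rule and the reason for odd cluster sizes are easy to make concrete:

```python
def quorum(cluster_size):
    """Smallest majority: the number of nodes that must agree."""
    return cluster_size // 2 + 1

def has_quorum(cluster_size, reachable):
    return reachable >= quorum(cluster_size)

# Why odd sizes are recommended: a 5-node cluster split 3/2 leaves one
# side with quorum (3 >= 3); a 4-node cluster split 2/2 leaves neither
# side with quorum (2 < 3), so the whole cluster stalls.
print(quorum(3), quorum(5), quorum(7))  # 2 3 4
print(has_quorum(5, 3))                 # True  -> majority side keeps serving
print(has_quorum(4, 2))                 # False -> no side may proceed
```

Note that adding a fourth node to a three-node cluster raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why even sizes buy little.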
Split-Brain Prevention

Mechanisms that prevent a cluster from splitting into independent segments that each believe they are primary. Uses quorum, fencing, STONITH (Shoot The Other Node In The Head), or witness nodes. Critical for data integrity in HA clusters.

Similar Technologies
Quorum, Fencing, Arbitrator, Network Partition Handling, Cluster Membership
Heartbeat Monitoring

Nodes send periodic heartbeat messages proving they're alive. Missed heartbeats trigger failover. Configure timeout carefully - too short causes false positives, too long delays failover. Use multiple network paths for heartbeat redundancy.

Similar Technologies
Health Checks, Keepalived, Cluster Monitoring, Failure Detection, Liveness Probes
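A heartbeat-based failure detector can be sketched as below; the timeout trade-off from the text is visible in the single `timeout` parameter. Times are passed in explicitly to keep the example deterministic; all names are illustrative.

```python
# Nodes report heartbeats; any node silent for longer than `timeout`
# is suspected failed. Too small a timeout flags slow-but-alive nodes
# (false positives); too large a timeout delays failover.

class FailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> time of most recent heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def suspected(self, now):
        """Nodes whose heartbeat has been missing longer than the timeout."""
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

fd = FailureDetector(timeout=3.0)
fd.heartbeat("node-a", now=0.0)
fd.heartbeat("node-b", now=0.0)
fd.heartbeat("node-a", now=2.0)  # node-b goes quiet after t=0

print(fd.suspected(now=4.0))     # ['node-b'] -> candidate for failover
```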
Automatic Failover

System automatically detects failures and promotes standby to active without human intervention. Reduces downtime but requires careful testing to avoid false positives. Includes health checks, decision logic, and promotion process. Common in RDS, load balancers, Kubernetes.

Similar Technologies
Manual Failover, Semi-Automatic Failover, Operator Intervention, Health-Check Based, Consensus-Based Failover
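The detect-then-promote flow can be sketched as a tiny controller. This is a toy sequence, not how RDS or Kubernetes implement it; requiring several consecutive failed probes before promoting is the guard against false positives mentioned above. All names are hypothetical.

```python
# Minimal automatic-failover sketch: probe the active node, and promote
# the standby only after `checks` consecutive failed probes.

class Node:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.role = "standby"

def failover_controller(active, standby, checks=3):
    """Promote standby if the active node fails `checks` consecutive probes."""
    for _ in range(checks):
        if active.healthy:
            return active  # still fine, nothing to do
    # active missed every probe: mark it failed and promote the standby
    active.role = "failed"
    standby.role = "active"
    return standby

primary = Node("db-1")
primary.role = "active"
replica = Node("db-2")

primary.healthy = False  # simulate a crash
new_active = failover_controller(primary, replica)
print(new_active.name, new_active.role)  # db-2 active
```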
Shared Storage Clustering

Multiple nodes access shared storage (SAN, NAS). The active node has exclusive access; a standby takes over the storage during failure. Simpler than distributed systems, but the shared storage itself is a potential SPOF. The traditional enterprise HA approach.

Similar Technologies
Shared-Nothing, Distributed Storage, Replicated Storage, Local Storage HA, Cloud Block Storage
Shared-Nothing Architecture

Each node has independent storage and resources with no shared components. Data replicated between nodes. More scalable than shared storage and no single point of failure. Modern cloud-native approach used by distributed databases.

Similar Technologies
Shared Storage, Shared Disk, Distributed Database, Microservices, Independent Nodes

SLA & Availability Calculations

99.9% Availability (Three Nines)

43.8 minutes of downtime per month (8.76 hours per year). Standard SLA for many cloud services. Achievable with single region, basic redundancy, and well-tested failover. Suitable for non-critical applications with tolerance for brief outages.

Similar Technologies
99.95%, 99.99%, Lower SLA, Higher SLA, Custom SLA
99.99% Availability (Four Nines)

4.38 minutes of downtime per month (52.6 minutes per year). Requires multi-AZ deployment, automated failover, and robust monitoring. Common target for business-critical applications. Significantly more expensive than 99.9% due to redundancy requirements.

Similar Technologies
99.9%, 99.999%, Three Nines, Five Nines, Enterprise SLA
99.999% Availability (Five Nines)

26.3 seconds of downtime per month (5.26 minutes per year). Extremely high availability requiring multi-region active-active, automatic failover, and extensive testing. Very expensive to achieve and maintain. Reserved for mission-critical systems.

Similar Technologies
99.99%, 99.9999%, Four Nines, Six Nines, Ultra-High Availability
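All of the downtime figures above come from one formula: allowed downtime equals (1 − availability) times the period. The per-month figures in the text divide the yearly budget by 12.

```python
def downtime_budget(availability, period_minutes):
    """Allowed downtime in minutes for a given availability fraction."""
    return (1.0 - availability) * period_minutes

YEAR = 365 * 24 * 60   # 525,600 minutes
MONTH = YEAR / 12      # 43,800 minutes

for a in (0.999, 0.9999, 0.99999):
    print(f"{a}: {downtime_budget(a, YEAR):.1f} min/year, "
          f"{downtime_budget(a, MONTH):.2f} min/month")
```

Running this reproduces the numbers in the sections above: 99.9% allows 525.6 min/year (43.80 min/month), 99.99% allows 52.56 min/year (4.38 min/month), and 99.999% allows about 5.26 min/year.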
Composite Availability

Overall system availability calculated from component availabilities. For serial components multiply (A × B), for parallel components use 1 - (1-A)(1-B). Example: Two 99.9% components in series = 99.8%, in parallel = 99.9999%. Design for redundancy to improve composite availability.

Similar Technologies
Single Component, Reliability Block Diagram, Fault Tree Analysis, Availability Calculation, System Reliability
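The series and parallel rules above translate directly into code, using the worked example from the text:

```python
def series(*components):
    """All components must be up: availabilities multiply."""
    a = 1.0
    for c in components:
        a *= c
    return a

def parallel(*components):
    """The system is down only if every redundant component is down."""
    down = 1.0
    for c in components:
        down *= (1.0 - c)
    return 1.0 - down

# Worked example from the text: two 99.9% components.
print(series(0.999, 0.999))    # ~0.998001 -> 99.8%
print(parallel(0.999, 0.999))  # ~0.999999 -> 99.9999%
```

This is why a chain of dependencies (load balancer, app, database) is always less available than its weakest link, while redundant copies of the weak link multiply the nines.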
Availability vs Reliability

Availability is uptime percentage (system is operational). Reliability is probability of failure-free operation over time (MTBF). High availability systems can be unreliable if they fail frequently but recover quickly. Both metrics needed for complete picture.

Similar Technologies
Uptime Percentage, MTBF, Durability, Resilience, Fault Tolerance
Planned vs Unplanned Downtime

Planned downtime is scheduled maintenance (deployments, patches). Unplanned is failures and disasters. SLAs may exclude planned downtime. Use blue-green deployments and rolling updates to eliminate planned downtime and improve availability.

Similar Technologies
Total Downtime, Maintenance Windows, Zero-Downtime Deployments, Scheduled Outages, Emergency Maintenance

Load Balancer HA

Multi-AZ Load Balancers

Load balancer deployed across multiple availability zones. Automatically routes traffic away from failed AZs. Cloud load balancers (ALB, NLB, Azure LB, Cloud Load Balancing) are multi-AZ by default. Critical for application-level high availability.

Similar Technologies
Single AZ LB, Multi-Region LB, DNS Load Balancing, Global Load Balancer, Regional LB
Health Checks

Load balancer probes backend targets to verify they're healthy. Unhealthy targets removed from rotation. Configure appropriate check intervals, thresholds, and endpoints. Essential for automatic failover and preventing traffic to failed instances.

Similar Technologies
Heartbeat, Liveness Probe, Readiness Probe, TCP Check, HTTP Health Check
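The threshold logic behind these checks can be sketched as a small state machine. This mirrors the consecutive-success/consecutive-failure counters common in load balancers, but the class and parameter names here are illustrative, not any vendor's API.

```python
# A target flips state only after a run of consecutive probe results in
# the opposite direction, so a single blip does not flap the target in
# and out of rotation.

class TargetHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results contradicting current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0  # current state confirmed; reset the counter
            return self.healthy
        self._streak += 1
        limit = (self.unhealthy_threshold if self.healthy
                 else self.healthy_threshold)
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy

t = TargetHealth()
print(t.record(False))  # True  - one failed probe is not enough
print(t.record(False))  # True  - still below unhealthy_threshold=3
print(t.record(False))  # False - removed from rotation
print(t.record(True))   # False - needs 2 consecutive passes to return
print(t.record(True))   # True  - back in rotation
```

Tuning the two thresholds and the probe interval sets how fast failures are detected versus how tolerant the system is of transient errors.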
Connection Draining

When removing instance from load balancer, allow existing connections to complete before termination. Prevents disruption to active requests during deployments or scaling. Configurable timeout (e.g., 300 seconds). Also called deregistration delay.

Similar Technologies
Immediate Termination, Graceful Shutdown, Connection Timeout, Request Completion, Session Drain
Cross-Zone Load Balancing

Distribute traffic evenly across instances in all enabled AZs, not just within each AZ. Improves distribution but increases cross-AZ data transfer costs. Enabled by default in ALB, optional in NLB. Consider for uneven instance distribution.

Similar Technologies
Single Zone LB, AZ-Local Balancing, Zone-Aware Routing, Regional Distribution, Geo-Load Balancing
Global Load Balancing

Route traffic based on geography, latency, or health across multiple regions. Uses DNS-based (Route 53, Traffic Manager, Cloud DNS) or anycast routing. Provides DR failover and performance optimization. Critical for global applications.

Similar Technologies
Regional LB, DNS Failover, GeoDNS, Anycast Routing, CDN Load Balancing
Sticky Sessions vs Stateless

Sticky sessions route user to same backend (session affinity), required for stateful apps but reduces flexibility. Stateless apps allow any backend, enabling better distribution and simpler failover. Prefer stateless with external session store.

Similar Technologies
Session Replication, External Session Store, Client-Side Sessions, Round Robin, Least Connections
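The two routing styles contrasted above can be sketched side by side. The hash-based affinity here is one common way to implement stickiness (real load balancers often use cookies instead); backend names are illustrative.

```python
import hashlib
import itertools

backends = ["app-1", "app-2", "app-3"]

def sticky(session_id):
    """Session affinity: hash the session id so the same user always
    lands on the same backend (while the backend set is unchanged)."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return backends[digest[0] % len(backends)]

_rr = itertools.cycle(backends)

def round_robin():
    """Stateless apps accept any backend, so plain rotation works."""
    return next(_rr)

# Sticky: identical session id -> identical backend on every request.
print(sticky("user-42") == sticky("user-42"))  # True
# Round robin: requests spread across all backends.
print([round_robin() for _ in range(4)])       # ['app-1', 'app-2', 'app-3', 'app-1']
```

Note the failure-mode difference: if a backend dies, `round_robin` traffic redistributes trivially, while `sticky` users pinned to the dead backend lose their in-memory session unless it was externalized.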