High Availability
Architecture Types
Active-Active: All nodes actively serve traffic simultaneously, with load distributed across them. Provides maximum utilization and performance. If one node fails, the others continue serving traffic at reduced capacity. Requires session management and data synchronization between nodes.
Active-Passive: One node serves traffic (active) while others remain on standby (passive). Passive nodes take over only when the active node fails. Simple to implement with clear failover logic. Standby resources are underutilized but provide a clear recovery path.
Stateless Applications: Applications don't store session data locally, enabling any instance to serve any request. Facilitates horizontal scaling and simple failover. Session state is stored in an external cache (e.g., Redis) or database. A critical pattern for cloud-native HA.
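The stateless pattern can be sketched as follows. This is a minimal illustration, not production code: a plain dict stands in for the external store (Redis in practice), and all class and function names are invented for the example.

```python
# Sketch of stateless request handling: session state lives in an external
# store (Redis in production; a plain dict stands in here), so any app
# instance can serve any request. All names below are illustrative.

class ExternalSessionStore:
    """Stand-in for a shared key-value store such as Redis."""
    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, state):
        self._data[session_id] = state

def handle_request(store, session_id, item):
    """Any instance can run this: it reads and writes state externally,
    keeping the application process itself stateless."""
    state = store.get(session_id)
    cart = state.get("cart", [])
    cart.append(item)
    store.put(session_id, {"cart": cart})
    return cart

store = ExternalSessionStore()
handle_request(store, "s1", "book")        # served by "instance A"
cart = handle_request(store, "s1", "pen")  # "instance B" sees the same state
```

Because no instance holds the cart in memory, any instance can fail between the two requests without losing the session.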
Stateful Applications: Applications maintain session state, requiring session replication or sticky sessions. More complex HA, with session affinity to specific nodes. Uses session clustering or distributed caches. Common with legacy applications or applications requiring in-memory state.
Redundancy Models
N+1 Redundancy: The system has the N components needed to handle load, plus 1 spare. If one fails, the spare takes over with no impact. A cost-effective balance between resilience and cost. Example: 3 servers handling load that 2 could handle, so 1 failure is tolerated.
N+M Redundancy: The system has the N components needed plus M spares, tolerating up to M failures. More resilient than N+1 for critical systems. Example: 5 servers handling load that 3 could manage, tolerating 2 failures. Higher cost but better fault tolerance.
2N Redundancy: Complete duplication of all components - twice what's needed for normal operation. Highest availability, supporting full active-active operation. Expensive but eliminates single points of failure. All components are actively used, and full capacity remains even after losing an entire set.
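The spare-capacity arithmetic behind N+1, N+M, and 2N can be checked with a one-line helper (an illustrative function, not a standard formula):

```python
# Failures tolerated with no load impact, given N components needed to
# carry the load and the number actually deployed. Illustrative helper.

def failures_tolerated(needed, deployed):
    """Spare capacity = components that can fail without losing capacity."""
    return max(deployed - needed, 0)

failures_tolerated(2, 3)  # N+1: need 2, deploy 3 -> tolerates 1 failure
failures_tolerated(3, 5)  # N+M: need 3, deploy 5 -> tolerates 2 failures
failures_tolerated(3, 6)  # 2N:  need 3, deploy 6 -> tolerates 3 failures
```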
Geographic Redundancy: Resources distributed across multiple geographic locations (availability zones or regions). Protects against datacenter or regional failures. Critical for disaster recovery and global services. Adds complexity through data synchronization and latency considerations.
Database HA Patterns
Master-Slave (Primary-Replica) Replication: One master database handles writes; slaves replicate its data and handle reads. A simple pattern for scaling read capacity. Asynchronous replication may lag behind the master. If the master fails, a slave is promoted to master. Common with MySQL, PostgreSQL, and MongoDB.
Multi-Master Replication: Multiple master databases accept writes and replicate changes bidirectionally. Provides write scalability and HA, but complex conflict resolution is needed and careful application design is required. Used in multi-region active-active architectures.
Database Clustering: Multiple database nodes act as a single system using shared storage or distributed consensus. Automatic failover, typically with no data loss. Examples: Oracle RAC, SQL Server AlwaysOn, PostgreSQL with Patroni, MySQL Group Replication. Higher availability than simple replication.
Read Replicas: Read-only copies of the database that scale read queries. Asynchronous replication from the master gives eventual consistency. Reduces master load for reporting and analytics. Available in RDS, Aurora, and Cloud SQL. Can be cross-region for DR.
Distributed Databases: The database is spread across nodes with built-in replication and partitioning. Cassandra, DynamoDB, and CockroachDB offer tunable consistency and availability. No single master; a peer-to-peer architecture. Scales horizontally with automatic data distribution.
Synchronous vs Asynchronous Replication: Synchronous replication waits for replicas to acknowledge before commit (zero data loss but higher latency). Asynchronous doesn't wait (lower latency but potential data loss on failover). Choose based on RPO: synchronous for zero RPO, asynchronous for performance. Configurable in most databases.
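As a toy illustration of the trade-off (not real database code - replicas are modelled as in-memory lists and all names are invented), the two commit paths differ only in whether the primary waits for replica acknowledgement:

```python
# Toy sketch: a synchronous commit confirms replication before
# acknowledging the client (zero RPO, higher latency); an asynchronous
# commit acknowledges immediately (lower latency, possible data loss if
# the primary crashes before replication catches up). Illustrative only.

def commit(primary_log, replicas, record, synchronous):
    primary_log.append(record)
    if synchronous:
        # Wait for every replica to confirm before acknowledging.
        for replica in replicas:
            replica.append(record)
        return "committed-durably"
    # Async: replication happens later; a crash now loses the record.
    return "committed-locally"
```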
Cluster Configurations
Quorum: The cluster requires a majority (quorum) of nodes to be operational. Prevents split-brain scenarios where the cluster splits into multiple independent groups. Common in etcd, Consul, and ZooKeeper. An odd number of nodes (3, 5, 7) is recommended for a clear majority.
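The majority rule is simple arithmetic - a strict majority of n nodes is floor(n/2) + 1 - and it shows why odd cluster sizes are preferred: during any network split, at most one side can hold a majority.

```python
# Quorum size for an n-node cluster: a strict majority, floor(n/2) + 1.

def quorum_size(n):
    return n // 2 + 1

def has_quorum(alive, total):
    return alive >= quorum_size(total)

# 5-node cluster: quorum is 3. In a 2/3 network split, only the
# 3-node side can make progress, so split-brain is impossible.
quorum_size(3)    # 2
quorum_size(5)    # 3
has_quorum(2, 5)  # False - the minority side stops serving writes
```

Note that going from 3 to 4 nodes raises quorum from 2 to 3 without tolerating any extra failures, which is why 4-node clusters are rarely worth the cost.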
Split-Brain Prevention: Mechanisms preventing the cluster from splitting into independent segments that each think they're primary. Uses quorum, fencing, STONITH (Shoot The Other Node In The Head), or witness nodes. Critical for data integrity in HA clusters.
Heartbeats: Nodes send periodic heartbeat messages proving they're alive. Missed heartbeats trigger failover. Configure the timeout carefully - too short causes false positives, too long delays failover. Use multiple network paths for heartbeat redundancy.
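A minimal heartbeat-based failure detector looks like this (a sketch with invented names; timestamps are passed in explicitly rather than read from a clock so the timeout logic is easy to test):

```python
# A node is suspected failed after `timeout` seconds without a heartbeat.
# Real systems add redundancy (multiple network paths) and often use
# adaptive timeouts; this shows only the core check. Illustrative names.

class FailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        last = self.last_seen.get(node)
        return last is not None and (now - last) <= self.timeout

fd = FailureDetector(timeout=3.0)
fd.heartbeat("node-a", now=10.0)
fd.is_alive("node-a", now=12.0)  # True: within the timeout window
fd.is_alive("node-a", now=14.0)  # False: missed heartbeats -> failover
```

The single `timeout` parameter is exactly the tuning knob described above: lowering it speeds up failover but makes a transient network blip look like a dead node.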
Automatic Failover: The system automatically detects failures and promotes a standby to active without human intervention. Reduces downtime but requires careful testing to avoid false positives. Includes health checks, decision logic, and a promotion process. Common in RDS, load balancers, and Kubernetes.
Shared Storage Clustering: Multiple nodes access shared storage (SAN, NAS). The active node has exclusive access; a standby takes over the storage during failure. Simpler than distributed systems, but the shared storage is a potential SPOF. The traditional enterprise HA approach.
Shared-Nothing Architecture: Each node has independent storage and resources with no shared components. Data is replicated between nodes. More scalable than shared storage, with no single point of failure. The modern cloud-native approach used by distributed databases.
SLA & Availability Calculations
99.9% ("Three Nines"): 43.8 minutes of downtime per month (8.76 hours per year). The standard SLA for many cloud services. Achievable with a single region, basic redundancy, and well-tested failover. Suitable for non-critical applications that can tolerate brief outages.
99.99% ("Four Nines"): 4.38 minutes of downtime per month (52.6 minutes per year). Requires multi-AZ deployment, automated failover, and robust monitoring. A common target for business-critical applications. Significantly more expensive than 99.9% due to redundancy requirements.
99.999% ("Five Nines"): 26.3 seconds of downtime per month (5.26 minutes per year). Extremely high availability requiring multi-region active-active, automatic failover, and extensive testing. Very expensive to achieve and maintain. Reserved for mission-critical systems.
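The downtime figures above all come from one conversion - the fraction of a 525,600-minute year that the SLA allows to be unavailable:

```python
# Allowed downtime per year for a given availability target.
# 365 days * 24 h * 60 min = 525,600 minutes per (non-leap) year.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_minutes_per_year(availability_pct):
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

downtime_minutes_per_year(99.9)    # ~525.6 min  (~8.76 hours)
downtime_minutes_per_year(99.99)   # ~52.6 min
downtime_minutes_per_year(99.999)  # ~5.26 min
```

Divide the yearly figure by 12 for the per-month numbers quoted above (e.g., 525.6 / 12 = 43.8 minutes for 99.9%).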
Composite Availability: Overall system availability calculated from component availabilities. Serial components multiply (A × B); parallel components use 1 - (1-A)(1-B). Example: two 99.9% components in series give 99.8%; in parallel, 99.9999%. Design for redundancy to improve composite availability.
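The serial and parallel formulas can be written directly as code and checked against the example figures:

```python
# Composite availability: serial components multiply; parallel
# (redundant) components fail only when every copy fails, so
# availability is 1 minus the product of the failure probabilities.

def serial(*availabilities):
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    failure = 1.0
    for a in availabilities:
        failure *= (1 - a)
    return 1 - failure

serial(0.999, 0.999)    # ~0.998    - two 99.9% components in series
parallel(0.999, 0.999)  # ~0.999999 - the same pair in parallel
```

The asymmetry is the whole argument for redundancy: chaining components erodes availability, while duplicating them multiplies the nines.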
Availability vs Reliability: Availability is the uptime percentage (the system is operational). Reliability is the probability of failure-free operation over time (measured by MTBF). A highly available system can still be unreliable if it fails frequently but recovers quickly. Both metrics are needed for a complete picture.
Planned vs Unplanned Downtime: Planned downtime is scheduled maintenance (deployments, patches); unplanned downtime covers failures and disasters. SLAs may exclude planned downtime. Use blue-green deployments and rolling updates to eliminate planned downtime and improve availability.
Load Balancer HA
Multi-AZ Deployment: The load balancer is deployed across multiple availability zones and automatically routes traffic away from failed AZs. Cloud load balancers (ALB, NLB, Azure Load Balancer, Cloud Load Balancing) are multi-AZ by default. Critical for application-level high availability.
Health Checks: The load balancer probes backend targets to verify they're healthy; unhealthy targets are removed from rotation. Configure appropriate check intervals, thresholds, and endpoints. Essential for automatic failover and for preventing traffic from reaching failed instances.
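The threshold behaviour can be sketched as a small state machine. The parameter names mirror common cloud load balancer settings, but the class and values here are illustrative, not any vendor's actual defaults:

```python
# Health-check state machine: a target is marked unhealthy after
# `unhealthy_threshold` consecutive failed probes, and healthy again
# after `healthy_threshold` consecutive successes. Thresholds exist so
# a single dropped probe doesn't pull a healthy target from rotation.

class TargetHealth:
    def __init__(self, healthy_threshold=2, unhealthy_threshold=3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self.streak = 0  # consecutive probes contradicting current state

    def record(self, probe_ok):
        """Feed in one probe result; returns the current health state."""
        if probe_ok == self.healthy:
            self.streak = 0  # probe agrees with current state
            return self.healthy
        self.streak += 1
        threshold = (self.unhealthy_threshold if self.healthy
                     else self.healthy_threshold)
        if self.streak >= threshold:
            self.healthy = not self.healthy  # flip state
            self.streak = 0
        return self.healthy
```

With these values, one failed probe is ignored, three in a row remove the target, and two consecutive successes restore it.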
Connection Draining: When removing an instance from the load balancer, allow existing connections to complete before termination. Prevents disruption to active requests during deployments or scaling. Configurable timeout (e.g., 300 seconds). Also called deregistration delay.
Cross-Zone Load Balancing: Distributes traffic evenly across instances in all enabled AZs, not just within each AZ. Improves distribution but increases cross-AZ data transfer costs. Enabled by default in ALB, optional in NLB. Consider it when instances are unevenly distributed across zones.
Global Load Balancing: Routes traffic based on geography, latency, or health across multiple regions. Uses DNS-based services (Route 53, Traffic Manager, Cloud DNS) or anycast routing. Provides DR failover and performance optimization. Critical for global applications.
Sticky Sessions vs Stateless: Sticky sessions route a user to the same backend (session affinity); required for stateful apps but reduces routing flexibility. Stateless apps allow any backend, enabling better distribution and simpler failover. Prefer stateless designs with an external session store.
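The two routing strategies can be contrasted in a few lines. This is a sketch with invented backend names; real load balancers implement affinity via cookies or connection tuples rather than a bare hash:

```python
# Sticky routing pins a session to one backend via a stable hash of its
# id; stateless routing can pick any backend (round-robin here).

import hashlib

BACKENDS = ["app-1", "app-2", "app-3"]

def sticky_backend(session_id, backends=BACKENDS):
    """Same session id always maps to the same backend (affinity)."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

def stateless_backend(request_counter, backends=BACKENDS):
    """Any backend works, so plain round-robin spreads load evenly."""
    return backends[request_counter % len(backends)]
```

The failover difference follows directly: if a sticky backend dies, its sessions are lost or must be rebuilt, whereas under stateless routing the next request simply lands elsewhere.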
