Disaster Recovery
Core Concepts
RTO (Recovery Time Objective): Maximum acceptable length of time that a system can be down after a failure or disaster. RTO defines how quickly systems must be restored. For example, an RTO of 4 hours means systems must be back online within 4 hours of an outage.
RPO (Recovery Point Objective): Maximum acceptable amount of data loss, measured in time. RPO defines how much data you can afford to lose. For example, an RPO of 15 minutes means you can lose at most the last 15 minutes of data, requiring backups or replication at least every 15 minutes.
MTTR (Mean Time To Repair): Average time required to repair and restore a system to operational status after a failure. MTTR includes detection, diagnosis, repair, and verification time. Lower MTTR indicates better recovery processes and automation.
MTBF (Mean Time Between Failures): Average time between system failures, measuring system reliability. Higher MTBF indicates a more reliable system. Used to predict failure rates and plan maintenance windows. Combined with MTTR to calculate overall availability: availability = MTBF / (MTBF + MTTR).
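The availability relationship (MTBF over MTBF plus MTTR) can be sketched numerically; the MTBF and MTTR figures below are illustrative, not from any real system:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails on average every 1,000 hours and takes 4 hours to repair:
print(f"{availability(1000, 4):.4%}")  # 99.6016%
```

Note that improving MTTR (faster recovery) raises availability just as effectively as improving MTBF (fewer failures), which is why automation of detection and repair matters.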
Backup Strategies
Full Backup: Complete copy of all data at a specific point in time. Provides the fastest restore since all data is in one place, but takes the longest to create and requires the most storage. Typically performed weekly or monthly as the baseline for incremental or differential backups.
Incremental Backup: Backs up only data that has changed since the last backup (full or incremental). Fastest backup time and least storage, but slower to restore because multiple backup sets must be applied sequentially. Commonly used for daily backups.
Differential Backup: Backs up all data changed since the last full backup. Faster to restore than incremental (only the full plus the latest differential are needed), but slower to create and uses more storage than incremental. A good balance between backup and restore speed.
Continuous Data Protection (CDP): Real-time or near-real-time backup capturing every change to data as it occurs. Provides a very low RPO (seconds or less) and enables point-in-time recovery to any moment. Used for mission-critical databases and systems requiring minimal data loss.
Snapshot: Point-in-time copy of data using copy-on-write or redirect-on-write technology. Creates an instant backup without initially copying all data. Efficient for virtual machines and storage arrays; space-efficient but dependent on the source storage.
Cloud Backup: Offsite backup to cloud storage for geographic redundancy and disaster recovery. Provides scalability, high durability (e.g., eleven nines with S3), and pay-as-you-go pricing. Supports versioning, encryption, and compliance features. Common services include AWS S3, Azure Blob Storage, and Google Cloud Storage.
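The restore-chain difference between incremental and differential backups can be sketched as a toy model; the (day, type) tuples are hypothetical, and a single strategy per chain is assumed:

```python
def restore_chain(backups, target_day):
    """Return the backup sets needed, in order, to restore to target_day.

    `backups` is a chronological list of (day, type) tuples, where type is
    "full", "incremental", or "differential".
    """
    history = [b for b in backups if b[0] <= target_day]
    # Find the most recent full backup at or before the target day.
    full_idx = max(i for i, b in enumerate(history) if b[1] == "full")
    chain = [history[full_idx]]
    for b in history[full_idx + 1:]:
        if b[1] == "incremental":
            chain.append(b)                 # every incremental since the full
        elif b[1] == "differential":
            chain = [history[full_idx], b]  # only full + latest differential
    return chain

weekly = [(0, "full"), (1, "incremental"), (2, "incremental"), (3, "incremental")]
print(restore_chain(weekly, 3))  # full plus all three incrementals
```

The same scenario with differentials would need only two sets (the full and the latest differential), which is the restore-speed trade-off described above.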
Backup Rotation Schemes
3-2-1 Rule: Best practice requiring 3 copies of data, on 2 different media types, with 1 copy offsite. For example: production data plus a local backup plus a cloud backup, on different storage technologies. Protects against hardware failure, site disasters, and data corruption.
Grandfather-Father-Son (GFS): Hierarchical rotation with daily (son), weekly (father), and monthly (grandfather) backups retained for different periods. For example: 7 daily, 4 weekly, 12 monthly. Balances retention requirements against storage costs while meeting compliance needs.
Tower of Hanoi: Mathematical rotation scheme using powers of 2 for backup scheduling. Provides exponential retention depth with efficient media usage. More complex than GFS but optimizes tape/media rotation. Rarely used in modern cloud environments but efficient for tape systems.
FIFO (First In, First Out): Simple rotation where the oldest backup is overwritten first. Easy to understand and implement but doesn't provide varied retention periods. Best for simple scenarios with fixed retention requirements, such as 30-day rolling backups.
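A GFS retention policy like the 7/4/12 example above can be sketched as a pruning function. Assumptions in this sketch: one backup per day, ISO weeks define "father" copies, and calendar months define "grandfather" copies:

```python
from datetime import date, timedelta

def gfs_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates a GFS policy would retain."""
    keep = {d for d in backup_dates if (today - d).days < daily}  # sons
    by_week, by_month = {}, {}
    for d in sorted(backup_dates):
        by_week[d.isocalendar()[:2]] = d   # latest backup in each ISO week
        by_month[(d.year, d.month)] = d    # latest backup in each month
    keep.update(sorted(by_week.values())[-weekly:])    # fathers
    keep.update(sorted(by_month.values())[-monthly:])  # grandfathers
    return keep

# 90 consecutive daily backups, pruned on the last day:
days = [date(2024, 1, 1) + timedelta(n) for n in range(90)]
kept = gfs_keep(days, today=date(2024, 3, 30))
print(len(kept))  # 11 of 90 backups retained
```

The storage saving is the point of the scheme: a rolling window of 90 daily backups collapses to roughly a dozen retained copies while still covering daily, weekly, and monthly restore points.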
DR Site Types
Hot Site: Fully operational duplicate environment with real-time or near-real-time data replication. Can take over immediately (RTO of minutes), providing the highest availability. Most expensive option, with duplicate infrastructure, active database replication, and continuous synchronization.
Warm Site: Partially configured environment with infrastructure ready but not fully synchronized. Systems are running but may need data restoration and configuration updates (RTO of hours). A balance between cost and recovery speed, common for non-critical systems.
Cold Site: Facility with power, cooling, and network connectivity but no pre-installed hardware or data. Requires hardware procurement, installation, and a full data restore (RTO of days to weeks). Lowest-cost option, suitable for non-time-critical systems and data archives.
Pilot Light: Minimal version of the environment with core systems running (such as database replicas) but application servers off. Can scale up quickly when needed (RTO of tens of minutes to hours). Cost-effective approach on AWS/Azure/GCP, using minimal compute until a disaster occurs.
Backup and Restore: Most basic DR strategy, relying on regular backups to cloud or offsite storage. Infrastructure is provisioned only during a disaster. Longest RTO (hours to days) but lowest cost. Suitable for non-critical systems with relaxed recovery requirements.
Multi-Site Active/Active: Fully redundant production environments in multiple geographic regions serving traffic simultaneously. Near-zero RTO with automatic failover. Highest cost, but provides both DR and performance benefits through geographic load distribution and lower latency.
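Choosing among these tiers is essentially a trade of cost against RTO. A hypothetical decision helper, where the worst-case RTO bounds are rough illustrative figures rather than guarantees:

```python
def cheapest_tier(required_rto_minutes):
    """Pick the cheapest DR tier whose assumed worst-case RTO meets the requirement."""
    tiers = [  # ordered cheapest first; (name, assumed worst-case RTO in minutes)
        ("backup and restore", 72 * 60),   # hours to days
        ("pilot light", 4 * 60),           # tens of minutes to hours
        ("warm site", 30),                 # minutes
        ("hot site / active-active", 1),   # near-zero
    ]
    for name, worst_rto in tiers:
        if worst_rto <= required_rto_minutes:
            return name
    return tiers[-1][0]  # only the most expensive tier can meet a near-zero RTO

print(cheapest_tier(8 * 60))  # pilot light
```

Real selection also depends on RPO, compliance, and operational maturity; this sketch captures only the cost-versus-RTO axis.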
DR Testing & Procedures
Tabletop Exercise: Discussion-based DR test in which the team walks through recovery procedures without an actual failover. Low-risk way to identify gaps in documentation, clarify roles, and improve processes. Typically done quarterly; focuses on communication and decision-making.
Simulation Test: Practice of recovery procedures in a test environment mimicking production. Tests backup restoration, system recovery, and team coordination without impacting production. Validates RTOs and RPOs. More realistic than a tabletop exercise but less risky than a full test.
Parallel Test: Recovery systems are brought up and run alongside production without a cutover. Validates backup integrity, recovery procedures, and system functionality without production impact. Tests RTO achievement and application functionality in the DR environment.
Full Interruption Test: Complete failover to the DR site with an actual production cutover, usually during a maintenance window. The most realistic test, proving DR capability end to end. High-risk, but validates actual recovery including network switching, DNS changes, and user access. Typically done annually.
Failover Procedure: Documented process for switching from the primary to the secondary site during a disaster. Includes trigger conditions, an authorization workflow, technical steps (DNS changes, load balancer updates, database promotion), validation checkpoints, and communication protocols.
Failback Procedure: Process for returning to the primary site after disaster recovery. Often more complex than failover, requiring data synchronization from DR back to primary, reverse replication, testing, and a scheduled cutback. Must avoid data loss during the transition.
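A failover procedure like the one described above can be modeled as an ordered runbook with a validation checkpoint after each step. This is a structural sketch only: the step names are examples and the no-op lambdas stand in for real cloud API calls:

```python
def run_failover(steps):
    """Run (name, action, check) steps in order; stop at the first failed checkpoint."""
    completed = []
    for name, action, check in steps:
        action()                      # perform the technical step
        if not check():               # validation checkpoint before continuing
            return completed, f"checkpoint failed: {name}"
        completed.append(name)
    return completed, "failover complete"

runbook = [
    ("promote DR database replica",  lambda: None, lambda: True),
    ("update load balancer targets", lambda: None, lambda: True),
    ("switch DNS to DR site",        lambda: None, lambda: True),
    ("validate user-facing health",  lambda: None, lambda: True),
]
print(run_failover(runbook)[1])  # failover complete
```

Stopping at the first failed checkpoint, rather than charging ahead, is the key design point: a half-applied failover (DNS switched but database not promoted) is worse than a cleanly aborted one.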
Multi-Region DR Patterns
Backup and Restore: Primary region serves all traffic, with regular backups to a secondary region. During a disaster, restore from backup and redirect traffic. Longest RTO but lowest cost; good for non-critical systems. RPO depends on backup frequency.
Pilot Light: Core infrastructure (e.g., databases with replication) runs in the secondary region with application servers stopped. During a disaster, launch the servers and redirect traffic. Medium RTO (minutes to hours) and medium cost. A common pattern for cost-effective DR.
Warm Standby: Fully functional secondary region running at reduced capacity. During a disaster, scale up and redirect traffic. Lower RTO (minutes) than pilot light. Can also serve read-only queries to offset its cost. Balances cost and recovery speed.
Active-Active: Both regions serve production traffic simultaneously with geographic load balancing (Route 53, Traffic Manager). Near-zero RTO with automatic failover. Highest cost, but also provides performance benefits. Requires data synchronization and conflict resolution.
Cross-Region Replication: Asynchronous or synchronous replication between regions. Asynchronous is cheaper and adds no write latency but allows some data loss; synchronous guarantees zero data loss but adds write latency. Choose based on RPO requirements. Examples: RDS cross-region read replicas, Aurora Global Database, Cosmos DB global replication.
DNS Failover: Health checks monitor the primary region and automatically update DNS records to point to the DR region during a failure. Simple, but DNS propagation causes delay. Use Route 53 health checks, Traffic Manager, or Cloud DNS with low TTL values for faster switchover.
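The DNS failover delay mentioned above can be estimated as detection time plus client cache time. A back-of-the-envelope model, with illustrative health-check and TTL figures:

```python
def worst_case_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Worst-case switchover: failures must accumulate, then client DNS caches must expire."""
    detection = check_interval_s * failure_threshold  # time to declare primary unhealthy
    return detection + ttl_s                          # plus TTL before clients re-resolve

# 30-second checks, 3 consecutive failures to trip, 60-second TTL:
print(worst_case_failover_seconds(30, 3, 60))  # 150
```

This is why low TTLs matter for DNS failover: with a 1-hour TTL the same health-check configuration would leave some clients on the dead primary for over an hour.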
