Disaster Recovery

Core Concepts

RTO (Recovery Time Objective)

Maximum acceptable length of time that a system can be down after a failure or disaster. RTO defines how quickly systems must be restored. For example, an RTO of 4 hours means systems must be back online within 4 hours of an outage.

Similar Technologies
MTTR (Mean Time To Recover), Downtime Window, Service Level Agreement, Recovery Window, Maximum Tolerable Downtime

RPO (Recovery Point Objective)

Maximum acceptable amount of data loss, measured in time. RPO defines how much data you can afford to lose. For example, an RPO of 15 minutes means you can lose at most 15 minutes of data, which requires backups or replication at least every 15 minutes.
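
As a minimal sketch of the 15-minute example, the check below measures the gap between the latest backup and a failure and compares it with the RPO. The function names and the schedule are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timedelta

def data_loss_window(backup_times: list[datetime], failure_time: datetime) -> timedelta:
    """Return the time between the most recent backup and the failure."""
    completed = [t for t in backup_times if t <= failure_time]
    if not completed:
        raise ValueError("no backup completed before the failure")
    return failure_time - max(completed)

def meets_rpo(backup_times: list[datetime], failure_time: datetime, rpo: timedelta) -> bool:
    """True when the data lost by restoring the latest backup fits within the RPO."""
    return data_loss_window(backup_times, failure_time) <= rpo

# Hypothetical schedule: a backup every 15 minutes starting at 12:00.
start = datetime(2024, 1, 1, 12, 0)
backups = [start + timedelta(minutes=15 * i) for i in range(4)]
failure = datetime(2024, 1, 1, 12, 44)  # one minute before the 12:45 backup

print(data_loss_window(backups, failure))  # 0:14:00
```

With this schedule the worst case stays just under 15 minutes, so a 15-minute RPO is met while a 10-minute RPO is not.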

Similar Technologies
Backup Frequency, Data Loss Tolerance, Snapshot Interval, Replication Lag, Point-in-Time Recovery

MTTR (Mean Time To Recover)

Average time required to repair and restore a system to operational status after a failure. MTTR includes detection time, diagnosis time, repair time, and verification time. Lower MTTR indicates better recovery processes and automation.
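
The four components listed above can be summed per incident and averaged. This is an illustrative sketch; the Incident record and its field names are assumptions, and the sample durations are made up.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Incident:
    # All durations are in minutes.
    detection: float
    diagnosis: float
    repair: float
    verification: float

    @property
    def time_to_recover(self) -> float:
        """Total time from failure to verified recovery for one incident."""
        return self.detection + self.diagnosis + self.repair + self.verification

def mttr(incidents: list[Incident]) -> float:
    """Mean time to recover across all recorded incidents."""
    return mean(i.time_to_recover for i in incidents)

# Two made-up incidents: 60 minutes and 120 minutes end to end.
incidents = [Incident(5, 20, 30, 5), Incident(10, 40, 60, 10)]
print(mttr(incidents))  # 90
```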

Similar Technologies
RTO, Recovery Time, Repair Time, Restoration Time, Downtime Duration

MTBF (Mean Time Between Failures)

Average time between system failures, measuring system reliability. Higher MTBF indicates more reliable systems. Used to predict failure rates and plan maintenance windows. Combined with MTTR to calculate overall availability.
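
The usual way to combine the two metrics is availability = MTBF / (MTBF + MTTR). The sketch below applies that formula; the function names and sample numbers are hypothetical.

```python
def mtbf(operating_hours: float, failure_count: int) -> float:
    """Mean time between failures: operating time divided by number of failures."""
    return operating_hours / failure_count

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per 365-day year (8760 hours) at a given availability."""
    return (1.0 - avail) * 8760

# Made-up figures: 8 failures over 7992 operating hours, 1 hour to recover each.
m = mtbf(7992, 8)            # 999.0 hours
a = availability(m, 1.0)     # 0.999, i.e. "three nines"
print(round(annual_downtime_hours(a), 2))  # 8.76
```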

Similar Technologies
Uptime Percentage, Failure Rate, Reliability Metric, Availability Percentage, System Reliability

Backup Strategies

Full Backup

Complete copy of all data at a specific point in time. Provides fastest restore time since all data is in one location, but takes longest to create and requires most storage space. Typically performed weekly or monthly as baseline for incremental backups.

Similar Technologies
Complete System Image, Snapshot Backup, Clone Backup, Archive Backup, Mirror Backup

Incremental Backup

Only backs up data that has changed since the last backup (full or incremental). Fastest backup time and uses least storage, but slower to restore as multiple backup sets must be applied sequentially. Commonly used for daily backups.
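
To illustrate why restores are slower, here is a toy restore that applies a full backup and then each incremental set in order. Modeling a backup set as a dictionary of path to content, with None marking a deleted file, is an assumption made for this sketch.

```python
from typing import Optional

def restore(full: dict[str, str], incrementals: list[dict[str, Optional[str]]]) -> dict[str, str]:
    """Rebuild the latest state from a full backup plus incrementals, oldest first."""
    state = dict(full)
    for backup_set in incrementals:  # order matters: later sets override earlier ones
        for path, content in backup_set.items():
            if content is None:
                state.pop(path, None)  # the file was deleted since the previous backup
            else:
                state[path] = content
    return state

full = {"a.txt": "v1", "b.txt": "v1"}
monday = {"a.txt": "v2"}                    # only a.txt changed
tuesday = {"b.txt": None, "c.txt": "v1"}    # b.txt deleted, c.txt created

restored = restore(full, [monday, tuesday])
print(restored)  # {'a.txt': 'v2', 'c.txt': 'v1'}
```

Every set in the chain is needed, so one missing or corrupt incremental breaks the restore of everything after it.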

Similar Technologies
Differential Backup, Changed Block Tracking, Delta Backup, Synthetic Full, Forever Incremental

Differential Backup

Backs up all data changed since the last full backup. Faster to restore than incremental (only need full + latest differential), but slower to create and uses more storage than incremental. Good balance between backup and restore speed.
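
The difference in restore effort can be sketched by listing which backup sets a restore needs. The function below is illustrative and assumes a history that uses one scheme after each full backup (all incremental or all differential), not a mix.

```python
def restore_chain(history: list[tuple[str, str]]) -> list[str]:
    """Return the labels of the backup sets needed to restore the latest state.

    history holds (label, kind) pairs, oldest first, where kind is
    "full", "incremental", or "differential".
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("history contains no full backup")
    last_full = full_positions[-1]
    later = history[last_full + 1:]
    differentials = [label for label, kind in later if kind == "differential"]
    if differentials:
        # Full backup plus only the most recent differential.
        return [history[last_full][0], differentials[-1]]
    # Full backup plus every incremental taken since.
    return [history[last_full][0]] + [label for label, kind in later if kind == "incremental"]

incremental_week = [("sun", "full"), ("mon", "incremental"), ("tue", "incremental"), ("wed", "incremental")]
differential_week = [("sun", "full"), ("mon", "differential"), ("tue", "differential"), ("wed", "differential")]

print(restore_chain(incremental_week))   # ['sun', 'mon', 'tue', 'wed']
print(restore_chain(differential_week))  # ['sun', 'wed']
```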

Similar Technologies
Incremental Backup, Cumulative Incremental, Modified Files Only, Changed Data Backup, Partial Backup

Continuous Data Protection

Real-time or near-real-time backup capturing every change to data as it occurs. Provides very low RPO (seconds or less) and enables point-in-time recovery to any moment. Used for mission-critical databases and systems requiring minimal data loss.
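
Point-in-time recovery from a change journal can be sketched as replaying entries up to the chosen moment. The journal format here (timestamp, key, value) is a simplifying assumption, not the format of any particular product.

```python
def recover_to(journal: list[tuple[float, str, str]], target_time: float) -> dict[str, str]:
    """Replay journal entries with timestamp <= target_time.

    journal is assumed to be ordered by timestamp, oldest first.
    """
    state: dict[str, str] = {}
    for timestamp, key, value in journal:
        if timestamp > target_time:
            break  # everything after this point happened later than the target
        state[key] = value
    return state

# Made-up journal: the write at t=3.5 is the mistake we want to undo.
journal = [
    (1.0, "balance", "100"),
    (2.0, "balance", "80"),
    (3.5, "balance", "0"),
]

print(recover_to(journal, 3.0))  # {'balance': '80'}
```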

Similar Technologies
Real-time Replication, Journal-based Backup, Synchronous Replication, Change Data Capture, Transaction Log Shipping

Snapshot Backup

Point-in-time copy of data using copy-on-write or redirect-on-write technology. Creates instant backup without copying all data initially. Efficient for virtual machines and storage arrays. Space-efficient but dependent on source storage.
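
A toy copy-on-write volume shows why a snapshot is instant and why it depends on the source. This is a simplified in-memory model with hypothetical names; real storage systems work at the block-device or filesystem level.

```python
class Volume:
    """In-memory block volume with copy-on-write snapshots."""

    def __init__(self, blocks: list[bytes]):
        self.blocks = list(blocks)
        self.snapshots: list[dict[int, bytes]] = []

    def snapshot(self) -> int:
        """Create a snapshot instantly: no block is copied yet."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, index: int, data: bytes) -> None:
        """Save the old block into each snapshot that lacks it, then overwrite."""
        for saved in self.snapshots:
            saved.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def read_snapshot(self, snap_id: int, index: int) -> bytes:
        """Unchanged blocks are read from the live volume, changed ones from the snapshot."""
        return self.snapshots[snap_id].get(index, self.blocks[index])

vol = Volume([b"A", b"B"])
snap = vol.snapshot()
vol.write(0, b"X")

print(vol.read_snapshot(snap, 0))  # b'A' (preserved copy)
print(vol.read_snapshot(snap, 1))  # b'B' (still read from the live volume)
```

Only the one overwritten block consumed extra space, and the unchanged block exists only on the source, which is why a snapshot alone does not protect against loss of the source storage.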

Similar Technologies
Storage Snapshot, VM Snapshot, Volume Shadow Copy, LVM Snapshot, ZFS Snapshot

Cloud Backup

Offsite backup to cloud storage for geographic redundancy and disaster recovery. Provides scalability, durability (Amazon S3 is designed for 99.999999999%, or eleven nines), and pay-as-you-go pricing. Supports versioning, encryption, and compliance. Common with AWS S3, Azure Blob Storage, and Google Cloud Storage (GCS).

Similar Technologies
Offsite Tape, Remote Datacenter, Backup as a Service, Cloud-to-Cloud Backup, Hybrid Cloud Backup

Backup Rotation Schemes

3-2-1 Backup Rule

Best practice requiring 3 copies of data, on 2 different media types, with 1 copy offsite. For example: production data + local backup + cloud backup on different storage technologies. Protects against hardware failure, site disasters, and data corruption.
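
The rule is easy to check mechanically. The sketch below models each copy with a hypothetical DataCopy record and tests the three conditions; the media labels are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCopy:
    name: str
    media: str      # e.g. "ssd", "nas", "object-storage", "tape"
    offsite: bool

def satisfies_3_2_1(copies: list[DataCopy]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite

copies = [
    DataCopy("production", "ssd", offsite=False),
    DataCopy("local backup", "nas", offsite=False),
    DataCopy("cloud backup", "object-storage", offsite=True),
]

print(satisfies_3_2_1(copies))      # True
print(satisfies_3_2_1(copies[:2]))  # False: only 2 copies, none offsite
```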

Similar Technologies
3-2-1-1-0 Rule, 4-3-2 Rule, Multi-site Backup, Geographic Distribution, Cloud-first Strategy

Grandfather-Father-Son (GFS)

Hierarchical rotation with daily (son), weekly (father), and monthly (grandfather) backups retained for different periods. For example: 7 daily, 4 weekly, 12 monthly. Balances retention requirements with storage costs while meeting compliance needs.
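
A retention pass for the 7 daily, 4 weekly, 12 monthly example could look like the sketch below. Treating the Sunday backup as the weekly one and the backup from the 1st of the month as the monthly one is an assumption; real tools let you configure this.

```python
from datetime import date, timedelta

def gfs_keep(backup_dates: list[date], daily: int = 7, weekly: int = 4, monthly: int = 12) -> set[date]:
    """Return the backup dates to retain under a Grandfather-Father-Son policy."""
    newest_first = sorted(set(backup_dates), reverse=True)
    sons = newest_first[:daily]
    # Assumption: the weekly (father) backup is the one taken on Sunday.
    fathers = [d for d in newest_first if d.weekday() == 6][:weekly]
    # Assumption: the monthly (grandfather) backup is the one taken on the 1st.
    grandfathers = [d for d in newest_first if d.day == 1][:monthly]
    return set(sons) | set(fathers) | set(grandfathers)

# 400 consecutive daily backups ending on 2024-12-31.
all_backups = [date(2024, 12, 31) - timedelta(days=i) for i in range(400)]
keep = gfs_keep(all_backups)

print(len(all_backups), "->", len(keep))  # 400 -> 22
```

In this run one of the four Sundays already falls inside the daily window, so 7 + 3 + 12 = 22 backups are retained out of 400.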

Similar Technologies
Tower of Hanoi, Simple Rotation, Custom Retention, Tiered Backup, Lifecycle Policies

Tower of Hanoi

Mathematical rotation scheme using powers of 2 for backup scheduling. Provides exponential retention with efficient media usage. More complex than GFS but optimizes tape/media rotation. Rarely used in modern cloud environments but efficient for tape systems.
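
One common formulation picks the media set for day n from the position of the lowest set bit of n, so set A is reused every 2 days, set B every 4, set C every 8, and so on. The sketch below assumes that formulation, with the last set absorbing all higher positions.

```python
def hanoi_media(day: int, num_sets: int) -> int:
    """Return the 0-based media set index to use on a given day (day starts at 1)."""
    if day < 1 or num_sets < 1:
        raise ValueError("day and num_sets must be positive")
    lowest_set_bit = day & -day                    # isolates the lowest 1 bit
    position = lowest_set_bit.bit_length() - 1     # number of trailing zero bits
    return min(position, num_sets - 1)             # cap at the last available set

schedule = "".join("ABC"[hanoi_media(day, 3)] for day in range(1, 9))
print(schedule)  # ABACABAC
```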

Similar Technologies
GFS Scheme, Simple Rotation, Linear Rotation, Fibonacci Backup, Custom Rotation

First In First Out (FIFO)

Simple rotation where oldest backup is overwritten first. Easy to understand and implement but doesn't provide varied retention periods. Best for simple scenarios with fixed retention requirements like 30-day rolling backups.
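
A 30-day rolling window maps directly onto a bounded queue. The sketch below uses collections.deque with maxlen, which drops the oldest entry when a new one is appended to a full queue; the backup labels are made up.

```python
from collections import deque

def rolling_backups(labels: list[str], retention: int) -> list[str]:
    """Return the backups still retained after applying FIFO rotation."""
    kept: deque[str] = deque(maxlen=retention)
    for label in labels:
        kept.append(label)  # when full, the oldest backup is discarded automatically
    return list(kept)

retained = rolling_backups([f"day{i}" for i in range(1, 36)], retention=30)
print(retained[0], retained[-1], len(retained))  # day6 day35 30
```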

Similar Technologies
GFS, LRU (Least Recently Used), Circular Buffer, Rolling Backup, Time-based Retention

DR Site Types

Hot Site

Fully operational duplicate environment with real-time or near-real-time data replication. Can take over immediately (RTO minutes), providing highest availability. Most expensive option with duplicate infrastructure, active database replication, and continuous synchronization.

Similar Technologies
Active-Active, Multi-Region Active, Synchronous Replication, Zero-Downtime DR, Real-time Failover

Warm Site

Partially configured environment with infrastructure ready but not fully synchronized. Systems are running but may need data restoration and configuration updates (RTO hours). Balance between cost and recovery speed. Common for non-critical systems.

Similar Technologies
Standby Site, Pilot Light, Pre-staged Infrastructure, Semi-active DR, Reduced Scale Standby

Cold Site

Facility with power, cooling, and network connectivity but no pre-installed hardware or data. Requires hardware procurement, installation, and full data restore (RTO days/weeks). Lowest cost option suitable for non-time-critical systems and data archives.

Similar Technologies
Backup Site, Shell Site, Empty Datacenter, Disaster Recovery Space, Emergency Site

Pilot Light

Minimal version of environment with core systems running (like database replicas) but application servers off. Can scale up quickly when needed (RTO tens of minutes to hours). Cost-effective approach for AWS/Azure/GCP using minimal compute until disaster.

Similar Technologies
Warm Standby, Minimal Standby, Core Systems Active, Quick Recovery, Scalable DR

Backup & Restore

Most basic DR strategy relying on regular backups to cloud or offsite storage. Infrastructure provisioned only during disaster. Longest RTO (hours to days) but lowest cost. Suitable for non-critical systems with relaxed recovery requirements.

Similar Technologies
Cold Backup, Archive Recovery, From-Scratch Rebuild, Backup-only Strategy, Cloud Restore

Multi-Region Active-Active

Fully redundant production environments in multiple geographic regions serving traffic simultaneously. Near-zero RTO with automatic failover. Highest cost but provides both DR and performance benefits through geographic load distribution, which lowers latency for users near each region.

Similar Technologies
Hot Site, Geo-Distributed, Global Load Balancing, Multi-Master, Active Everywhere

DR Testing & Procedures

Tabletop Exercise

Discussion-based DR test where team walks through recovery procedures without actual failover. Low-risk way to identify gaps in documentation, clarify roles, and improve processes. Typically done quarterly. Focuses on communication and decision-making.

Similar Technologies
Paper Test, Walkthrough, Disaster Simulation, Planning Review, Scenario Discussion

Simulation Testing

Practice recovery procedures in test environment mimicking production. Tests backup restoration, system recovery, and team coordination without impacting production. Validates RTOs and RPOs. More realistic than tabletop but less risky than full test.

Similar Technologies
Mock Disaster, Non-Production Test, DR Rehearsal, Recovery Drill, Controlled Test

Parallel Testing

Recovery systems run alongside production without cutover. Validates backup integrity, recovery procedures, and system functionality without production impact. Tests RTO achievement and application functionality in DR environment.

Similar Technologies
Shadow Testing, Dual Operation, Side-by-side Test, Non-disruptive Test, Verification Test

Full Interruption Test

Complete failover to DR site with actual production cutover during maintenance window. Most realistic test proving DR capability end-to-end. High-risk but validates actual recovery including network switching, DNS changes, and user access. Done annually.

Similar Technologies
Complete Failover, Production Cutover, Real DR Event, Live Failover Test, Total Test

Failover Procedure

Documented process for switching from primary to secondary site during disaster. Includes trigger conditions, authorization workflow, technical steps (DNS changes, load balancer updates, database promotion), validation checkpoints, and communication protocols.
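
A documented procedure like this can be encoded as ordered steps, each paired with a validation checkpoint, behind an authorization gate. The runner below is a sketch only: the step order, the in-memory state, and all names are hypothetical stand-ins for real infrastructure calls.

```python
from typing import Callable

def run_failover(
    steps: list[tuple[str, Callable[[], None], Callable[[], bool]]],
    authorized: bool,
) -> list[str]:
    """Run (name, action, validate) steps in order, stopping at the first failed check."""
    if not authorized:
        raise PermissionError("failover has not been authorized")
    log = []
    for name, action, validate in steps:
        action()
        if not validate():
            log.append(f"FAILED {name}")
            break  # stop here so an operator can intervene
        log.append(f"ok {name}")
    return log

# Hypothetical in-memory state standing in for a database, load balancer, and DNS.
state = {"db_role": "replica", "lb_pool": "primary", "dns_target": "primary"}

steps = [
    ("promote database", lambda: state.update(db_role="primary"), lambda: state["db_role"] == "primary"),
    ("update load balancer", lambda: state.update(lb_pool="dr"), lambda: state["lb_pool"] == "dr"),
    ("switch DNS", lambda: state.update(dns_target="dr"), lambda: state["dns_target"] == "dr"),
]

log = run_failover(steps, authorized=True)
print(log)  # ['ok promote database', 'ok update load balancer', 'ok switch DNS']
```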

Similar Technologies
Cutover Process, Site Switch, DR Activation, Disaster Declaration, Emergency Failover

Failback Procedure

Process for returning to primary site after disaster recovery. Often more complex than failover requiring data synchronization from DR to primary, reverse replication, testing, and scheduled cutback. Must avoid data loss during transition.

Similar Technologies
Recovery Procedure, Return to Normal, Site Restoration, Primary Site Recovery, Cutback Process

Multi-Region DR Patterns

Active-Passive (Backup & Restore)

Primary region serves all traffic with regular backups to secondary region. During disaster, restore from backup and redirect traffic. Longest RTO but lowest cost. Good for non-critical systems. RPO depends on backup frequency.

Similar Technologies
Cold DR, Backup Strategy, Single Active Region, Restore from Backup, Passive Standby

Active-Passive (Pilot Light)

Core infrastructure (databases with replication) running in secondary region with application servers stopped. During disaster, launch servers and redirect traffic. Medium RTO (minutes to hours) and cost. Common pattern for cost-effective DR.

Similar Technologies
Minimal DR, Database Standby, Core Systems Ready, Quick Start DR, Infrastructure Ready

Active-Passive (Warm Standby)

Fully functional secondary region running at reduced capacity. During disaster, scale up and redirect traffic. Lower RTO (minutes) than pilot light. Can also serve read-only queries, which offsets part of its cost. Balances cost and recovery speed.

Similar Technologies
Scaled-down Active, Partial Standby, Ready Reserve, Reduced Scale, Quick Scale DR

Active-Active (Multi-Region)

Both regions serve production traffic simultaneously with geographic load balancing (Route 53, Traffic Manager). Near-zero RTO with automatic failover. Highest cost but provides performance benefits. Requires data synchronization and conflict resolution.
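
Conflict resolution can take many forms. As one simple illustration, the sketch below merges two regions' copies with a last-write-wins policy based on timestamps. This policy is an assumption chosen for brevity, not something the pattern prescribes, and it silently discards the older of two concurrent writes.

```python
def merge_lww(
    region_a: dict[str, tuple[float, str]],
    region_b: dict[str, tuple[float, str]],
) -> dict[str, tuple[float, str]]:
    """Merge two key -> (timestamp, value) maps, keeping the newer write per key.

    On equal timestamps the value from region_a is kept.
    """
    merged = dict(region_a)
    for key, (timestamp, value) in region_b.items():
        if key not in merged or timestamp > merged[key][0]:
            merged[key] = (timestamp, value)
    return merged

us_east = {"cart": (10.0, "2 items"), "email": (5.0, "old@example.com")}
eu_west = {"cart": (12.0, "3 items"), "email": (4.0, "older@example.com"), "theme": (1.0, "dark")}

merged = merge_lww(us_east, eu_west)
print(merged["cart"])  # (12.0, '3 items')
```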

Similar Technologies
Multi-Master, Geo-Distributed, Global Active, Dual Active, Always On

Database Replication Patterns

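Asynchronous or synchronous replication between regions. Asynchronous replication has lower cost and write latency but can lose writes that had not yet reached the replica when the primary failed. Synchronous replication avoids losing acknowledged writes but adds write latency. Choose based on RPO requirements. Options include Amazon RDS, Aurora Global Database, and Cosmos DB global replication.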

Similar Technologies
Master-Slave, Log Shipping, Change Data Capture, Bidirectional Replication, Multi-Master Sync

DNS Failover

Health checks monitor primary region and automatically update DNS records to point to DR region during failure. Simple, but resolvers cache the old record until its TTL expires, which delays switchover. Use Route 53 health checks, Azure Traffic Manager, or Cloud DNS with low TTL values for faster switchover.
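
The delay has two parts: the time for health checks to declare the primary down, and the time for cached DNS answers to expire. The sketch below models both. The class, the threshold of consecutive failures, and the addresses (taken from documentation-reserved ranges) are hypothetical and do not reflect any provider's API.

```python
def worst_case_switchover_seconds(check_interval_s: int, failure_threshold: int, ttl_s: int) -> int:
    """Detection time (consecutive failed checks) plus the longest a cached record can live."""
    return check_interval_s * failure_threshold + ttl_s

class DnsFailover:
    """Switch the served record to the DR target after N consecutive failed checks."""

    def __init__(self, primary: str, secondary: str, failure_threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.active = primary

    def record_health_check(self, healthy: bool) -> str:
        """Record one health check result and return the target currently served."""
        if healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = self.secondary  # stays here; failback is a separate procedure
        return self.active

dns = DnsFailover("192.0.2.10", "198.51.100.10", failure_threshold=3)
results = [dns.record_health_check(ok) for ok in (True, False, False, True, False, False, False)]

print(results[-1])                                   # 198.51.100.10
print(worst_case_switchover_seconds(30, 3, 60))      # 150
```

A single healthy check resets the counter, so the two isolated failures in the sample do not trigger a switch; only the final run of three does.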

Similar Technologies
Load Balancer Failover, Global Load Balancer, Anycast Routing, BGP Failover, Manual DNS Update