Disaster Recovery
Core Concepts
RTO (Recovery Time Objective): Maximum acceptable length of time that a system can be down after a failure or disaster. RTO defines how quickly systems must be restored. For example, an RTO of 4 hours means systems must be back online within 4 hours of an outage.
RPO (Recovery Point Objective): Maximum acceptable amount of data loss, measured in time. RPO defines how much data you can afford to lose. For example, an RPO of 15 minutes means you can lose at most the last 15 minutes of data, requiring backups or replication at least every 15 minutes.
MTTR (Mean Time To Repair): Average time required to repair and restore a system to operational status after a failure. MTTR includes detection, diagnosis, repair, and verification time. Lower MTTR indicates better recovery processes and automation.
MTBF (Mean Time Between Failures): Average time between system failures, measuring system reliability. Higher MTBF indicates a more reliable system. Used to predict failure rates and plan maintenance windows. Combined with MTTR to calculate overall availability: availability = MTBF / (MTBF + MTTR).
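The availability relationship (MTBF over MTBF plus MTTR) can be sketched numerically; the MTBF and MTTR figures below are illustrative, not from any real system:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A system that fails on average every 1,000 hours and takes 4 hours to repair:
print(f"{availability(1000, 4):.4%}")  # 99.6016%
```

Note that improving MTTR (faster recovery) raises availability just as effectively as improving MTBF (fewer failures), which is why automation of detection and repair matters.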
Backup Strategies
Full Backup: Complete copy of all data at a specific point in time. Provides the fastest restore since all data is in one place, but takes the longest to create and requires the most storage. Typically performed weekly or monthly as the baseline for incremental or differential backups.
Incremental Backup: Backs up only data that has changed since the last backup (full or incremental). Fastest backup time and least storage, but slower to restore because multiple backup sets must be applied sequentially. Commonly used for daily backups.
Differential Backup: Backs up all data changed since the last full backup. Faster to restore than incremental (only the full plus the latest differential are needed), but slower to create and uses more storage than incremental. A good balance between backup and restore speed.
Continuous Data Protection (CDP): Real-time or near-real-time backup capturing every change to data as it occurs. Provides a very low RPO (seconds or less) and enables point-in-time recovery to any moment. Used for mission-critical databases and systems requiring minimal data loss.
Snapshot: Point-in-time copy of data using copy-on-write or redirect-on-write technology. Creates an instant backup without initially copying all data. Efficient for virtual machines and storage arrays; space-efficient but dependent on the source storage.
Cloud Backup: Offsite backup to cloud storage for geographic redundancy and disaster recovery. Provides scalability, high durability (e.g., eleven nines with S3), and pay-as-you-go pricing. Supports versioning, encryption, and compliance features. Common services include AWS S3, Azure Blob Storage, and Google Cloud Storage.
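The restore-chain difference between incremental and differential backups can be sketched as a toy model; the (day, type) tuples are hypothetical, and a single strategy per chain is assumed:

```python
def restore_chain(backups, target_day):
    """Return the backup sets needed, in order, to restore to target_day.

    `backups` is a chronological list of (day, type) tuples, where type is
    "full", "incremental", or "differential".
    """
    history = [b for b in backups if b[0] <= target_day]
    # Find the most recent full backup at or before the target day.
    full_idx = max(i for i, b in enumerate(history) if b[1] == "full")
    chain = [history[full_idx]]
    for b in history[full_idx + 1:]:
        if b[1] == "incremental":
            chain.append(b)                 # every incremental since the full
        elif b[1] == "differential":
            chain = [history[full_idx], b]  # only full + latest differential
    return chain

weekly = [(0, "full"), (1, "incremental"), (2, "incremental"), (3, "incremental")]
print(restore_chain(weekly, 3))  # full plus all three incrementals
```

The same scenario with differentials would need only two sets (the full and the latest differential), which is the restore-speed trade-off described above.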
Backup Rotation Schemes
3-2-1 Rule: Best practice requiring 3 copies of data, on 2 different media types, with 1 copy offsite. For example: production data plus a local backup plus a cloud backup, on different storage technologies. Protects against hardware failure, site disasters, and data corruption.
Grandfather-Father-Son (GFS): Hierarchical rotation with daily (son), weekly (father), and monthly (grandfather) backups retained for different periods. For example: 7 daily, 4 weekly, 12 monthly. Balances retention requirements against storage costs while meeting compliance needs.
Tower of Hanoi: Mathematical rotation scheme using powers of 2 for backup scheduling. Provides exponential retention depth with efficient media usage. More complex than GFS but optimizes tape/media rotation. Rarely used in modern cloud environments but efficient for tape systems.
FIFO (First In, First Out): Simple rotation where the oldest backup is overwritten first. Easy to understand and implement but doesn't provide varied retention periods. Best for simple scenarios with fixed retention requirements, such as 30-day rolling backups.
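A GFS retention policy like the 7/4/12 example above can be sketched as a pruning function. Assumptions in this sketch: one backup per day, ISO weeks define "father" copies, and calendar months define "grandfather" copies:

```python
from datetime import date, timedelta

def gfs_keep(backup_dates, today, daily=7, weekly=4, monthly=12):
    """Return the set of backup dates a GFS policy would retain."""
    keep = {d for d in backup_dates if (today - d).days < daily}  # sons
    by_week, by_month = {}, {}
    for d in sorted(backup_dates):
        by_week[d.isocalendar()[:2]] = d   # latest backup in each ISO week
        by_month[(d.year, d.month)] = d    # latest backup in each month
    keep.update(sorted(by_week.values())[-weekly:])    # fathers
    keep.update(sorted(by_month.values())[-monthly:])  # grandfathers
    return keep

# 90 consecutive daily backups, pruned on the last day:
days = [date(2024, 1, 1) + timedelta(n) for n in range(90)]
kept = gfs_keep(days, today=date(2024, 3, 30))
print(len(kept))  # 11 of 90 backups retained
```

The storage saving is the point of the scheme: a rolling window of 90 daily backups collapses to roughly a dozen retained copies while still covering daily, weekly, and monthly restore points.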
DR Site Types
Hot Site: Fully operational duplicate environment with real-time or near-real-time data replication. Can take over immediately (RTO of minutes), providing the highest availability. Most expensive option, with duplicate infrastructure, active database replication, and continuous synchronization.
Warm Site: Partially configured environment with infrastructure ready but not fully synchronized. Systems are running but may need data restoration and configuration updates (RTO of hours). A balance between cost and recovery speed, common for non-critical systems.
Cold Site: Facility with power, cooling, and network connectivity but no pre-installed hardware or data. Requires hardware procurement, installation, and a full data restore (RTO of days to weeks). Lowest-cost option, suitable for non-time-critical systems and data archives.
Pilot Light: Minimal version of the environment with core systems running (such as database replicas) but application servers off. Can scale up quickly when needed (RTO of tens of minutes to hours). Cost-effective approach on AWS/Azure/GCP, using minimal compute until a disaster occurs.
Backup and Restore: Most basic DR strategy, relying on regular backups to cloud or offsite storage. Infrastructure is provisioned only during a disaster. Longest RTO (hours to days) but lowest cost. Suitable for non-critical systems with relaxed recovery requirements.
Multi-Site Active/Active: Fully redundant production environments in multiple geographic regions serving traffic simultaneously. Near-zero RTO with automatic failover. Highest cost, but provides both DR and performance benefits through geographic load distribution and lower latency.
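Choosing among these tiers is essentially a trade of cost against RTO. A hypothetical decision helper, where the worst-case RTO bounds are rough illustrative figures rather than guarantees:

```python
def cheapest_tier(required_rto_minutes):
    """Pick the cheapest DR tier whose assumed worst-case RTO meets the requirement."""
    tiers = [  # ordered cheapest first; (name, assumed worst-case RTO in minutes)
        ("backup and restore", 72 * 60),   # hours to days
        ("pilot light", 4 * 60),           # tens of minutes to hours
        ("warm site", 30),                 # minutes
        ("hot site / active-active", 1),   # near-zero
    ]
    for name, worst_rto in tiers:
        if worst_rto <= required_rto_minutes:
            return name
    return tiers[-1][0]  # only the most expensive tier can meet a near-zero RTO

print(cheapest_tier(8 * 60))  # pilot light
```

Real selection also depends on RPO, compliance, and operational maturity; this sketch captures only the cost-versus-RTO axis.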
DR Testing & Procedures
Tabletop Exercise: Discussion-based DR test in which the team walks through recovery procedures without an actual failover. Low-risk way to identify gaps in documentation, clarify roles, and improve processes. Typically done quarterly; focuses on communication and decision-making.
Simulation Test: Practice of recovery procedures in a test environment mimicking production. Tests backup restoration, system recovery, and team coordination without impacting production. Validates RTOs and RPOs. More realistic than a tabletop exercise but less risky than a full test.
Parallel Test: Recovery systems are brought up and run alongside production without a cutover. Validates backup integrity, recovery procedures, and system functionality without production impact. Tests RTO achievement and application functionality in the DR environment.
Full Interruption Test: Complete failover to the DR site with an actual production cutover, usually during a maintenance window. The most realistic test, proving DR capability end to end. High-risk, but validates actual recovery including network switching, DNS changes, and user access. Typically done annually.
Failover Procedure: Documented process for switching from the primary to the secondary site during a disaster. Includes trigger conditions, an authorization workflow, technical steps (DNS changes, load balancer updates, database promotion), validation checkpoints, and communication protocols.
Failback Procedure: Process for returning to the primary site after disaster recovery. Often more complex than failover, requiring data synchronization from DR back to primary, reverse replication, testing, and a scheduled cutback. Must avoid data loss during the transition.
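A failover procedure like the one described above can be modeled as an ordered runbook with a validation checkpoint after each step. This is a structural sketch only: the step names are examples and the no-op lambdas stand in for real cloud API calls:

```python
def run_failover(steps):
    """Run (name, action, check) steps in order; stop at the first failed checkpoint."""
    completed = []
    for name, action, check in steps:
        action()                      # perform the technical step
        if not check():               # validation checkpoint before continuing
            return completed, f"checkpoint failed: {name}"
        completed.append(name)
    return completed, "failover complete"

runbook = [
    ("promote DR database replica",  lambda: None, lambda: True),
    ("update load balancer targets", lambda: None, lambda: True),
    ("switch DNS to DR site",        lambda: None, lambda: True),
    ("validate user-facing health",  lambda: None, lambda: True),
]
print(run_failover(runbook)[1])  # failover complete
```

Stopping at the first failed checkpoint, rather than charging ahead, is the key design point: a half-applied failover (DNS switched but database not promoted) is worse than a cleanly aborted one.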
Multi-Region DR Patterns
Backup and Restore: Primary region serves all traffic, with regular backups to a secondary region. During a disaster, restore from backup and redirect traffic. Longest RTO but lowest cost; good for non-critical systems. RPO depends on backup frequency.
Pilot Light: Core infrastructure (e.g., databases with replication) runs in the secondary region with application servers stopped. During a disaster, launch the servers and redirect traffic. Medium RTO (minutes to hours) and medium cost. A common pattern for cost-effective DR.
Warm Standby: Fully functional secondary region running at reduced capacity. During a disaster, scale up and redirect traffic. Lower RTO (minutes) than pilot light. Can also serve read-only queries to offset its cost. Balances cost and recovery speed.
Active-Active: Both regions serve production traffic simultaneously with geographic load balancing (Route 53, Traffic Manager). Near-zero RTO with automatic failover. Highest cost, but also provides performance benefits. Requires data synchronization and conflict resolution.
Cross-Region Replication: Asynchronous or synchronous replication between regions. Asynchronous is cheaper and adds no write latency but allows some data loss; synchronous guarantees zero data loss but adds write latency. Choose based on RPO requirements. Examples: RDS cross-region read replicas, Aurora Global Database, Cosmos DB global replication.
DNS Failover: Health checks monitor the primary region and automatically update DNS records to point to the DR region during a failure. Simple, but DNS propagation causes delay. Use Route 53 health checks, Traffic Manager, or Cloud DNS with low TTL values for faster switchover.
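The DNS failover delay mentioned above can be estimated as detection time plus client cache time. A back-of-the-envelope model, with illustrative health-check and TTL figures:

```python
def worst_case_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Worst-case switchover: failures must accumulate, then client DNS caches must expire."""
    detection = check_interval_s * failure_threshold  # time to declare primary unhealthy
    return detection + ttl_s                          # plus TTL before clients re-resolve

# 30-second checks, 3 consecutive failures to trip, 60-second TTL:
print(worst_case_failover_seconds(30, 3, 60))  # 150
```

This is why low TTLs matter for DNS failover: with a 1-hour TTL the same health-check configuration would leave some clients on the dead primary for over an hour.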
