In the context of AWS databases, how do the following disaster recovery strategies differ from one another:
- point-in-time recovery
- backup
- snapshot
- Aurora backtrack
When should we choose one over the others?
Why do we need so many different options when one will suffice?
Should we try to use all of them?
3 Answers
One key difference between a manual snapshot and an automated backup is that a manual snapshot doesn't expire (it is kept until you delete it), whereas automated backups are retained for a maximum of 35 days.
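As a rough sketch of that difference in practice, the helpers below only build the keyword arguments you would pass to boto3's `modify_db_instance` (automated backup retention) and `create_db_snapshot` (manual snapshot). The helper names, the instance name `orders-db`, and the snapshot naming scheme are made up for illustration; the code does not call AWS itself.

```python
from datetime import datetime

MAX_AUTOMATED_RETENTION_DAYS = 35  # upper limit for automated backup retention


def retention_request(instance_id: str, days: int) -> dict:
    """Build kwargs for rds.modify_db_instance (hypothetical helper)."""
    if not 0 <= days <= MAX_AUTOMATED_RETENTION_DAYS:
        raise ValueError("retention must be between 0 and 35 days")
    return {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": days,  # 0 disables automated backups
        "ApplyImmediately": True,
    }


def manual_snapshot_request(instance_id: str, now: datetime) -> dict:
    """Build kwargs for rds.create_db_snapshot (hypothetical helper)."""
    stamp = now.strftime("%Y%m%d-%H%M%S")  # illustrative naming scheme
    return {
        "DBInstanceIdentifier": instance_id,
        "DBSnapshotIdentifier": f"{instance_id}-manual-{stamp}",
    }


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   rds = boto3.client("rds")
#   rds.modify_db_instance(**retention_request("orders-db", 35))
#   rds.create_db_snapshot(**manual_snapshot_request("orders-db", datetime.now()))
```

The manual snapshot has no retention setting at all, which is the point: it stays until someone deletes it.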
When you enable automated backups for your AWS database, AWS takes periodic backups of your database and stores them in Amazon S3. These backups serve as the starting point for point-in-time recovery (PITR). AWS also keeps the transaction logs in S3 for the backup retention period (up to 35 days), allowing you to perform PITR to any point within that timeframe.
When you initiate a PITR restore operation, AWS uses the selected backup and the transaction logs to restore your database to the desired point in time. AWS first restores the backup into a new database instance and then applies the relevant transactions from the transaction logs on top of it. This brings the database to the desired point in time, allowing you to recover your data as it existed at that time.
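The restore-then-replay idea can be illustrated with a toy model. This is not how RDS is implemented internally; it is a minimal sketch in which the "database" is a dictionary, the "transaction log" is a list of timestamped changes, and all names (`LogEntry`, `restore_to_point_in_time`) are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    timestamp: int   # illustrative clock value, not a real AWS field
    key: str
    value: object    # None means "delete this key"


def restore_to_point_in_time(backup_state, backup_time, log, target_time):
    """Restore a backup, then replay logged changes up to target_time."""
    if target_time < backup_time:
        raise ValueError("target is older than the chosen backup")
    state = dict(backup_state)  # step 1: start from a copy of the backup
    for entry in sorted(log, key=lambda e: e.timestamp):
        # step 2: apply only changes made after the backup, up to the target
        if backup_time < entry.timestamp <= target_time:
            if entry.value is None:
                state.pop(entry.key, None)
            else:
                state[entry.key] = entry.value
    return state


backup = {"alice": 100, "bob": 50}   # full backup taken at t=1000
log = [
    LogEntry(1010, "alice", 120),
    LogEntry(1020, "carol", 10),
    LogEntry(1030, "bob", None),     # accidental delete at t=1030
]
restored = restore_to_point_in_time(backup, 1000, log, target_time=1025)
print(restored)  # {'alice': 120, 'bob': 50, 'carol': 10}
```

Choosing a target time just before the accidental delete recovers `bob` while keeping the earlier legitimate changes.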
Aurora Backtrack lets you undo unintended or incorrect changes by rewinding the database cluster in place to a specific point in time, without restoring from a backup. This makes rollbacks fast, because no new database instance has to be created. However, Backtrack is only available for Aurora MySQL-Compatible clusters, and the maximum backtrack window is 72 hours, so you can only rewind to a point within the last 72 hours. This is because Backtrack relies on retained change records, and Aurora keeps them only for the configured backtrack window.
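A small sketch of the 72-hour constraint: the hypothetical helper below checks that the requested time falls inside the backtrack window and then builds the keyword arguments for boto3's `backtrack_db_cluster`. The helper name and the cluster name in the usage comment are assumptions, and the check is a client-side convenience only; Aurora enforces the real window itself.

```python
from datetime import datetime, timedelta, timezone

MAX_BACKTRACK_WINDOW = timedelta(hours=72)  # service-wide upper limit


def backtrack_request(cluster_id, target, now, window=MAX_BACKTRACK_WINDOW):
    """Build kwargs for rds.backtrack_db_cluster (hypothetical helper)."""
    if window > MAX_BACKTRACK_WINDOW:
        raise ValueError("backtrack window cannot exceed 72 hours")
    if target > now:
        raise ValueError("cannot backtrack to a time in the future")
    if now - target > window:
        # Too far back for Backtrack: a point-in-time restore is needed instead.
        raise ValueError("target is outside the backtrack window")
    return {"DBClusterIdentifier": cluster_id, "BacktrackTo": target}


# Possible usage, assuming boto3 is installed and Backtrack is enabled
# on an Aurora MySQL cluster named "orders-cluster" (made-up name):
#   import boto3
#   now = datetime.now(timezone.utc)
#   kwargs = backtrack_request("orders-cluster", now - timedelta(hours=2), now)
#   boto3.client("rds").backtrack_db_cluster(**kwargs)
```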
‘Disaster Recovery’ is very old-world. It implies having to fail-over when a problem happens. In the cloud, however, you can focus on High Availability so that systems can recover automatically when there is a failure, without the need to ‘fail-back’ to the original system.
Therefore, the best option is not to do disaster recovery at all.
Instead, take advantage of the cloud-first design of Amazon Aurora, which automatically replicates data across multiple Availability Zones (each consisting of one or more separate data centers).
From High availability for Amazon Aurora – Amazon Aurora:
If you want to use a traditional database instead (e.g. SQL Server), you can use Amazon RDS to run a Multi-AZ database. This consists of two database servers in the same Region but in different Availability Zones (which means different data centers):
If a failure happens with the Primary server, the Secondary server becomes the new Primary server. There is a brief outage, but no data is lost. The RDS service will then launch a new Secondary server.
Failure recovery vs Data recovery
The other options you mention (point-in-time recovery, snapshots) are focussed on recovering data that was in the database at a particular time. This is normally because somebody or something accidentally deleted or changed data and you wish to recover the data as it was at a previous time. It is good to combine both High Availability and Snapshots, although Amazon Aurora almost makes Snapshots irrelevant due to its ability to go back to a previous point in time.
Bottom line: Instead of Disaster Recovery, think High Availability.
First of all, you need to identify the Recovery Time Objective (RTO) and
Recovery Point Objective (RPO) for
your workload. RTO is the amount of time from a disaster event to when your
system must be fully operational again. RPO is the maximum amount of data loss
that you can tolerate after a disaster event. These objectives help you
determine the appropriate level of risk and cost for your disaster recovery (DR)
plan.
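To make these two objectives concrete, here is a minimal sketch that compares a recovery setup against an RPO and RTO. The function names and all the figures (daily backups, logs shipped every 5 minutes, a 2-hour restore) are illustrative assumptions, not measurements or AWS guarantees.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, log_shipping_interval=None):
    """Worst-case age of the newest recoverable point when disaster strikes."""
    # With backups alone you can lose everything since the last backup.
    # With transaction logs you can only lose what has not been shipped yet.
    if log_shipping_interval is None:
        return backup_interval
    return min(backup_interval, log_shipping_interval)


def meets_objectives(rpo, rto, data_loss, restore_time):
    """True when both the data-loss and downtime targets are satisfied."""
    return data_loss <= rpo and restore_time <= rto


rpo = timedelta(minutes=15)   # tolerate at most 15 minutes of lost data
rto = timedelta(hours=4)      # must be operational again within 4 hours

snapshots_only = worst_case_data_loss(timedelta(hours=24))
with_pitr = worst_case_data_loss(timedelta(hours=24), timedelta(minutes=5))

# Assumed restore duration of 2 hours for both setups.
print(meets_objectives(rpo, rto, snapshots_only, timedelta(hours=2)))  # False
print(meets_objectives(rpo, rto, with_pitr, timedelta(hours=2)))       # True
```

Under these assumed numbers, daily snapshots alone miss a 15-minute RPO, while the same backups combined with frequently shipped transaction logs meet it.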
According to AWS
documentation,
there are four main DR strategies that you can use on AWS:
- Backup and restore: you back up your data and restore it after
  disaster strikes. This is low-cost but high-risk, as it has a high RTO and
  RPO.
- Pilot light: you keep a minimal version of your core infrastructure ready and
  scale up when needed. This reduces the RTO and RPO but requires some manual
  intervention.
- Warm standby: you keep a scaled-down but functional copy of your environment
  that can handle minimal traffic. This allows you to switch over quickly with
  minimal downtime. This further reduces the RTO and RPO but increases the cost
  and complexity.
- Multi-site active/active: you run your workload in several sites at once, with
  load balancing and synchronization. This provides the highest availability
  and resilience, as well as the lowest RTO and RPO possible. However, this
  also requires the most cost and complexity.
Your question only focuses on different backup and restore strategies. They are
all different ways of restoring your database state from a specific point in
time using AWS services such as Amazon Relational Database Service (RDS), Amazon
Aurora, or Amazon DynamoDB.
However, these options do not cover other aspects of DR such as scaling up
resources, switching over traffic, or synchronizing data across Regions. Some
services like Amazon Aurora natively support multi-site active/active DR, but
others like RDS do not. Therefore, you need to first focus on the RTO and RPO
objectives for your workload before choosing a DR strategy. Also please refer to
Disaster Recovery on AWS.