- Disaster recovery
Disaster recovery is the process of resuming normal operations after a disaster by regaining access to data, hardware, software, networking equipment, power, and connectivity. Disaster recovery is a subset of business continuity.
A major disaster could include, for example:
- An AWS Region becoming unavailable
- The AWS account being compromised and its resources deleted
This is the process of recovering the application in a different AWS account and/or region from the one where it is currently deployed.
- Texas sets up a new AWS account and/or region, separate from the one currently in use.
- Set up the Terraform prerequisites in the new AWS account/region:
  - S3 bucket for Terraform state
  - DynamoDB table for Terraform state locking
- Manually create the ECR repositories in the new AWS management account/region, because the ECR repositories are not yet managed by Terraform.
- Run Terraform in the new AWS account/region to create all the resources needed to run the application, in this order:
  - Deploy the shared resources Terraform stack first.
  - Deploy the first blue/green environment.
  - (Optional) Deploy the second blue/green environment.
  - Deploy the blue/green link Terraform stack.
- Give the new API key to the Profile Manager/NHS.UK team.
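The Terraform prerequisites can be scripted rather than created by hand. The following is a minimal sketch, assuming boto3 clients created with credentials for the target account/region; the function names and the bucket/table names you pass in are hypothetical. It is idempotent, leaving an existing resource in place, so the same functions could also cover the "(if needed)" prerequisite step when recovering in the same account. The `LockID` string partition key is what Terraform's S3 backend expects for its DynamoDB lock table; enabling bucket versioning is an extra suggestion, not something this plan states.

```python
# Sketch only: bootstrap the Terraform S3 backend prerequisites.
# Clients are boto3-style; bucket/table names are hypothetical placeholders.


def ensure_state_bucket(s3, bucket: str, region: str) -> bool:
    """Create the state bucket; return False if S3 reports it is already ours."""
    kwargs = {"Bucket": bucket}
    # us-east-1 rejects an explicit LocationConstraint; other regions need one.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    try:
        s3.create_bucket(**kwargs)
        created = True
    except s3.exceptions.BucketAlreadyOwnedByYou:
        created = False  # already ours, e.g. from an earlier partial run
    # Assumed extra: versioning keeps older copies of the state file.
    s3.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    return created


def ensure_lock_table(dynamodb, table: str) -> bool:
    """Create the Terraform lock table if missing; return True if created."""
    try:
        dynamodb.describe_table(TableName=table)
        return False  # table already exists
    except dynamodb.exceptions.ResourceNotFoundException:
        pass
    # Terraform's S3 backend locks on a string partition key named "LockID".
    dynamodb.create_table(
        TableName=table,
        KeySchema=[{"AttributeName": "LockID", "KeyType": "HASH"}],
        AttributeDefinitions=[{"AttributeName": "LockID", "AttributeType": "S"}],
        BillingMode="PAY_PER_REQUEST",
    )
    # Wait until the table is ACTIVE so `terraform init` can use it immediately.
    dynamodb.get_waiter("table_exists").wait(TableName=table)
    return True
```

For example, with boto3 installed, `ensure_state_bucket(boto3.client("s3", region_name="eu-west-2"), "example-terraform-state", "eu-west-2")`; the region and names are illustrative only.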
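Until the ECR repositories are brought into Terraform, the manual step can at least be made repeatable. A minimal sketch, assuming a boto3 ECR client authenticated against the management account; the function name and repository names are hypothetical, and `scanOnPush` is an illustrative choice rather than a documented setting of this project.

```python
# Sketch only: create the ECR repositories by hand until Terraform manages them.
# Repository names are hypothetical placeholders.
from typing import Iterable, List


def ensure_ecr_repositories(ecr, names: Iterable[str]) -> List[str]:
    """Create each named repository if it is missing; return the ones created."""
    created = []
    for name in names:
        try:
            ecr.create_repository(
                repositoryName=name,
                # Assumed setting: scan each image when it is pushed.
                imageScanningConfiguration={"scanOnPush": True},
            )
            created.append(name)
        except ecr.exceptions.RepositoryAlreadyExistsException:
            pass  # already present, e.g. from an earlier partial run
    return created
```

Because existing repositories are skipped, rerunning it after a partial failure only creates what is still missing.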
This is the process of recovering the application within the same AWS account and region where it is currently deployed.
- (If needed) Set up the Terraform prerequisites in the AWS account/region:
  - S3 bucket for Terraform state
  - DynamoDB table for Terraform state locking
- (If needed) Manually create the ECR repositories in the AWS management account/region, because the ECR repositories are not yet managed by Terraform.
- Run Terraform in the current AWS account/region. Terraform will create any resources needed to run the application that no longer exist. Apply the stacks in this order:
  - Deploy the shared resources Terraform stack first.
  - Deploy the first blue/green environment.
  - (Optional) Deploy the second blue/green environment.
  - Deploy the blue/green link Terraform stack.
- (If needed) Give the new API key to the Profile Manager/NHS.UK team.
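Both procedures apply the Terraform stacks in the same order. The sketch below scripts that order in Python under several assumptions: the stack directory paths and stack names are hypothetical, each stack's `backend "s3"` block is assumed to be partially configured so that `-backend-config` can point it at the state bucket and lock table, and the state key layout is made up for illustration. It defaults to a dry run that only prints the commands.

```python
# Sketch only: apply the Terraform stacks in the documented order.
# Stack paths and the state key layout are hypothetical placeholders.
import subprocess
from typing import List

SHARED = "shared-resources"
FIRST_ENV = "blue-green-1"
SECOND_ENV = "blue-green-2"
LINK = "blue-green-link"


def deploy_order(include_second_env: bool) -> List[str]:
    """Return the stacks in the order they must be applied."""
    stacks = [SHARED, FIRST_ENV]
    if include_second_env:
        stacks.append(SECOND_ENV)  # the optional second blue/green environment
    stacks.append(LINK)  # the link stack goes last
    return stacks


def terraform_commands(stack: str, bucket: str, lock_table: str,
                       region: str) -> List[List[str]]:
    """Build `init` (pointing the S3 backend at the given state) and `apply`."""
    chdir = f"-chdir=infrastructure/stacks/{stack}"
    init = [
        "terraform", chdir, "init",
        f"-backend-config=bucket={bucket}",
        f"-backend-config=key={stack}/terraform.tfstate",
        f"-backend-config=region={region}",
        f"-backend-config=dynamodb_table={lock_table}",
    ]
    # No -auto-approve: an operator reviews each plan before it is applied.
    apply = ["terraform", chdir, "apply"]
    return [init, apply]


def deploy(bucket: str, lock_table: str, region: str,
           include_second_env: bool = False,
           dry_run: bool = True) -> List[List[str]]:
    """Print (and unless dry_run, execute) every command; return them."""
    commands = [
        cmd
        for stack in deploy_order(include_second_env)
        for cmd in terraform_commands(stack, bucket, lock_table, region)
    ]
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises on failure, so later stacks are not applied
            # on top of one that failed.
            subprocess.run(cmd, check=True)
    return commands
```

Leaving out `-auto-approve` keeps a human in the loop for each plan, which seems prudent given that this plan has not yet been rehearsed.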
DynamoDB data is backed up using point-in-time recovery (PITR), so it can be restored to any point in time within the last 35 days. This is documented in Confluence.
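A restore request can be checked against the recovery window before it is sent. A minimal sketch, assuming hypothetical table names and a hypothetical helper name; the returned keys match the parameters of boto3's `restore_table_to_point_in_time`. The 35-day check is only a coarse guard: the exact restorable range is reported by `describe_continuous_backups` as `EarliestRestorableDateTime` and `LatestRestorableDateTime`. PITR restores into a new table rather than overwriting the source table.

```python
# Sketch only: validate a DynamoDB point-in-time restore request.
# Table names are hypothetical placeholders.
from datetime import datetime, timedelta, timezone
from typing import Optional

# Recovery window stated in this plan; the exact bounds come from
# DescribeContinuousBackups (EarliestRestorableDateTime/LatestRestorableDateTime).
RECOVERY_WINDOW = timedelta(days=35)


def build_restore_request(source_table: str, target_table: str,
                          restore_at: datetime,
                          now: Optional[datetime] = None) -> dict:
    """Return kwargs for boto3's restore_table_to_point_in_time."""
    if restore_at.tzinfo is None:
        raise ValueError("restore_at must be timezone-aware")
    if now is None:
        now = datetime.now(timezone.utc)
    if restore_at > now:
        raise ValueError("restore time is in the future")
    if now - restore_at > RECOVERY_WINDOW:
        raise ValueError("restore time is outside the 35-day recovery window")
    if target_table == source_table:
        # PITR always restores into a new table, never over the source table.
        raise ValueError("target table must differ from the source table")
    return {
        "SourceTableName": source_table,
        "TargetTableName": target_table,
        "RestoreDateTime": restore_at,
    }
```

With boto3, the result would be sent as `boto3.client("dynamodb").restore_table_to_point_in_time(**request)`.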
This is a suggested disaster recovery plan. It is not complete and should be used as a starting point for a full disaster recovery plan. In addition, it has not yet been tested, so problems are likely to be discovered when applying it.