Disaster Recovery in an AWS Microservice Environment

Jack Johnson
3 min read · Apr 26, 2023


Introduction

In the event of a major outage, or even the outage of a single component, it is crucial that the system can be restored to business as usual smoothly, quickly and without loss of data. In traditional monolithic solutions a disaster recovery (DR) plan is fairly straightforward: with code and data in a single place, the system can be restored from point-in-time backups and by redeploying from source code. In a microservice architecture, code and data are deliberately isolated to improve resiliency and reduce the risk of disaster, which makes traditional DR approaches harder to follow.

Purpose

In this example, the solution is built in an active/active (multi-region) architecture on AWS. Traffic flows simultaneously to duplicate deployments of the solution in multiple regions. This allows traffic to be routed to, and isolated in, chosen regions where needed.

This example does not reflect an exact microservice architecture, but shows how traffic is routed in an active/active environment, through a web application.
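As a rough illustration of the routing behaviour described above, the following Python sketch models how traffic might be shared between regions and redistributed when one region is marked unhealthy. It is a simulation only, not AWS code: the `Region` class, the `traffic_shares` function, the weights and the region names are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str             # illustrative region identifier
    weight: int           # relative share of traffic while healthy
    healthy: bool = True  # result of a (hypothetical) health check


def traffic_shares(regions: list[Region]) -> dict[str, float]:
    """Return the fraction of traffic each region receives.

    Unhealthy regions receive nothing; their share is redistributed
    across the healthy regions in proportion to their weights.
    """
    healthy = [r for r in regions if r.healthy]
    total = sum(r.weight for r in healthy)
    if total == 0:
        raise RuntimeError("no healthy region available")
    return {r.name: (r.weight / total if r.healthy else 0.0) for r in regions}


regions = [Region("eu-west-1", 50), Region("eu-west-2", 50)]
print(traffic_shares(regions))   # both regions serve traffic

regions[0].healthy = False       # simulate a regional outage
print(traffic_shares(regions))   # remaining region takes all traffic
```

In a real deployment this decision would typically be made by a managed routing layer driven by health checks rather than by application code; the sketch only shows the intended outcome.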

Generally this approach to disaster recovery is more complex and can be more costly, because multiple regions must be managed, but it can reduce the recovery time to near zero for most disasters.

If we adopt a true microservice architecture and utilise FaaS (Functions as a Service), the cost of the solution is predominantly in its runtime, so the cost difference between running an active/active solution and an active/passive solution is minimal.
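The cost argument can be sketched with a simple pay-per-use model. The traffic volume and price below are hypothetical placeholders, not AWS figures, and the model deliberately ignores data replication, cross-region transfer and any fixed costs, which would add to an active/active bill in practice.

```python
def monthly_runtime_cost(requests: int, price_per_million: float) -> float:
    """Pay-per-use cost: proportional to the number of requests served."""
    return requests / 1_000_000 * price_per_million


TOTAL_REQUESTS = 10_000_000   # hypothetical monthly traffic
PRICE_PER_MILLION = 2.0       # hypothetical flat price, not an AWS figure

# Active/passive: one region serves all traffic, the standby serves none.
active_passive = monthly_runtime_cost(TOTAL_REQUESTS, PRICE_PER_MILLION)

# Active/active: the same total traffic is split evenly across two regions.
active_active = 2 * monthly_runtime_cost(TOTAL_REQUESTS // 2, PRICE_PER_MILLION)

print(active_passive, active_active)
```

Under these assumptions the two totals are equal, because the same number of requests is billed in either layout; only where they run changes.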

Most of the disaster recovery actions can be automatic and require little manual intervention to execute. However, once the disaster has been mitigated, investigations and fixes must still be performed to fully recover the solution. The aim should always be to return the solution to active/active.

Definition of Terms

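Two key metrics of a DR solution are Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO is the maximum acceptable amount of data loss, measured as time before the outage; RTO is the maximum acceptable time to restore service after the outage. In mission-critical apps both are crucial, and they need to be tuned differently for different use cases.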
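To make the two metrics concrete, the sketch below takes RPO as the tolerated window of data loss and RTO as the tolerated downtime, then checks a single incident against both. The function names, timestamps and objectives are hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_recovery_point: datetime, failure_time: datetime) -> timedelta:
    """Data-loss window: time between the last recoverable state and the failure."""
    return failure_time - last_recovery_point


def achieved_rto(failure_time: datetime, restored_time: datetime) -> timedelta:
    """Downtime: time between the failure and service being restored."""
    return restored_time - failure_time


# Hypothetical incident timeline.
last_backup = datetime(2023, 4, 26, 11, 55)
failure = datetime(2023, 4, 26, 12, 0)
restored = datetime(2023, 4, 26, 12, 20)

# Hypothetical objectives for this use case.
rpo_objective = timedelta(minutes=15)
rto_objective = timedelta(minutes=15)

rpo_met = achieved_rpo(last_backup, failure) <= rpo_objective
rto_met = achieved_rto(failure, restored) <= rto_objective

print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

In this timeline 5 minutes of data are at risk, which is within the RPO, but the 20 minutes of downtime exceed the RTO, showing why the two objectives need to be considered separately.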

Roles and Responsibilities

Examples of personnel who may be responsible for implementing and using this DR plan are:

Scenarios

The following scenarios do not cover failures caused by changes made by personnel or by functional code. Breaking changes introduced by AWS services are also not covered by these scenarios, as these should be communicated and updated appropriately.

In all scenarios, the front end should communicate failures gracefully, with appropriate ways for users to report and resolve issues.

Risk Profile
