We all hope and pray that our systems will work all the time, that everything will function as expected and that we will never have outages. But that is not reality. Outages happen all the time. Hosts go down, networks fail and connectivity drops, and you should design your applications to tolerate all of this. Understanding the basics of AWS backup and recovery is crucial in these situations.
But what happens in the case of a catastrophic failure?
- Your whole datacenter goes down.
- AWS loses a complete region.
- Your account is compromised and someone starts deleting (or stealing) your resources.
Logical and Physical Segregation
To understand how to design for an underlying infrastructure failure, you first have to understand how logical and physical segregation are implemented in AWS.
AWS is divided into 11 different regions, which are located in North America, South America, Europe and Asia Pacific. They include:
| Region code | Region name |
|---|---|
| us-east-1 | US East (N. Virginia) |
| us-east-2 | US East (Ohio) |
| us-west-1 | US West (N. California) |
| us-west-2 | US West (Oregon) |
| ap-northeast-1 | Asia Pacific (Tokyo) |
| ap-northeast-2 | Asia Pacific (Seoul) |
| ap-southeast-1 | Asia Pacific (Singapore) |
| ap-southeast-2 | Asia Pacific (Sydney) |
| ap-south-1 | Asia Pacific (Mumbai) |
| sa-east-1 | South America (São Paulo) |
Each Amazon EC2 region is completely isolated from the other Amazon EC2 regions, which provides the greatest possible fault tolerance and stability. You see only the resources that are deployed in the region you are working in, and resources are not replicated across regions by default.
Communication between regions is done over the public internet.
Each region contains multiple separate locations known as Availability Zones. Each Availability Zone is isolated, but the Availability Zones in a region are connected through low-latency links. The following diagram illustrates the relationship between regions and Availability Zones.
Each Availability Zone is represented by a region code followed by a letter identifier, for example us-west-1a. To keep resources balanced across the zones in a region, AWS maps the zones to names independently for each account. Your us-west-1a is therefore not necessarily the same location as the us-west-1a of another account (even if that account is also yours), and you cannot assume that zone names are identical across accounts.
Note that the Availability Zones within a single region are connected by high-speed, low-latency networks. Traffic across regions and Availability Zones comes at a cost, and that must be factored into your design.
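The naming convention can be made concrete with a small helper. This is only a sketch: the function name `split_zone_name` and the regular expression are illustrative assumptions based on the region codes listed in this article, not an official AWS definition. Keep in mind that the letter it returns is only meaningful inside a single account.

```python
import re
from typing import Tuple

# Assumed shape of a zone name such as "us-west-1a":
# a region code ("us-west-1") followed by a single zone letter ("a").
_ZONE_NAME = re.compile(r"^(?P<region>[a-z]{2}(?:-[a-z]+)+-\d+)(?P<letter>[a-z])$")


def split_zone_name(zone_name: str) -> Tuple[str, str]:
    """Split an Availability Zone name into (region code, zone letter)."""
    match = _ZONE_NAME.match(zone_name)
    if match is None:
        raise ValueError(f"not an Availability Zone name: {zone_name!r}")
    return match.group("region"), match.group("letter")


if __name__ == "__main__":
    region, letter = split_zone_name("us-west-1a")
    print(f"region={region} zone letter={letter}")
```

Because the letter is assigned per account, a helper like this is useful for grouping zones by region, not for matching zones between accounts.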
Resources are dispersed across AWS in the following way.
| Resource | Scope |
|---|---|
| Key pairs | Global or Regional |
| Amazon EC2 resource identifiers | Regional |
| User-supplied resource names | Regional |
| Elastic IP addresses | Regional |
| EBS volumes | Availability Zone |
(Adapted from – AWS)
Protect yourself from Failure
AWS already provides you with the basic tenets of segregation that allow you to design and deploy your applications in the cloud.
Here are a few elementary design considerations that you should address when deploying in the cloud (assuming that your application can support them):
- Data should be local to where your application resides.
For example, the database that holds your customer information should be (at minimum) in the same region as your application. Spreading your application over multiple regions will add latency, slow response times and require a much more complicated deployment.
- Make use of Availability Zones for redundancy.
As mentioned before, each zone is isolated from the others, so it is highly recommended to deploy your workloads across multiple Availability Zones, which provides an additional layer of protection for your applications.
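As a rough illustration of spreading a workload over zones, the sketch below assigns instances round-robin and then checks whether the loss of any single zone still leaves enough capacity. The function names and the capacity rule are assumptions made for this example; real placement would be handled by your orchestration tooling.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Dict, List


def spread_across_zones(instance_count: int, zones: List[str]) -> Dict[str, int]:
    """Assign instances to zones round-robin and return the count per zone."""
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    if instance_count < 0:
        raise ValueError("instance_count must not be negative")
    placement = Counter(islice(cycle(zones), instance_count))
    # List every zone, including those that received no instance.
    return {zone: placement.get(zone, 0) for zone in zones}


def survives_zone_loss(placement: Dict[str, int], minimum_instances: int) -> bool:
    """True if losing any single zone still leaves at least minimum_instances."""
    total = sum(placement.values())
    return all(total - count >= minimum_instances for count in placement.values())


if __name__ == "__main__":
    plan = spread_across_zones(5, ["us-west-1a", "us-west-1b"])
    print(plan, survives_zone_loss(plan, minimum_instances=2))
```

The second function is the more useful one in a design review: it turns "we use multiple zones" into a concrete statement about how much capacity remains after a zone failure.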
- Orchestration is key.
Manual deployment of instances, services and applications is tedious and error-prone. Your deployments should be automated, and the code that provisions these services should be stored in a code repository.
- Security is vital
Access to any and all services within your cloud should be defined with RBAC (Role Based Access Control). AWS IAM is a granular solution that allows assigning very specific roles and permissions for access and management of your cloud.
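To show what "very specific" permissions can look like, here is a minimal sketch that builds an IAM policy document for a role that may write backups to one S3 bucket but may not delete objects or the bucket. The bucket name and the function name `backup_writer_policy` are hypothetical, and the choice of actions is an assumption for illustration; review any real policy against your own requirements.

```python
import json
from typing import Any, Dict


def backup_writer_policy(bucket_name: str) -> Dict[str, Any]:
    """Build a policy that can add objects to one bucket but cannot delete them."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "WriteBackups",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"{bucket_arn}/*",
            },
            {
                # An explicit Deny takes precedence over any Allow granted elsewhere.
                "Sid": "NeverDeleteBackups",
                "Effect": "Deny",
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket",
                ],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-backup-bucket" is a placeholder name, not a real bucket.
    print(json.dumps(backup_writer_policy("example-backup-bucket"), indent=2))
```

Note that `s3:PutObject` can still overwrite an existing object, so a policy like this protects backups properly only when bucket versioning is enabled as well.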
Total disaster recovery
I am always reminded of the story of Code Spaces. The company no longer has a presence, because it went out of business. Code Spaces based its entire business on AWS, from day-to-day operations to backups and replication, basically everything.
And their account was compromised. Someone got hold of the keys to the kingdom and demanded a ransom, and when that ransom was not paid, they began to destroy all the data, backups and other information that was available to them.
In the enterprise world there are usually strict regulations and a separation of duties between the different roles in the organization: network, storage, compute, application.
It might seem very convenient to have all of these roles under one umbrella, under one hat in the cloud, but in the event of a breach the consequences can be severe, sometimes catastrophic.
The following is recommended:
- Backups should not be stored under the same account used to run your production workloads. Users and API keys should be separate as well.
- Monitor usage and access to your credentials.
Ensure that your cloud provider lets you monitor changes made to your user management (AWS already does: http://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/auth-and-access-control-cw.html).
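One way to act on such monitoring is to scan audit records for changes to user management. The sketch below assumes records shaped like AWS CloudTrail events, with `eventSource` and `eventName` fields; CloudTrail is not mentioned in this article, and the set of event names is an illustrative, incomplete selection.

```python
from typing import Any, Dict, Iterable, List

# IAM API calls that change who can access the account.
# Illustrative only: extend this set to match your own policies.
SENSITIVE_IAM_EVENTS = {
    "CreateUser",
    "DeleteUser",
    "CreateAccessKey",
    "DeleteAccessKey",
    "AttachUserPolicy",
    "PutUserPolicy",
}


def flag_iam_changes(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the audit records that describe a sensitive IAM change."""
    return [
        record
        for record in records
        if record.get("eventSource") == "iam.amazonaws.com"
        and record.get("eventName") in SENSITIVE_IAM_EVENTS
    ]


if __name__ == "__main__":
    sample = [
        {"eventSource": "iam.amazonaws.com", "eventName": "CreateAccessKey"},
        {"eventSource": "iam.amazonaws.com", "eventName": "ListUsers"},
    ]
    for record in flag_iam_changes(sample):
        print("review:", record["eventName"])
```

In practice the flagged records would feed an alert rather than a print statement, so that an unexpected new access key is noticed within minutes instead of after the damage is done.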
- Run regular fire drills
A disaster recovery plan is only viable if you can test that it works. Schedule regular tests to find out how comprehensive your documented plan is, where your initial design falls short, and how long it takes to recover from a catastrophic failure.
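A drill becomes easier to repeat when its outcome is measured the same way every time. The following sketch times a single restore attempt and compares it with a recovery-time target. The names `run_drill` and `DrillResult` are hypothetical, and `restore` stands in for whatever procedure your plan actually documents.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    succeeded: bool
    elapsed_seconds: float
    within_target: bool


def run_drill(restore: Callable[[], bool], target_seconds: float) -> DrillResult:
    """Time one restore attempt and compare it with the recovery-time target."""
    started = time.monotonic()
    try:
        succeeded = bool(restore())
    except Exception:
        # A restore that raises counts as a failed drill, not a crashed one.
        succeeded = False
    elapsed = time.monotonic() - started
    return DrillResult(succeeded, elapsed, succeeded and elapsed <= target_seconds)


if __name__ == "__main__":
    # Placeholder restore step; replace it with the real procedure under test.
    result = run_drill(lambda: True, target_seconds=3600)
    print(result)
```

Recording these results over time shows whether recovery is getting faster or slower as the infrastructure changes.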
- Make use of more than one cloud provider
Spreading your eggs across more than one basket reduces the risk of having all of your business compromised by a single breach. Invest time and resources in spreading your running workloads, or at a minimum in storing your backup data, on a different cloud provider.
- Infrastructure as code
Storing the way that you build your applications, users, roles, replication, databases etc. in a source code repository will allow you to rebuild your entire infrastructure in the event of a disaster. This should also be part of your fire drill regime, to test whether you can actually recover from a catastrophic failure.
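As a small illustration of infrastructure kept as code, the sketch below generates a template for a versioned S3 bucket and serialises it deterministically so that it can be committed to a repository. It assumes AWS CloudFormation as the provisioning tool, which this article does not prescribe; the function names and the logical resource name are hypothetical.

```python
import json
from typing import Any, Dict


def backup_bucket_template(logical_name: str = "BackupBucket") -> Dict[str, Any]:
    """Describe a versioned S3 bucket as a CloudFormation template."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Versioned bucket for backups (illustrative)",
        "Resources": {
            logical_name: {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


def render(template: Dict[str, Any]) -> str:
    """Serialise with sorted keys so the file diffs cleanly in version control."""
    return json.dumps(template, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    print(render(backup_bucket_template()), end="")
```

Because the output is deterministic, a change to the infrastructure shows up as a reviewable diff, and the same file can be used to rebuild the resource during a drill.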
Once upon a time, we built bomb shelters for the unlikely event of an invasion by a foreign power or a nuclear war. Were these needed? That probably depends on where in the world you are located and how likely a war is there.
These kinds of measures are sometimes seen as unnecessary, until one day they really are needed, and if you have not prepared yourself in advance, it is too late.
An AWS backup and recovery plan should be part of your daily operational workflow, something that all your engineers know about. It should not be a big red folder on the shelf that no one ever touches, but a living document and procedure that is tested on a regular basis.
You never know when you might need it.