In a previous article we discussed how to prepare for a catastrophic outage at your cloud provider. That could occur, but it is more likely that you will encounter a less widespread outage, which is why you must also prepare for failures at the application level. Every application has a number of layers that you can protect from outage, some more easily than others.
In addition to protecting those layers, you need to maintain consistent backups at multiple levels, from logical volumes down to individual files. In this article, we will discuss how to protect your system at the application level when running in the cloud.
The Storage Layer
Let’s start with the basics. Later on we provide more advanced information and examples.
Your data and content will be stored in your storage repositories. Writing data to durable storage from within your application is really the only way to ensure the data will still be there if an instance crashes or is deleted (by mistake, of course). Workloads in the cloud are usually transient, so writing to local disk is about as good a strategy as building your house in a swamp. Instances come and go, so your data needs to be stored somewhere outside the instance. Most cloud providers offer two different options for this purpose.
Block Storage
This is a disk attached to your instance as a separate volume, and it survives a reboot. In AWS this is implemented as Elastic Block Store (EBS); in OpenStack it is implemented with Cinder. Both provide essentially the same sort of solution, with nuances and some differences in functionality. From your application's perspective, you will not need to make any significant changes, because the volume appears as a local disk.
For true redundancy, it is important to replicate the data written to block storage to a second copy (preferably in a completely different datacenter or location), so that in the event of a storage volume failure you still have a redundant copy elsewhere. Not all providers have this functionality built into their solutions today, so you may need to implement your own data replication.
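Whatever replication mechanism you build, verify that the replica actually matches the source. Below is a minimal sketch of that verification step using checksums; the local file copy stands in for real replication (which in practice would be rsync, DRBD, or snapshot shipping), and all paths are illustrative:

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def replicate_and_verify(source: Path, replica: Path) -> bool:
    """Copy source to replica (a stand-in for real replication) and
    confirm the copies are byte-identical via checksum comparison."""
    shutil.copy2(source, replica)
    return sha256_of(source) == sha256_of(replica)

# Demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data.bin"
    dst = Path(tmp) / "data.replica"
    src.write_bytes(b"application state worth keeping")
    assert replicate_and_verify(src, dst)
```

The checksum comparison is the important part: a replication job that completes without error is not the same as a replica that is provably identical to the source.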
Object Storage
This is a filesystem-like service in the cloud where you read and write objects. It is not directly attached to an instance and is usually accessed through an API. In AWS this is implemented with S3, and in OpenStack with Swift. As opposed to block storage, an object written to cloud storage is almost always replicated across multiple nodes, eliminating the immediate need to replicate it yourself (because replication is built into the solution). However, the replication mechanism is not suitable for all types of data, because the solution is only eventually consistent. Here too, to ensure proper redundancy, you should verify that the data is replicated across different geographical locations.
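To make the built-in replication concrete, here is a toy, in-memory model of an object store that fans every write out to several "regions." All names are illustrative; a real store such as S3 or Swift does this internally (and asynchronously, which is the root of eventual consistency), not in your application code:

```python
class ReplicatedObjectStore:
    """Toy object store: every put() is fanned out to all regions,
    so a read can be served from any surviving region."""

    def __init__(self, regions):
        self.buckets = {region: {} for region in regions}

    def put(self, key: str, data: bytes) -> None:
        # Synchronous fan-out for simplicity; real object stores
        # replicate asynchronously, which is why they are only
        # eventually consistent.
        for bucket in self.buckets.values():
            bucket[key] = data

    def get(self, key: str, region: str) -> bytes:
        return self.buckets[region][key]

store = ReplicatedObjectStore(["us-east", "eu-west"])
store.put("backup/db.dump", b"\x00\x01")
# The object is readable from either region, so losing one
# region does not lose the data.
assert store.get("backup/db.dump", "eu-west") == b"\x00\x01"
```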
The Database Layer
We talked about the need for persistent storage above, but some applications need more than that: they need to save state somewhere, usually in some sort of cache or database. Addressing this from an architectural perspective requires you to deploy some kind of database solution. (Some cloud providers offer a DBaaS, so you don't have to deploy such a solution yourself.) The solution, of course, should be redundant, without a single point of failure, and should scale to meet your growing needs.
This category includes, for example, Amazon RDS or Aurora, which provide automated out-of-the-box backups. However, it's important to consider options and features such as the backup retention period. RDS supports a maximum retention period of 35 days, which means you will need to perform your own backups if you have specific regulatory requirements, such as HIPAA, that require records to be archived for years. In addition, AWS native backup includes replication across availability zones. Here too, if backups are required across regions or continents, you need to create your own custom solution.
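Because the automated retention window is capped, a long-term archive needs its own retention policy. Here is a sketch of one possible rule: keep every snapshot inside the short window, plus the first snapshot of each month as a long-term monthly archive. The rule and cutoff values are illustrative choices, not anything AWS provides:

```python
from datetime import date

def snapshots_to_keep(snapshots, today, short_days=35):
    """snapshots: iterable of snapshot dates.
    Keep everything newer than short_days, plus the earliest
    snapshot of each (year, month) as a long-term monthly archive."""
    keep = set()
    monthly_seen = set()
    for snap in sorted(snapshots):
        if (today - snap).days <= short_days:
            keep.add(snap)
        elif (snap.year, snap.month) not in monthly_seen:
            monthly_seen.add((snap.year, snap.month))
            keep.add(snap)
    return keep

today = date(2016, 6, 1)
snaps = [date(2016, 5, 30), date(2016, 5, 1), date(2016, 4, 2),
         date(2016, 4, 20), date(2016, 1, 15)]
kept = snapshots_to_keep(snaps, today)
# Recent snapshots survive the short window; older months keep
# only their earliest snapshot.
```

In practice the dates would come from listing your stored snapshots, and anything not in the keep set would be copied to long-term archive storage (or deleted, once archived) by a scheduled job.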
Mature SQL databases such as Oracle, MySQL, and PostgreSQL have well-established backup solutions, but they do not scale easily to leverage the global presence of the public cloud vendors. NoSQL databases (such as MongoDB, Cassandra, and CouchDB), on the other hand, are easier to distribute and scale. Your choice of backup solution will depend on your architecture and application needs.
Backup Consistency
One of the most important aspects of application-level backup is consistency: the backup copy must represent a single, stable point in time. An inconsistent backup can fail to restore properly when needed, ultimately harming the smooth continuation of service. For that reason, we should consider consistency on multiple levels: the infrastructure level (i.e., instances or storage), the logical volume, the file, and the application.
For example, when dealing with a specific machine crash, the recovery process should be able to return the machine to its exact previous state; the same applies to all the other levels. To ensure consistency on each level, use the relevant building block or tool. An Amazon EBS snapshot, which provides a point-in-time snapshot of a whole volume, is a good example of a mechanism that supports consistency on the infrastructure level. Recovering from a snapshot returns the environment to a healthy state in terms of EBS integrity.
Another example is the Linux Logical Volume Manager (LVM), which supports backup consistency at the logical volume level. LVM allows users to create a "snapshot volume," a special type of volume that presents all the data that was in the original volume at the time the snapshot was created. This means you can back up that volume without worrying about data changing while the backup is going on, and you don't have to take the database volume offline while the backup takes place.
Throughout this article we have repeated the fact that things need to be deployed in multiple locations. The reason is obvious: disaster recovery protocols require a backup site and a way to fail over in the event of a catastrophe. At the same time, in today's world, a dormant site that is never utilized is a waste of precious resources. Therefore, some solutions and technologies (such as NoSQL databases or cache software such as Redis) support deployments and architectures that span multiple locations and operate in a fully active-active fashion across these geographies.
Service Discovery
Before your applications can talk to each other, they need to know how to find each other.
Maintaining a list of hosts and their IP addresses on each and every server is not sustainable at scale, but this problem was addressed long ago with DNS. Today, we expect a whole lot more from our basic services in the cloud, and DNS is no exception. Why settle for basic name resolution when you can do much more, such as service registration, tagging, and versioning?
Two examples of such solutions are etcd and Consul. With either of them, your workloads in the cloud can find each other easily and discover the services offered by other instances (for example, you can advertise to all other applications in your cloud environment that you are running a Redis database of version X).
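The register-and-discover pattern that etcd and Consul implement can be illustrated with a toy registry. Real deployments go through their HTTP/gRPC APIs and add TTLs, health checks, and consensus-replicated state; every name below is invented for illustration:

```python
class ServiceRegistry:
    """Toy service registry: services advertise themselves with tags,
    and clients look them up by name (optionally filtered by tag)."""

    def __init__(self):
        self.services = {}

    def register(self, name, address, **tags):
        self.services.setdefault(name, []).append(
            {"address": address, **tags})

    def discover(self, name, **required_tags):
        return [s for s in self.services.get(name, [])
                if all(s.get(k) == v for k, v in required_tags.items())]

registry = ServiceRegistry()
# Advertise two Redis instances, including their versions, to the
# rest of the environment.
registry.register("redis", "10.0.1.12:6379", version="3.0")
registry.register("redis", "10.0.2.7:6379", version="2.8")
matches = registry.discover("redis", version="3.0")
assert [m["address"] for m in matches] == ["10.0.1.12:6379"]
```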
And why geographical redundancy? For the obvious reasons: you need a service, similar to 411 directory assistance, to map the distributed services in your cloud. Both of the examples above can be deployed in a highly available fashion and, staying true to our mantra, geo-dispersed across multiple physical locations.
Someone once said that "backup is nothing, restore is everything" – and that is so true. Unless you actually test that all your procedures and detailed plans work, you can't confidently say that you have a robust solution that can survive an outage.
Netflix has invested a huge amount of time and money in constantly improving its systems, even reaching the stage where code automatically and randomly destroys parts of its cloud infrastructure to test whether its systems are resilient and can survive an outage as designed.
Netflix has open-sourced its tools – Chaos Monkey and Chaos Kong – which are intended to test multiple scenarios, up to and including a full AWS region failure. Tests such as these should be run on a regular basis; ideally, they should run continuously to exercise your architecture and deployments.
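You do not need Netflix's tooling to start: the core idea fits in a few lines. Here is a toy chaos test that kills a random replica and asserts the service still has quorum; the cluster model is invented for illustration, not how Chaos Monkey itself works:

```python
import random

class Cluster:
    """Toy replicated service: healthy while a majority of
    the original replicas survive."""

    def __init__(self, replicas):
        self.replicas = set(replicas)
        self.total = len(self.replicas)

    def kill_random(self, rng):
        # Chaos step: remove one randomly chosen live replica.
        self.replicas.discard(rng.choice(sorted(self.replicas)))

    def has_quorum(self):
        return len(self.replicas) > self.total // 2

rng = random.Random(42)  # seeded so the "chaos" is reproducible here
cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.kill_random(rng)     # chaos: one replica gone
assert cluster.has_quorum()  # two of three remain; service survives
```

The valuable part is the assertion: a chaos test is a failure injection plus an explicit check that the service still meets its contract afterwards, run regularly rather than once.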
Almost every business today uses one cloud provider or another for quicker time to market and better service to its customers. As a result, you should understand how your applications will be affected in the case of an outage.
We have summarized some of the architectural points you need to address, where they fit well with your solution, and where you will need additional layers to assist you on your journey to redundancy and help you prepare your plan for recovery from a catastrophic event.