Best Practices for the Cloud

Topics: Design for Failure and Nothing Will Fail; Decouple Your Components; Implement Elasticity; Automate Your Infrastructure; Think Parallel.

In this section, you will learn about design best practices that will help you build an application in the cloud.

Design For Failure & Nothing Will Fail

Rule of thumb: Be a pessimist when designing architectures in the cloud; assume things will fail. In other words, always design, implement and deploy for automated recovery from failure.

In particular, assume that your hardware will fail. Assume that outages will occur. Assume that some disaster will strike your application. Assume that you will be slammed with more than the expected number of requests per second someday.

If you accept that things will fail over time, incorporate that thinking into your architecture, and build mechanisms to handle that failure on a scalable infrastructure before disaster strikes, you will end up with a fault-tolerant architecture that is optimized for the cloud.

Questions that you need to ask yourself:

What happens if a node in your system fails? How do you recognize that failure? How do you replace that node? What kinds of scenarios do you have to plan for? What are your single points of failure? If a load balancer is sitting in front of an array of application servers, what if that load balancer fails? If there are masters and slaves in your architecture, what if the master node fails? How does the failover occur, and how is a new slave instantiated and brought into sync with the master?

Just as you design for hardware failure, you also have to design for software failure.

Questions that you need to ask:

What happens to your application if the services it depends on change their interface? What if a downstream service times out or returns an exception? What if the cache keys grow beyond the memory limit of an instance?

Build mechanisms to handle these failures. For example, the following strategies can help in the event of failure:

  1. Have a coherent backup and restore strategy for your data and automate it.
  2. Build process threads that resume on reboot.
  3. Allow the state of the system to re-sync by reloading messages from queues.
  4. Keep pre-configured and pre-optimized virtual images to support strategies 2 and 3 on launch/boot.
  5. Avoid in-memory sessions or stateful user context; move that state to data stores.

Good cloud architectures should be impervious to reboots and re-launches. With a combination of Amazon SQS and Amazon SimpleDB, the overall controller architecture becomes very resilient to the types of failures listed in this section.

For instance, if the instance on which the controller thread was running dies, it can be brought back up and resume its previous state as if nothing had happened. This was accomplished by creating a pre-configured Amazon Machine Image that, when launched, dequeues all the messages from the Amazon SQS queue and reads their states from an Amazon SimpleDB domain.
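The resume-after-failure behavior described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual controller: DurableQueue and StateStore are hypothetical in-memory stand-ins for a durable queue and an external data store (they are not the Amazon SQS or Amazon SimpleDB APIs), and Controller, STEPS and crash_after are names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DurableQueue:
    """Hypothetical in-memory stand-in for a durable message queue."""
    messages: dict = field(default_factory=dict)  # message id -> body

    def send(self, msg_id, body):
        self.messages[msg_id] = body

    def pending(self):
        # Messages stay in the queue until they are explicitly deleted.
        return list(self.messages.items())

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)


@dataclass
class StateStore:
    """Hypothetical in-memory stand-in for an external data store."""
    items: dict = field(default_factory=dict)  # message id -> steps completed

    def get(self, key, default=0):
        return self.items.get(key, default)

    def put(self, key, value):
        self.items[key] = value


STEPS = ["fetch", "process", "store"]  # illustrative phases of one task


class Controller:
    """Holds no durable state itself; progress lives in the queue and store."""

    def __init__(self, queue, store):
        self.queue = queue
        self.store = store
        self.log = []  # steps executed by this controller instance

    def run(self, crash_after=None):
        """Work through pending messages; crash_after simulates instance death."""
        executed = 0
        for msg_id, body in self.queue.pending():
            start = self.store.get(msg_id)  # resume after the last checkpoint
            for index in range(start, len(STEPS)):
                if crash_after is not None and executed >= crash_after:
                    return False  # simulated failure, nothing is cleaned up
                self.log.append((body, STEPS[index]))
                self.store.put(msg_id, index + 1)  # checkpoint each step
                executed += 1
            self.queue.delete(msg_id)  # delete only after the task completes
        return True


queue, store = DurableQueue(), StateStore()
queue.send("m1", "job-a")
queue.send("m2", "job-b")

first = Controller(queue, store)
first_finished = first.run(crash_after=4)  # dies partway through job-b

second = Controller(queue, store)  # a fresh instance, as if relaunched
second_finished = second.run()
```

Because each step is checkpointed in the store and a message is deleted only once its task is complete, the second controller picks up job-b at the step where the first one stopped, without repeating finished work.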

Designing with the assumption that the underlying hardware will fail prepares you for the day it actually does. This design principle will help you design operations-friendly applications.

If you can extend this principle to proactively measure and balance load dynamically, you might be able to deal with variance in network and disk performance that exists due to the multi-tenant nature of the cloud.
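One way to picture "measure and balance load dynamically" is a router that tracks observed latency per backend and prefers the currently fastest one. The sketch below is only an illustration of that idea; LatencyAwareBalancer, its alpha smoothing factor and the node names are all assumptions made for the example, not part of any AWS service.

```python
class LatencyAwareBalancer:
    """Routes each request to the backend with the lowest smoothed latency."""

    def __init__(self, backends, alpha=0.3):
        self.alpha = alpha  # weight given to the newest measurement
        self.latency = {name: None for name in backends}

    def choose(self):
        # Backends without a measurement are tried first so each gets sampled.
        unmeasured = [b for b, v in self.latency.items() if v is None]
        if unmeasured:
            return unmeasured[0]
        return min(self.latency, key=self.latency.get)

    def record(self, backend, seconds):
        previous = self.latency[backend]
        if previous is None:
            self.latency[backend] = seconds
        else:
            # Exponentially weighted moving average of observed latency.
            self.latency[backend] = (
                self.alpha * seconds + (1 - self.alpha) * previous
            )


balancer = LatencyAwareBalancer(["node-a", "node-b"])
balancer.record("node-a", 0.050)
balancer.record("node-b", 0.020)
before = balancer.choose()        # node-b is currently faster

balancer.record("node-b", 0.500)  # node-b slows down (e.g. a noisy neighbor)
after = balancer.choose()         # traffic shifts to node-a
```

A real deployment would also need to keep sampling the slower backends occasionally, since this simple version never refreshes the estimate of a node it has stopped choosing.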

Tactics for implementing the above best practice:

  1. Failover Gracefully Using Elastic IPs: Elastic IP is a static ...