Mechanical Advantage

Learn about the control plane, its role, when it is needed, and what it costs. Also learn about mechanical advantage and what happens when automation goes wrong.

Control plane

In the previous chapters we worked our way up from bare metal through layers of abstraction and virtualization to create a sea of instances running on machines. We’ve got software scattered around like an upended box of LEGO blocks. It’s up to the control plane to put these pieces in the right place and knit them together into a somewhat coherent whole.

The control plane encompasses all the software and services that run in the background to make production load successful. One way to think about it is this: if production user data passes through it, it’s production software. If its main job is to manage other software, it’s the control plane.

A challenge we’ll face in this chapter is that the solution space is not well partitioned among tools, packages, and vendors. It’s nowhere near as simple as picking one download from each column. There are overlaps and gaps. Not every combination will work together. No single package does everything. We are left with a lot of integration effort and plenty of trial and error.

Tiers of control

As we look at the control plane, keep in mind that every part of this is optional. We can do without every piece of it, if we’re willing to make some trade-offs. For example, logging and monitoring help with postmortem analysis, incident recovery, and defect discovery. Without them, all those will take longer or simply not be done. If we can live with extended outages, or if it’s okay to find out our software is down by getting a call from the CEO, then we don’t need that part of the control plane.

In a more palatable example, we don’t need IP management software if we’re running a static network on physical hardware. Up to a certain scale, this is probably acceptable and may be more cost-effective. Once we move to an overlay network with multiple VLANs and software switches, we’ll go mad without IP management.

Cost of implementing control plane

The more sophisticated the control plane becomes, the more it costs to implement and operate. Every piece represents ongoing operational cost. Think of it like trading off the fixed cost of dedicated people versus the variable cost of speeding up deployments, incident recovery, provisioning services, and so on. If we’re small and the rate of change is low, we may find it’s not worth it. If we can amortize the cost of a platform team across hundreds of services deployed hundreds of times per year, then it makes a lot more sense.
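To make the amortization argument concrete, here is a rough back-of-the-envelope sketch. Every figure in it (the team cost, the hours saved per deployment, the hourly rate) and the function name `platform_net_benefit` are made-up assumptions for illustration, not numbers from this chapter. It only counts deployment time, so a real estimate would also need incident recovery, provisioning, and so on.

```python
def platform_net_benefit(team_cost_per_year, services,
                         deploys_per_service_per_year,
                         hours_saved_per_deploy, hourly_rate):
    """Yearly savings from faster deployments minus the fixed cost of a platform team."""
    total_deploys = services * deploys_per_service_per_year
    savings = total_deploys * hours_saved_per_deploy * hourly_rate
    return savings - team_cost_per_year


# Hypothetical small shop: 5 services, each deployed monthly.
small = platform_net_benefit(600_000, 5, 12, 2, 100)

# Hypothetical large shop: 200 services, each deployed 200 times per year.
large = platform_net_benefit(600_000, 200, 200, 2, 100)

print(small)  # negative: the team costs far more than it saves
print(large)  # positive: the same fixed cost is spread over 40,000 deployments
```

With these assumed inputs, the same fixed cost is a clear loss at small scale and a clear win at large scale, which is the shape of the trade-off described above.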

This cost equation isn’t static, either. New open-source operations tools are released nearly every day. These are often created by a large-scale company scratching its own itch, but these companies release tools and libraries that lift up everyone else in the industry. When the first edition of this course was published in 2007, logging and monitoring was almost entirely a commercial market. Now it is almost entirely open source. At that time, automated provisioning of operating systems required either a large commercial package (six figures in license cost, six more in implementation cost) or a complete roll-your-own approach. Today, the hardest problem is choosing among all the fantastic alternatives!

Bottom line

Don’t assume you must install one of everything you read about. But also keep evaluating the overhead and difficulty of different solutions. The landscape changes pretty quickly.

Mechanical advantage

Mechanical advantage is the multiplier on human effort that simple machines provide. With mechanical advantage, a person can move something much heavier than themselves. With a long-enough lever and a place to stand, Archimedes claimed he could move Earth itself.

When automation goes wrong

The kicker about mechanical advantage is that it works for good or for ill. High leverage allows a person to make large changes with less effort. We hope that those are mostly beneficial, such as releasing new software to a fleet of ten thousand machines. Unfortunately, there are many examples of automation gone wrong. Back in Force Multiplier, we saw how Reddit suffered from overeager automation. The Governor pattern aims to reduce the harm when automation goes the wrong way.
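As a minimal sketch of one way to limit that harm, the snippet below caps how many unsafe actions (here, instance shutdowns) automation may take within a time window and holds the rest for a human to review. This is one possible reading of the idea, not the pattern's definitive implementation. The class `Governor`, its parameters, and the helper `shut_down_instances` are all hypothetical names, and no real cloud API is called.

```python
import time


class Governor:
    """Caps the number of unsafe actions allowed within a sliding time window.

    Hypothetical sketch: names and parameters are illustrative assumptions.
    """

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self._timestamps = []       # times of recently allowed actions

    def allow(self):
        now = self.clock()
        # Forget actions that have aged out of the window.
        self._timestamps = [t for t in self._timestamps
                            if now - t < self.window_seconds]
        if len(self._timestamps) >= self.max_actions:
            return False            # over budget: stop and escalate to a human
        self._timestamps.append(now)
        return True


def shut_down_instances(instance_ids, governor):
    """Split requested shutdowns into those allowed now and those held back."""
    stopped, held = [], []
    for instance_id in instance_ids:
        if governor.allow():
            stopped.append(instance_id)  # real code would call the provider here
        else:
            held.append(instance_id)     # left running until someone approves
    return stopped, held


# Allow at most 2 shutdowns per 60 seconds; the rest wait for review.
governor = Governor(max_actions=2, window_seconds=60)
stopped, held = shut_down_instances(["a", "b", "c", "d"], governor)
print(stopped, held)
```

The limit applies only to the dangerous direction. Actions that are cheap to undo, such as starting extra instances, would not need to pass through it.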

Let’s consider an example from a real outage that affected many people and companies. On February 28, 2017, Amazon Web Services’ S3 service in the US-East-1 region went down. Tens of thousands of companies suffered outages due to their own hard dependencies on S3. Large parts of the Net pretty much went dark. Operators went nuts. Users hammered status sites until those crumbled too. The total disruption in S3 lasted about two hours, but it was many more hours before all the S3 consumers were healthy. It was reboot day for a big chunk of the SaaS market. Amazon, like other service providers, has learned that customer confidence can really be shaken with an event like this. One of the most important pieces of communication afterward is a postmortem review of the outage. Every postmortem review has three important jobs to do:

  1. Explain what happened
  2. Apologize
  3. Commit to improvement

Amazon’s write-up does a good job at all three of these [1]. There are some really interesting lessons for us in that postmortem.
