Introduction

Get an overview of the contents and understand the structure of this section.

Writing software that works in perfect conditions is easy. It would be nice if we never had to worry about network latency, service timeouts, storage outages, misbehaving applications, users sending bad arguments, security issues, or any of the real-life scenarios we find ourselves in.

Things tend to fail in the following three ways:

  • Immediately

  • Gradually

  • Spectacularly

Immediately is usually the result of a change to application code that causes a service to die on startup or when receiving traffic to an endpoint. Most development test environments or canary rollouts catch these before any real problems occur in production. This type is generally trivial to fix and prevent.

Gradually is usually the result of some type of memory leak, thread/goroutine leak, or ignoring design limitations. These problems build up over time and begin causing problems that result in services crashing or growth in latency at unacceptable levels. Many times, these are easy fixes caught during canary rollouts once the problem is recognized. In the case of design issues, fixes can require months of intense work to resolve. Some rare versions of this have what we call a cliff failure: gradual growth hits a limitation that cannot be overcome by throwing more resources at the problem. That type of problem belongs to our next category.
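Before moving on to that last category, here is a minimal sketch of the kind of goroutine leak described above; the handler and its names are hypothetical, but the pattern is common: each request leaves one goroutine blocked forever, so the goroutine count and the memory it pins grow gradually under load.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// handle simulates a request handler that launches a worker goroutine
// but never reads from the channel that worker writes to. Each call
// therefore leaks exactly one goroutine.
func handle() {
	results := make(chan int) // unbuffered and never drained
	go func() {
		results <- doWork() // blocks forever: nothing ever receives
	}()
	// The handler returns without reading from results.
}

func doWork() int {
	time.Sleep(10 * time.Millisecond)
	return 42
}

func main() {
	for i := 0; i < 1000; i++ {
		handle()
	}
	time.Sleep(time.Second)
	// Prints roughly 1001: main plus one leaked goroutine per call.
	fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}
```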

That category is spectacularly. This is when we find a problem in production that is causing mass failures when a few moments ago everything was working fine. Cellphones everywhere start pinging alerts, dashboards go red, dogs and cats start living together, mass hysteria! This could be the rollout of a buggy service that overwhelms our network, the death of a caching service we depend on, or a type of query that crashes our service. These outages cause mass panic, test our ability to communicate across teams efficiently, and are the ones that show up in news articles.

This section will focus on designing infrastructure tooling to survive the chaos. The most spectacular failures of major cloud companies have often been the results of infrastructure tooling, from Google Site Reliability Engineering (Google SRE) erasing all the disks at their cluster satellites to Amazon Web Services (AWS) overwhelming their network with infrastructure tool remote procedure calls (RPCs).

In this section, we will look at safe ways for first responders (FRs) to stop automation, how to write idempotent workflow tools, packages for incremental backoff of failed RPCs, pacing limiters for rollouts, and much more.
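As a taste of one of these topics, here is a minimal sketch of retrying a failed RPC with incremental backoff. The retry helper and its parameters are hypothetical stand-ins, not the packages built later in this section; the point is simply that each failure waits longer before the next attempt and respects context cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op until it succeeds, the context is cancelled, or
// maxAttempts is reached. The wait between attempts doubles after
// each failure (a simple incremental backoff).
func retry(ctx context.Context, maxAttempts int, op func(context.Context) error) error {
	wait := 100 * time.Millisecond
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(wait):
			wait *= 2 // back off further on the next failure
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	calls := 0
	err := retry(ctx, 5, func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("transient RPC failure") // simulate a flaky call
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```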

To do this, we will be introducing concepts and packages that will be built into a generic workflow system that we can use to further our education. The system will be able to take requests to do some type of work, will validate that the parameters are correct, validate the request against a set of policies, and then execute that work. In this model, clients (which can be command-line interface (CLI) applications or services) detail work to be done via a protocol buffer and send it to the server. The workflow system does all the actual work.
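To make that flow concrete, here is a minimal sketch of the request path just described: validate the arguments, check the request against policy, and only then execute the work. The types and names here (WorkReq, Exec, checkPolicy) are hypothetical stand-ins, not the actual protocol buffer definitions this section builds.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// WorkReq stands in for the protocol buffer message a client
// (a CLI application or another service) sends to describe the work.
type WorkReq struct {
	Name    string
	Actions []string
}

// Server is the workflow system that receives requests and does the work.
type Server struct{}

// Exec validates the request's arguments, checks it against policy,
// and only then executes the work.
func (s *Server) Exec(ctx context.Context, req WorkReq) error {
	if err := validate(req); err != nil {
		return fmt.Errorf("invalid request: %w", err)
	}
	if err := checkPolicy(req); err != nil {
		return fmt.Errorf("policy rejected request: %w", err)
	}
	return s.run(ctx, req)
}

func validate(req WorkReq) error {
	if req.Name == "" || len(req.Actions) == 0 {
		return errors.New("a request needs a name and at least one action")
	}
	return nil
}

func checkPolicy(req WorkReq) error {
	// A real policy engine would restrict who may run what, and how fast.
	return nil
}

func (s *Server) run(ctx context.Context, req WorkReq) error {
	for _, a := range req.Actions {
		fmt.Println("executing:", a)
	}
	return nil
}

func main() {
	s := &Server{}
	req := WorkReq{Name: "example-rollout", Actions: []string{"drain", "upgrade", "verify"}}
	fmt.Println("err:", s.Exec(context.Background(), req))
}
```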

Structure

We are going to cover the following main topics in this section:
