Targeting Chaos

Learn about randomness in chaos testing, faults and failures, cunning malevolent intelligence, and automating and repeating chaos experiments.

We'll cover the following...

Randomness
Faults and failures
When to avoid randomness?
Cunning malevolent intelligence
Automate and repeat

How much chaos to apply

Faults and failures

We’re looking for faults that lead to failures. Many faults won’t cause failures; in fact, on any given day, most faults don’t (more about that later in this chapter). When we inject faults into service-to-service calls, we’re searching for the crucial calls. As with any search problem, we have to confront the challenge of dimensionality.
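To make the distinction between faults and failures concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the four call names, the `handle_request` function, and its fallback logic do not come from any real system. It injects faults into a toy request handler and reports which sets of faults actually produce a failure (a 500 response).

```python
from itertools import combinations


class InjectedFault(Exception):
    """Raised in place of a real call when a fault is injected."""


# Hypothetical service-to-service calls made while handling one request.
CALLS = ["recommendations", "cache", "database", "pricing"]


def call(name, faulted):
    """Simulate a downstream call; fail if a fault is injected on it."""
    if name in faulted:
        raise InjectedFault(name)
    return f"{name}-ok"


def handle_request(faulted):
    """Toy request handler. Returns an HTTP-style status code."""
    try:
        # Optional feature: a fault here is absorbed silently.
        try:
            call("recommendations", faulted)
        except InjectedFault:
            pass
        # The cache has a fallback: the database is only called on a miss.
        try:
            call("cache", faulted)
        except InjectedFault:
            call("database", faulted)
        # No fallback at all: this is a crucial call.
        call("pricing", faulted)
    except InjectedFault:
        return 500
    return 200


def failing_fault_sets(size):
    """Return every set of `size` simultaneous faults that causes a failure."""
    return [
        set(combo)
        for combo in combinations(CALLS, size)
        if handle_request(set(combo)) == 500
    ]


if __name__ == "__main__":
    print("single faults that fail:", failing_fault_sets(1))
    print("pairs of faults that fail:", failing_fault_sets(2))
```

In this toy setup, three of the four single faults are harmless, and only the `pricing` fault causes a failure. The pair of `cache` and `database` faults fails even though neither does on its own, which hints at why the number of combinations to search grows so quickly.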

Suppose there’s a partner data load process that runs every Tuesday. A fault during one part of that process causes bad data in the database. Later, when using that data to present an API response, a service throws an exception and returns a 500 response code. How likely are we to find that problem via random search? Not very likely. A random injection would have to hit the right step of a job that runs only once a week, and the failure would surface only later, far from the fault that caused it.
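A rough back-of-the-envelope calculation shows how unlikely that is. The numbers below are assumptions chosen purely for illustration and do not come from the text: the vulnerable step of the data load is taken to last one hour per week, the system is taken to have 200 places where a fault can be injected, and each experiment is assumed to pick one hour of the week and one injection point uniformly at random.

```python
# All figures are illustrative assumptions, not measurements.
HOURS_PER_WEEK = 7 * 24          # 168 possible hours to inject a fault
VULNERABLE_HOURS = 1             # assumed length of the fragile load step
INJECTION_POINTS = 200           # assumed number of injectable call sites


def hit_probability():
    """Chance that one random experiment lands on the right call at the right time."""
    right_time = VULNERABLE_HOURS / HOURS_PER_WEEK
    right_place = 1 / INJECTION_POINTS
    return right_time * right_place


def probability_of_at_least_one_hit(experiments):
    """Chance that at least one of `experiments` independent trials hits the fault."""
    miss = 1 - hit_probability()
    return 1 - miss ** experiments


if __name__ == "__main__":
    print(f"single experiment: {hit_probability():.6f}")
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} experiments: {probability_of_at_least_one_hit(n):.4f}")
```

Under these assumptions, a single experiment has a 1 in 33,600 chance of hitting the fault, and even a thousand independent experiments find it only about 3% of the time. The estimate is optimistic, too, because it ignores the delay between the bad data load and the 500 response, which makes the failure harder to trace back to the injected fault.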

When to avoid randomness?

Randomness works well at the beginning because the search space for faults is densely populated. As we progress, the search space becomes sparser, but not uniformly so. Some services, some network segments, and some combinations of state and request will still have latent killer bugs. But imagine trying to exhaustively search a $2^n$ ...
