Introduction

Learn how to build a supervisor for the game module.

Game supervision

We all work with computers, and we know it's inevitable that things will go wrong, sometimes really wrong. Despite our best-laid plans, state goes bad, systems raise exceptions, and servers crash. These failures often seem to come out of nowhere.

To combat these inevitable problems, we need to boost fault tolerance. We need to isolate failures as much as possible, handle them, and have the system carry on as a whole.

Elixir and OTP provide a world-class mechanism to handle problems and move on. This mechanism is called process supervision. Process supervision means we can have specialized processes that watch other processes and restart them when they crash.

The mechanism we use to define and spawn these specialized processes is the supervisor.

We'll now build our supervisor for the Game module. We'll make sure a supervisor process starts when we start the game engine, and we'll use that supervisor process to create and supervise each game process.

Along the way, we look at some ideas about fault tolerance, examine different ways we can spawn processes in Elixir, and take a look at supervisor behavior.

The first step in this path is understanding the ways different languages provide fault tolerance.

Fault tolerance

Erlang and Elixir have reputations for tremendous fault tolerance. This is well deserved, but not because they prevent errors. Instead, they give us the tools to recover gracefully from any errors that crop up at runtime.

Almost all languages, including Elixir, have built-in mechanisms to handle exceptions. These require that we identify risky code in advance, wrap it in a block that tries to execute it, and provide a block to rescue the situation if the code fails. In most languages, this kind of exception handling is essential, but in Elixir, we hardly ever have to reach for it.
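For comparison, here is a minimal sketch of that try-and-rescue style in Elixir. The module and function names (`RescueSketch`, `parse/1`) are hypothetical and chosen only for illustration; the point is that we had to know in advance that `String.to_integer/1` might raise.

```elixir
defmodule RescueSketch do
  # Hypothetical helper: wraps a risky call and rescues the one
  # failure we anticipated (a string that isn't an integer).
  def parse(text) do
    try do
      {:ok, String.to_integer(text)}
    rescue
      ArgumentError -> {:error, :not_an_integer}
    end
  end
end

RescueSketch.parse("42")    # returns {:ok, 42}
RescueSketch.parse("oops")  # returns {:error, :not_an_integer}
```

Any failure we did not anticipate would still crash the calling process, which is exactly the gap that supervision is meant to cover.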

The OTP team at Ericsson took an interesting tack around this pattern. For them, extreme fault tolerance wasn’t a “nice to have.” It was a critical requirement of the language. Telephone utilities have very stringent uptime requirements. The phones need to work no matter what, even in the event of a natural disaster.

The team reasoned that it’s nearly impossible to predict all possible failures in advance, so they decided to focus on recovering from the failure instead. They wanted to code for the happy path and have a separate mechanism to get things back on track when the inevitable errors happened.

The design they came up with is the supervisor behavior. It extracts error-handling code from business logic into its own modules. Supervisor modules spawn supervisor processes. These processes link to other processes, watch for failure, and restart those linked processes if they crash.
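As a rough sketch of what such a module can look like, the example below defines a supervisor with a single child. The names `SketchSupervisor` and `:counter` are assumptions made for illustration, and the child is a plain `Agent` rather than a game process; this is not the supervisor we build for the Game module.

```elixir
defmodule SketchSupervisor do
  use Supervisor

  # Spawns the supervisor process and registers it under the module name.
  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    # One child: an Agent holding 0, registered as :counter (hypothetical).
    children = [
      %{id: :counter, start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]}}
    ]

    # :one_for_one restarts only the child that crashed.
    Supervisor.init(children, strategy: :one_for_one)
  end
end

{:ok, _supervisor} = SketchSupervisor.start_link([])
Agent.get(:counter, fn state -> state end)  # returns 0
```

Notice that the module contains no business logic at all. It only describes which processes to start and how to react when one of them fails.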

This separation of concerns makes our code clearer and easier to maintain. It keeps our business logic free of diversions for handling exceptions. We end up writing more confident code that assumes success, knowing that supervisors have our back when things go wrong.

The supervisor behavior is based on ideas that build on and reinforce each other:

  1. Most runtime errors are transient and happen because of a bad state.
  2. The best way to fix a bad state is to let the process crash and restart it with a good state.
  3. Restarts work best on systems like the BEAM that have small, independent processes. Independent processes let us isolate errors to the smallest area possible and minimize any disruption during restarts.
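These three ideas can be seen together in a small experiment. The sketch below is illustrative only: the `RestartSketch` module, the `:score` name, and the use of an `Agent` as the supervised process are all assumptions. It puts a process into a bad state, kills it, and then reads the state of the replacement process that the supervisor starts.

```elixir
defmodule RestartSketch do
  # Hypothetical helper: polls until a new process is registered under `name`.
  def await_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid)
    end
  end
end

# Supervise one Agent whose known-good initial state is 0.
children = [
  %{id: :score, start: {Agent, :start_link, [fn -> 0 end, [name: :score]]}}
]

{:ok, _supervisor} = Supervisor.start_link(children, strategy: :one_for_one)

# Put the process into a bad state.
Agent.update(:score, fn _state -> :corrupted end)

# Simulate a crash and wait until the old process is really gone.
old_pid = Process.whereis(:score)
ref = Process.monitor(old_pid)
Process.exit(old_pid, :kill)

receive do
  {:DOWN, ^ref, :process, _pid, _reason} -> :ok
end

# The supervisor starts a replacement, which begins from the good state again.
new_pid = RestartSketch.await_restart(:score, old_pid)
Agent.get(:score, fn state -> state end)  # returns 0
```

Only the `:score` process was affected by the crash. The supervisor and the calling process kept running throughout.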

History suggests that the OTP team's approach is more effective than relying on exception handling alone. Systems built this way are more fault tolerant not because we anticipated specific errors, but because we have the tools to recover gracefully from errors when they inevitably crop up.

We're going to look at process supervision from all angles. We'll start with the different ways we can spawn new processes and their implications for how processes interact. This will lead us to the way supervisors interact with the processes they supervise. We'll see the different strategies supervisors use for restarting crashed processes, and then we'll talk about different ways of recovering state after processes restart.

To illustrate how process supervision works, let’s start with the different ways we can spawn processes and the effect that has on the way processes interact.
