Building Systems With an Emergency Stop

Learn how to integrate an emergency-stop package into tooling.

Systems are going to run amok. This is a simple truth that we need to come to terms with early in infrastructure tooling development.

When we are a small company, there is usually a very small group of people who understand the systems well and watch over any changes to handle problems. If those people are good, they can quickly respond to a problem. Usually, these people are the developers of the software.

As companies grow, jobs become more specialized. The larger the company, the more specialized the jobs. As that happens, the first responders to major issues often lack the access or knowledge to deal with these problems.

This can create a critical gap between recognition of a major problem and stopping the problem from getting worse.

This is where giving first responders the ability to stop changes comes into play. We call this an emergency-stop ability.

Understanding emergency stops

There are multiple ways to build an emergency-stop system, but the basics are the same. The software checks a data store that contains the name of the workflow we are executing and its emergency-stop state.

The simplest version of an emergency-stop system has two modes, as follows:

  • Go

  • Stop

Any software that does work needs to check the emergency-stop system at regular intervals. If it cannot find itself listed, or if the system indicates it is in a Stop state, the software terminates. If the software is a workflow execution system, it terminates that workflow instead.
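The check described above can be sketched in Go roughly as follows. This is only an illustration under stated assumptions: the names Status, shouldContinue, and watch, the workflow name deployWeb, and the in-memory fetch function standing in for the data store are all hypothetical and not part of any real package.

```go
package main

import (
	"fmt"
	"time"
)

// Status is a hypothetical name for a workflow's emergency-stop state.
type Status string

const (
	Go   Status = "go"
	Stop Status = "stop"
)

// shouldContinue reports whether the named workflow may keep running.
// It fails closed: a workflow that is not listed, or that is in any
// state other than Go, is treated as stopped.
func shouldContinue(states map[string]Status, workflow string) bool {
	s, ok := states[workflow]
	return ok && s == Go
}

// watch calls fetch once per interval and returns as soon as the
// workflow is no longer allowed to run.
func watch(fetch func() map[string]Status, workflow string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if !shouldContinue(fetch(), workflow) {
			return
		}
	}
}

func main() {
	// Simulated data store (an assumption): flips to Stop on the third read.
	reads := 0
	fetch := func() map[string]Status {
		reads++
		if reads >= 3 {
			return map[string]Status{"deployWeb": Stop}
		}
		return map[string]Status{"deployWeb": Go}
	}

	watch(fetch, "deployWeb", 10*time.Millisecond)
	fmt.Println("emergency stop observed after", reads, "reads")
}
```

Treating an unlisted workflow the same as a stopped one means that a new tool does nothing until someone registers it, which is the safer default for this kind of system.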

More complicated versions of this might contain site information so that all tooling running at a site is stopped, or they might include other states such as Pause. These are more complicated to implement, so we will stick to the simpler form here.

Let's look at what an implementation of this might look like.

Building an emergency-stop package

The first thing we need to do is define what the data format will look like. For this exercise, we will use JavaScript Object Notation (JSON) stored on disk. The storage could just as well be a distributed filesystem or an entry in etcd. And while we are using JSON here, the same data could be a single table in a database or a protocol buffer.

Let's define the status our workflows can have as follows:
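One way to express these states and the JSON record that carries them is sketched below. The type names Status and Info, the JSON field names Name and Status, the string values "go" and "stop", and the parse helper are assumptions made for illustration; the package's actual definitions may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Status is the emergency-stop state of a workflow (hypothetical name).
type Status string

const (
	Go   Status = "go"   // the workflow may run
	Stop Status = "stop" // the workflow must stop
)

// Info is one entry in the emergency-stop data (hypothetical layout).
type Info struct {
	Name   string `json:"Name"`
	Status Status `json:"Status"`
}

// parse decodes a JSON list of entries and indexes them by workflow name.
// Any status other than Go or Stop, including a missing one, is rejected.
func parse(data []byte) (map[string]Info, error) {
	var entries []Info
	if err := json.Unmarshal(data, &entries); err != nil {
		return nil, err
	}
	byName := make(map[string]Info, len(entries))
	for _, e := range entries {
		if e.Status != Go && e.Status != Stop {
			return nil, fmt.Errorf("workflow %q has invalid status %q", e.Name, e.Status)
		}
		byName[e.Name] = e
	}
	return byName, nil
}

func main() {
	// Example file contents; the workflow names are made up.
	data := []byte(`[
		{"Name": "deployWeb", "Status": "go"},
		{"Name": "drainSite", "Status": "stop"}
	]`)

	infos, err := parse(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(infos["deployWeb"].Status, infos["drainSite"].Status)
}
```

Using a named string type rather than a bare string keeps the valid states in one place and makes the JSON file readable by a first responder who has to edit it under pressure.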
