The 5 AM Problem

Learn how 30 application server instances can all hang within the same five-minute window.

We'll cover the following...

Thirty server instances

Restarting servers

Thread dumps
Packet capture
Reviewing the database
Restoring services

Thirty server instances

One of the sites I launched developed a nasty pattern of hanging completely at almost exactly 5 a.m. every day. The site ran on around 30 application server instances, so something was making all 30 of them hang within the same five-minute window (the resolution of our URL pinger).
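The remark about pinger resolution can be made concrete with a small sketch. A pinger that checks a URL every five minutes can only say that an instance hung somewhere between the last successful check and the first failed one, so "within a five-minute window" is the tightest bound it can give. The function name `failure_window`, the poll schedule, and the sample timestamps below are all hypothetical, chosen for illustration; nothing here describes the actual monitoring tool used on the site.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# One pinger observation: (time of the check, did the URL respond?)
Poll = Tuple[datetime, bool]


def failure_window(polls: Iterable[Poll]) -> Optional[Tuple[datetime, datetime]]:
    """Bracket the moment an instance hung, given polls in time order.

    Returns (last successful check, first failed check), or None when the
    polls never show a healthy-to-failed transition.
    """
    last_ok = None
    for checked_at, responded in polls:
        if responded:
            last_ok = checked_at
        elif last_ok is not None:
            # First failure after a success: the hang happened in between.
            return last_ok, checked_at
        else:
            # Failing from the very first poll: no lower bound is known.
            return None
    return None


# Hypothetical schedule: one check every five minutes starting at 4:50 a.m.
start = datetime(2024, 1, 1, 4, 50)
interval = timedelta(minutes=5)
# The first two checks succeed (4:50, 4:55); later checks fail.
polls = [(start + i * interval, i < 2) for i in range(4)]

window = failure_window(polls)
print(window[1] - window[0])  # 0:05:00
```

With this kind of check, 30 instances that hang seconds apart and 30 instances that hang four minutes apart look identical, which is why the pinger alone could not say how tightly synchronized the hangs were.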

Restarting servers

Restarting the application servers always cleared it up, so some transient effect was tipping the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., there were only about 100 transactions of interest per hour, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks).

Restarting all the application servers just as people started to hit the site in earnest was what we’d call a suboptimal approach.

Thread dumps

On the third day that ...
