The 5 AM Problem

Learn how 30 application server instances can all hang within the same five-minute window.

Thirty server instances

One of the sites I launched developed a nasty pattern of hanging completely at almost exactly 5 a.m. every day. The site ran on around 30 application server instances, so something was making all of them hang within the same five-minute window (the resolution of our URL pinger).

Restarting servers

Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., there were only about 100 transactions of interest per hour, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks).

Restarting all the application servers just as people started to hit the site in earnest was what we’d call a suboptimal approach.

Thread dumps

On the third day that ...