The 5 AM Problem

Learn how 30 application server instances can all hang within the same five-minute window.

Thirty server instances

One of the sites I launched developed a nasty pattern of hanging completely at almost exactly 5 a.m. every day. The site was running on around 30 application server instances, so something was making all 30 of them hang within a five-minute window (the resolution of our URL pinger).

Restarting servers

Restarting the application servers always cleared it up, so there was some transient effect that tipped the site over at that time. Unfortunately, that was just when traffic started to ramp up for the day. From midnight to 5 a.m., only about 100 transactions per hour were of interest, but the numbers ramped up quickly once the East Coast started to come online (one hour ahead of us Central Time folks).

Restarting all the application servers just as people started to hit the site in earnest was what we’d call a suboptimal approach.

Thread dumps

On the third day that this occurred, we took thread dumps from one of the afflicted application servers. The instance was up and running, but all request-handling threads were blocked inside the Oracle JDBC library, specifically inside OCI calls. We were using the thick-client driver for its superior failover features.

In fact, once we eliminated the threads that were just blocked trying to enter a synchronized method, it looked as if the active threads were all in low-level socket read or write calls.
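A similar triage can be sketched from inside a JVM. This is a minimal illustration, not the procedure used on the site: it relies on the standard `java.lang.management` API, and the class name `ThreadDumpTriage` and the socket heuristic (top frame's class name contains "socket", method name contains "read" or "write") are assumptions. Threads stuck in native OCI calls may well show different frame names, so the heuristic would need adjusting for a real dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Locale;

// Hypothetical helper, named for illustration only.
final class ThreadDumpTriage {

    // A BLOCKED thread is only waiting to enter a synchronized block or method.
    static boolean isWaitingForMonitor(Thread.State state) {
        return state == Thread.State.BLOCKED;
    }

    // Heuristic (an assumption): the top frame's class mentions "socket"
    // and its method mentions "read" or "write".
    static boolean isInSocketIo(StackTraceElement[] stack) {
        if (stack.length == 0) {
            return false;
        }
        String cls = stack[0].getClassName().toLowerCase(Locale.ROOT);
        String method = stack[0].getMethodName().toLowerCase(Locale.ROOT);
        return cls.contains("socket")
                && (method.contains("read") || method.contains("write"));
    }

    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // false, false: skip lock details; thread states and stacks are enough here.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (isWaitingForMonitor(info.getThreadState())) {
                continue; // eliminate threads that are just queued on a monitor
            }
            StackTraceElement[] stack = info.getStackTrace();
            if (isInSocketIo(stack)) {
                System.out.println(info.getThreadName() + " is in " + stack[0]);
            }
        }
    }
}
```

Filtering out the monitor-blocked threads first keeps the output focused on the threads that actually hold up everyone else, which is the same elimination step described above.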

Packet capture

Abstractions provide great conciseness of expression. We can go much faster when we talk about fetching a document from a URL than if we have to discuss the tedious details of connection setup, packet framing, acknowledgments, receive windows, and so on. With every abstraction, however, the time comes when we must peel the onion, shed some tears, and see what’s really going on, usually when something is going wrong.

Whether for problem diagnosis or performance tuning, packet capture tools are the only way to understand what’s really happening on the network. tcpdump is a common UNIX tool for capturing packets from a network interface. Running it in “promiscuous” mode instructs the network interface card (NIC) to receive all packets that cross its wire, even those addressed to other computers. Wireshark can sniff packets on the wire like tcpdump does, but it can also show the packets’ structure in a GUI.

Wireshark runs on the X Window System. It requires a bunch of libraries that might not even be installed in a Docker container or an AWS instance. So it’s best to capture packets non-interactively using tcpdump and then move the capture file to a non-production environment for analysis.

The following screenshot shows Wireshark (then called “Ethereal”) analyzing a capture from a home network. The first packet shows an Address Resolution Protocol (ARP) request. This happens to be a question from my wireless bridge to my cable modem. The next packet was a surprise: an HTTP query to Google, asking for a URL called /safebrowsing/lookup with some query parameters. The next two packets show a DNS query and response for the michaelnygard.dyndns.org hostname. Packets 5, 6, and 7 are the three-way handshake for a TCP connection setup. We can trace the entire conversation between the web browser and server.

Note that the pane below the packet trace shows the layers of encapsulation that the TCP/IP stack created around the HTTP request in the second packet. The outermost layer is an Ethernet frame. The Ethernet frame contains an IP packet, which in turn contains a TCP segment. Finally, the payload of the TCP segment is an HTTP request. The exact bytes of the entire packet appear in the third pane.
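The layers of encapsulation can also be peeled apart by hand. The sketch below is only an illustration of the idea: it assumes an Ethernet II frame carrying IPv4 and TCP with no VLAN tag, and the class `FrameLayers`, the helper `sampleFrame`, and the port numbers are hypothetical. The sample frame is synthetic (zeroed MAC and IP addresses, zero checksums) and is not taken from the capture described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical names throughout; a sketch of peeling Ethernet -> IPv4 -> TCP -> payload.
final class FrameLayers {

    static final class Layers {
        final int etherType;
        final int ipProtocol;
        final int srcPort;
        final int dstPort;
        final String payload;

        Layers(int etherType, int ipProtocol, int srcPort, int dstPort, String payload) {
            this.etherType = etherType;
            this.ipProtocol = ipProtocol;
            this.srcPort = srcPort;
            this.dstPort = dstPort;
            this.payload = payload;
        }
    }

    static Layers parse(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame); // big-endian by default, i.e. network byte order
        // Ethernet II header: 6 bytes destination MAC, 6 bytes source MAC, 2 bytes EtherType.
        int etherType = buf.getShort(12) & 0xFFFF;
        if (etherType != 0x0800) {
            throw new IllegalArgumentException("not an IPv4 frame");
        }
        int ipStart = 14;
        int ipHeaderLen = (buf.get(ipStart) & 0x0F) * 4;   // IHL is counted in 32-bit words
        int ipTotalLen = buf.getShort(ipStart + 2) & 0xFFFF;
        int protocol = buf.get(ipStart + 9) & 0xFF;
        if (protocol != 6) {
            throw new IllegalArgumentException("not a TCP segment");
        }
        int tcpStart = ipStart + ipHeaderLen;
        int srcPort = buf.getShort(tcpStart) & 0xFFFF;
        int dstPort = buf.getShort(tcpStart + 2) & 0xFFFF;
        int tcpHeaderLen = ((buf.get(tcpStart + 12) & 0xF0) >> 4) * 4; // data offset, in words
        int payloadStart = tcpStart + tcpHeaderLen;
        int payloadEnd = ipStart + ipTotalLen;
        String payload = new String(frame, payloadStart, payloadEnd - payloadStart,
                StandardCharsets.US_ASCII);
        return new Layers(etherType, protocol, srcPort, dstPort, payload);
    }

    // Builds a synthetic frame: addresses and checksums are left as zeros.
    static byte[] sampleFrame(String httpRequest) {
        byte[] payload = httpRequest.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(14 + 20 + 20 + payload.length);
        buf.position(12);
        buf.putShort((short) 0x0800);                       // EtherType: IPv4
        buf.put((byte) 0x45);                               // version 4, header length 5 words
        buf.put((byte) 0);                                  // DSCP/ECN
        buf.putShort((short) (20 + 20 + payload.length));   // IP total length
        buf.position(14 + 9);
        buf.put((byte) 6);                                  // protocol: TCP
        buf.position(14 + 20);
        buf.putShort((short) 49152);                        // made-up source port
        buf.putShort((short) 80);                           // destination port: HTTP
        buf.position(14 + 20 + 12);
        buf.put((byte) 0x50);                               // TCP data offset: 5 words
        buf.position(14 + 20 + 20);
        buf.put(payload);
        return buf.array();
    }

    public static void main(String[] args) {
        Layers layers = parse(sampleFrame("GET /safebrowsing/lookup HTTP/1.1\r\n\r\n"));
        System.out.printf("EtherType 0x%04x, IP protocol %d, TCP %d -> %d%n",
                layers.etherType, layers.ipProtocol, layers.srcPort, layers.dstPort);
        System.out.print(layers.payload);
    }
}
```

Each header says where the next layer begins (the IP header length, then the TCP data offset), which is why the parser walks inward one layer at a time instead of using fixed offsets for the payload.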
