Murder by the Masses

Learn how we found the reason for the crash, including the excessive number of sessions, refetching of cached pages, misbehaving bots, and the American Registry for Internet Numbers.

What caused the crash?

So after all that load testing, what happened on the day of the launch? How could the site crash so badly and so fast? Our first thought was that marketing was just way off on their demand estimates. Perhaps the customers had built up anticipation for the new site. That theory died quickly when we found out that customers had never been told the launch date. Maybe there was some misconfiguration or mismatch between production and the test environment?

Excessive number of sessions

The session counts led us almost straight to the problem. It was the number of sessions that killed the site. Sessions are the Achilles’ heel of every application server. Each session consumes resources, mainly RAM. With session replication enabled (it was), each session gets serialized and transmitted to a session backup server after each page request. That meant the sessions were consuming RAM, CPU, and network bandwidth. Where could all the sessions have come from?
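To make the cost of a session concrete, here is a rough back-of-the-envelope sketch in Python. It assumes each session has a fixed serialized size and that replication ships the whole session after every page request, as described above. The class name, function names, and any numbers you feed in are hypothetical and are not taken from the actual incident.

```python
from dataclasses import dataclass


@dataclass
class SessionLoad:
    """Hypothetical inputs for a rough session-cost estimate."""
    active_sessions: int        # sessions held in memory at the same time
    session_bytes: int          # assumed serialized size of one session
    requests_per_second: float  # page requests across all sessions


def session_ram_bytes(load: SessionLoad) -> int:
    # Every live session occupies memory until it expires.
    return load.active_sessions * load.session_bytes


def replication_bytes_per_second(load: SessionLoad) -> float:
    # Assumption: each page request serializes the whole session
    # and sends it to the session backup server.
    return load.requests_per_second * load.session_bytes
```

Both estimates grow linearly with session size, and the RAM estimate grows linearly with the session count, so a surge in sessions that nobody planned for translates directly into memory pressure. The CPU cost of serialization is not modeled here.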

Eventually, we realized that noise was our biggest problem. All of our load testing was done with scripts that mimicked real users with real browsers. They went from one page to another linked page. The scripts all used cookies to track sessions. They were polite to the system.
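The following toy sketch illustrates why cookie handling matters for session counts. It is not the real application server: `ToySessionStore`, `polite_client`, and `cookieless_client` are hypothetical names, and the store simply assumes that a request arriving without a known session cookie gets a brand-new session. The cookieless client is included only as a contrast to the polite test scripts.

```python
import itertools


class ToySessionStore:
    """Hypothetical stand-in for an application server's session table."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.sessions = {}

    def handle_request(self, cookie=None):
        """Return the session ID for this request, creating one if needed."""
        if cookie is None or cookie not in self.sessions:
            # No usable cookie: the server allocates a fresh session.
            cookie = f"s{next(self._ids)}"
            self.sessions[cookie] = {}
        return cookie


def polite_client(store, requests):
    # Behaves like the load-test scripts: returns the cookie on every request.
    cookie = None
    for _ in range(requests):
        cookie = store.handle_request(cookie)


def cookieless_client(store, requests):
    # Never sends a cookie back, so every request looks like a new visitor.
    for _ in range(requests):
        store.handle_request(None)
```

Running `polite_client(store, 100)` against a fresh store leaves a single session behind, while `cookieless_client(store, 100)` leaves one session per request. A load test built only from polite clients never exercises the second pattern.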
