Waiting for Godot

Learn how a system fails due to poor management, problematic features, and a careless QA process.

The start of the incident

It isn’t enough to write the code. Nothing is done until it runs in production. Sometimes the path to production is a smooth and open highway. Other times, especially with older systems, it’s a muddy track festooned with potholes, bandits, and checkpoints with border guards. This was one of the bad ones. I turned my bleary eyes toward the clock on the wall. The hands pointed to 1:17 a.m. On the Polycom, someone was reporting status. It was a DBA. One of the SQL scripts hadn’t worked right, but he “fixed” it by running it under a different user ID.

The wall clock didn’t mean much at the time. Our Lamport clock was still stuck a little before midnight. The playbook had a row that said SQL scripts would finish at 11:50 p.m. We were still on the SQL scripts, so logically we were still at 11:50 p.m. For this deployment to succeed, our playbook time and solar time had to converge before dawn.
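The Lamport clock analogy is apt: in a logical clock, time advances only when events occur, not when the wall clock ticks. Here is a minimal sketch of that idea, with the playbook steps as hypothetical stand-ins (the step names are illustrative, not from the actual playbook):

```python
class LamportClock:
    """Minimal logical clock: time advances only when events occur."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (e.g., completing a playbook row) advances the clock.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # On receiving a message, jump ahead of the sender's clock if needed.
        self.time = max(self.time, sender_time) + 1
        return self.time


# The playbook behaves like a Lamport clock: however late the wall clock
# runs, logical time stays parked at a row until that row completes.
playbook = LamportClock()
for step in ["status reports", "go/no-go meeting", "SQL scripts"]:
    playbook.tick()

print(playbook.time)  # 3 rows completed, regardless of wall-clock time
```

The point of the analogy: a stuck step stops logical time entirely, which is exactly what happened while the SQL scripts refused to finish.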

The first row in the playbook started yesterday afternoon with a round of status reports from each area:

  • Dev
  • QA
  • Content
  • Merchants
  • Order management

…and so on.

The go or no-go meeting

Somewhere on the first page of the playbook we had a “go or no-go” meeting at 3 p.m. Everyone gave the deployment a go, although QA said that they ...