Waiting for Godot

Learn how a system fails due to poor management, rushed features, and a careless QA process.

The start of the incident

It isn’t enough to write the code. Nothing is done until it runs in production. Sometimes the path to production is a smooth, open highway. Other times, especially with older systems, it’s a muddy track festooned with potholes, bandits, and checkpoints with border guards. This was one of the bad ones.

I turned my grainy eyes toward the clock on the wall. The hands pointed to 1:17 a.m. On the Polycom, someone was reporting status. It was a DBA. One of the SQL scripts hadn’t worked right, but he “fixed” it by running it under a different user ID.

The wall clock didn’t mean much at the time. Our Lamport clock was still stuck a little before midnight. The playbook had a row that said SQL scripts would finish at 11:50 p.m. We were still on the SQL scripts, so logically we were still at 11:50 p.m. Before dawn, we needed our playbook time and solar time to converge in order for this deployment to succeed.
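If the term is unfamiliar: a Lamport clock is a logical clock from distributed systems that advances only when events occur, not when seconds tick by. Here’s a minimal sketch of the idea in Python (the class and the event comments are illustrative, not from any real deployment tooling):

```python
class LamportClock:
    """Logical clock: time moves only when events occur."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        # A local event happened; advance logical time by one.
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # On receiving a message, jump ahead of the sender's clock.
        self.time = max(self.time, sender_time) + 1
        return self.time


playbook = LamportClock()
playbook.tick()  # status reports done
playbook.tick()  # go/no-go meeting done
# Until the "SQL scripts finished" event fires, logical time stays
# parked here, no matter how far the wall clock drifts past 1:17 a.m.
```

That’s the joke: by the playbook’s logical clock it was still 11:50 p.m., and only completing the stuck step would let playbook time and wall-clock time converge.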

The first row in the playbook started yesterday afternoon with a round of status reports from each area:

  • Dev
  • QA
  • Content
  • Merchants
  • Order management, and so on

The go or no-go meeting

Somewhere on the first page of the playbook, we had a “go or no-go” meeting at 3 p.m. Everyone gave the deployment a go, although QA said that they ...