Waiting for Godot
Learn how a system fails through poor management, feature churn, and a careless QA process.
The start of the incident
It isn’t enough to write the code. Nothing is done until it runs in production. Sometimes the path to production is a smooth and open highway. Other times, especially with older systems, it’s a muddy track festooned with potholes, bandits, and checkpoints with border guards. This was one of the bad ones. I turned my bleary eyes toward the clock on the wall. The hands pointed to 1:17 a.m. On the Polycom, someone was reporting status. It was a DBA. One of the SQL scripts hadn’t worked right, but he “fixed” it by running it under a different user ID.
The wall clock didn’t mean much at the time. Our Lamport clock was still stuck a little before midnight. The playbook had a row that said SQL scripts would finish at 11:50 p.m. We were still on the SQL scripts, so logically we were still at 11:50 p.m. Before dawn, we needed our playbook time and solar time to converge in order for this deployment to succeed.
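A Lamport clock advances only when events complete, never with the wall clock, which is why our logical time could sit frozen at 11:50 p.m. while the real night wore on. A minimal sketch of the idea (the class and method names here are illustrative, not from any particular library):

```python
class LamportClock:
    """Logical clock: time moves only when an event happens."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event (e.g., one playbook row completing) advances the clock.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # On hearing from another process, jump past its clock so
        # causality is preserved: our next event happens "after" theirs.
        self.time = max(self.time, sender_time) + 1
        return self.time


clock = LamportClock()
clock.tick()        # one step of the playbook done -> logical time 1
clock.receive(5)    # a report from a process at time 5 -> logical time 6
```

Until the SQL scripts finished, no event fired, so logical time stood still no matter what the wall said.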
The first row in the playbook started yesterday afternoon with a round of status reports from each area:
- Dev
- QA
- Content
- Merchants
- Order management, and so on.
The go or no-go meeting
Somewhere on the first page of the playbook we had a “go or no-go” meeting at 3 p.m. Everyone gave the deployment a go, although QA said that they ...