Drive Out Through Testing
Learn about QA testing limitations, changing workloads, capacity modeling, and how systems survive unpredictable traffic.
QA testing limitations
Unbalanced capacities are another problem rarely observed during QA. The main reason is that QA for every system is usually scaled down to just two servers. So during integration testing, two servers represent the front-end system and two servers represent the back-end system, resulting in a one-to-one ratio. In production, where the big budget is allocated, the ratio could be ten to one or worse. Should we make QA an exact scale replica of the entire enterprise? It would be nice, wouldn’t it? Of course, we can’t do that. We can apply a test harness, though (see Test Harnesses). By mimicking a back-end system wilting under load, the test harness helps us verify that our front-end system degrades gracefully (see Handle Others’ Versions for more ideas for testing).
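As an illustration, here is a minimal sketch of such a harness in Python, assuming the back end speaks HTTP; the port, path, and 30-second delay are arbitrary choices, not values from the text. Point the front end under test at this address and observe how it behaves while the "back end" wilts.

```python
# A minimal test-harness stub, assuming the back end speaks HTTP.
# The port and delay values here are illustrative assumptions.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

RESPONSE_DELAY_SECONDS = 30  # simulate a back end wilting under load


class WiltingBackendStub(BaseHTTPRequestHandler):
    def do_GET(self):
        # Hold the connection open far longer than any sane timeout, then
        # answer, so the front end's timeout and fallback paths get exercised.
        time.sleep(RESPONSE_DELAY_SECONDS)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"status": "eventually"}')


if __name__ == "__main__":
    # Point the front end under test at this address instead of the real back end.
    HTTPServer(("localhost", 8099), WiltingBackendStub).serve_forever()
```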
Changing workload
On the flip side, if we provide a service, we probably expect a normal workload. That is, we reasonably expect that today’s distribution of demand and transaction types will closely match yesterday’s workload. If all else remains unchanged, then that’s a reasonable assumption. But many factors can change the workload coming at our system: marketing campaigns, publicity, new code releases in the front-end systems, and especially links on social media and link aggregators. As service providers, we’re even further removed from the marketers who would deliberately cause these traffic changes. Surges in publicity are even less predictable.
Handling unpredictable traffic
So, what can we do if our service serves such unpredictable callers? Be ready for anything.
First, use capacity modeling to make sure we’re at least in the ballpark. Three thousand threads calling into seventy-five threads is not in the ballpark.
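A capacity model does not have to be elaborate to catch a mismatch like that. The sketch below is a back-of-envelope check in Python; the server and thread counts are made-up illustrations, and the 10:1 threshold is an arbitrary sanity limit, not a rule from the text.

```python
# A back-of-envelope capacity check, not a full queuing model.
# The counts below are illustrative assumptions, not measured values.
front_end_servers = 30
threads_per_front_end = 100   # request-handling threads per front-end server
back_end_servers = 3
threads_per_back_end = 25     # request-handling threads per back-end server

callers = front_end_servers * threads_per_front_end    # 3,000 potential concurrent calls
providers = back_end_servers * threads_per_back_end    # 75 threads to absorb them

ratio = callers / providers
print(f"{callers} caller threads -> {providers} provider threads (ratio {ratio:.0f}:1)")

# A crude sanity threshold; tune it to your own latency and timeout budget.
if ratio > 10:
    print("Not in the ballpark: expect queuing, timeouts, or worse at peak.")
```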
Second, don’t just test the system with usual workloads. See what happens if we take the number of calls the front end could possibly make, double it, and direct it all against our most expensive transaction. If our system is resilient, it might slow down, or even start to fail fast if it can’t process transactions within the allowed time (see Fail Fast), but it should recover once the load goes down.
Crashing, hung threads, empty responses, or nonsense replies indicate our system won’t survive and might just start a cascading failure.
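A sketch of that kind of stress run might look like the following; the URL, call counts, worker count, and timeout are assumptions for illustration, and a real exercise would use a proper load-generation tool rather than one script.

```python
# A minimal stress sketch, assuming an HTTP front door; the URL, counts,
# and timeout are illustrative assumptions, not values from the text.
import concurrent.futures
import urllib.request

EXPENSIVE_TRANSACTION_URL = "http://localhost:8080/reports/full-inventory"  # hypothetical
MAX_FRONT_END_CALLS = 3000        # most calls the front end could possibly make at once
BURST = MAX_FRONT_END_CALLS * 2   # double it, per the advice above


def call_once(_):
    try:
        with urllib.request.urlopen(EXPENSIVE_TRANSACTION_URL, timeout=5) as resp:
            return resp.status
    except OSError as exc:  # covers URLError, timeouts, refused connections
        return f"error: {exc}"


with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    results = list(pool.map(call_once, range(BURST)))

# Slowdowns and fast failures during the burst are acceptable; what matters is
# that an ordinary request succeeds again once the burst is over.
print("status after burst:", call_once(None))
```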
Third, if possible, use autoscaling to react to surging demand. It’s not a panacea, since it suffers from lag and can just pass the problem down the line to an overloaded platform service. Also, be sure to impose some kind of financial constraint on the autoscaling as a risk-management measure.
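One way to express that constraint is to derive a hard ceiling on instance count from the budget and clamp every scaling decision to it. The sketch below is illustrative only; the prices, thresholds, and doubling rule are assumptions, not recommendations.

```python
# A hedged sketch of a financial guardrail on autoscaling decisions; the
# cost figures and scaling rule are assumptions for illustration only.
HOURLY_COST_PER_INSTANCE = 0.25   # assumed instance price
MAX_HOURLY_SPEND = 12.50          # the budget ceiling you are willing to risk
MAX_INSTANCES = int(MAX_HOURLY_SPEND / HOURLY_COST_PER_INSTANCE)  # 50 here


def desired_capacity(current: int, cpu_utilization: float) -> int:
    """Naive scale-out rule, clamped by the budget-derived ceiling."""
    if cpu_utilization > 0.75:
        proposed = current * 2            # react to surging demand
    elif cpu_utilization < 0.25:
        proposed = max(1, current // 2)   # scale back in when the surge passes
    else:
        proposed = current
    return min(proposed, MAX_INSTANCES)   # the risk-management cap


print(desired_capacity(current=40, cpu_utilization=0.9))  # 50, not 80
```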
Tips to remember
Examine server and thread counts
In development and QA, the system probably looks like one or two servers, and so do all the QA versions of the other systems that are called. In production, the ratio might be more like ten to one instead of one to one. Check the ratio of front-end to back-end servers, along with the number of threads each side can handle in production compared to QA.
Observe near Scaling Effects and users
Unbalanced Capacities is a special case of Scaling Effects in which one side of a relationship scales up much more than the other side. A change in traffic patterns (seasonal, market-driven, or publicity-driven) can cause a usually benign front-end system to suddenly flood a back-end system, in much the same way as a hot Reddit post or celebrity tweet causes traffic to suddenly flood websites.
Virtualize QA and scale it up
Even if the production environment is a fixed size, don’t let QA languish at a measly pair of servers. Scale it up. Try test cases where we scale the caller and provider to different ratios. We should be able to automate all of this through our data center automation tools.
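For example, a small driver script can enumerate the caller-to-provider ratios we want to exercise. In this sketch, deploy_environment() is a hypothetical placeholder for whatever provisioning or automation tool we actually use, and the counts are arbitrary.

```python
# A sketch of a ratio test matrix; deploy_environment() is a hypothetical
# hook into whatever data center automation tooling is actually available.
from itertools import product

CALLER_COUNTS = [2, 4, 10, 20]
PROVIDER_COUNTS = [2, 4]


def deploy_environment(callers: int, providers: int) -> None:
    # Placeholder: wire this to your provisioning/automation tool.
    print(f"provisioning QA with {callers} callers and {providers} providers "
          f"(ratio {callers / providers:.1f}:1)")


for callers, providers in product(CALLER_COUNTS, PROVIDER_COUNTS):
    deploy_environment(callers, providers)
    # ...run the integration and load suites against this shape, then tear down.
```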
Stress both sides of the interface
If we provide the back-end system, see what happens if it suddenly gets ten times the highest-ever demand, hitting the most expensive transaction. Does it fail completely? Does it slow down and recover? If we provide the front-end system, see what happens if calls to the back end stop responding or become very slow.
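From the front-end side, the check can be as simple as pointing the caller at an unresponsive stub (such as the harness sketched earlier, listening on port 8099) and confirming that it fails fast instead of hanging. The URL and the three-second limit below are illustrative assumptions.

```python
# A sketch of the front-end side of the check: the caller should time out and
# fail fast rather than hang when the back end goes quiet.
import time
import urllib.request


def call_back_end() -> str:
    # Hypothetical client call with an explicit timeout; adjust to your client code.
    with urllib.request.urlopen("http://localhost:8099/lookup", timeout=2) as resp:
        return resp.read().decode()


start = time.monotonic()
try:
    call_back_end()
except OSError:
    elapsed = time.monotonic() - start
    assert elapsed < 3, f"caller hung for {elapsed:.1f}s instead of failing fast"
    print(f"failed fast after {elapsed:.1f}s, as it should")
```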