
Taylor Swift Ticketmaster Meltdown: A System Design Analysis

Fahim ul Haq
Nov 17, 2023
12 min read

While the final numbers are still being tallied, Taylor Swift’s North American Eras Tour is likely the highest grossing concert tour ever. Revenue from ticket sales alone is estimated to be in the billions.

Hundreds of thousands of Swifties flocked to city stadiums every week, bringing significant economic upswings to every city the tour touched. Swift's financial impact is estimated to have stimulated the US economy by as much as $5 billion (including travel, food, merchandise, and more). The phenomenon has been dubbed "Taylornomics," and other North American cities have been clamoring for a piece of the action. Plus, with a feature film now in theaters, the Taylor Swift hype shows little sign of slowing down.

Suffice it to say, the Eras Tour made history on many fronts — including multiple Ticketmaster crashes.

When tickets went on sale a few months ago for the European leg of the Eras Tour, Ticketmaster France was unable to handle the traffic. As checkout errors piled in, the company paused the entire sale for six Taylor Swift shows slated for Spring 2024. On X, @TicketmasterFR released a short statement that pointed fingers at an unnamed "3rd party dependency."

This wasn't the only time an Eras Tour ticket drop went awry, though. The tour got off to a rocky start, too: when tickets first went on sale in November 2022, thousands of fans were unable to get tickets due to internal failures in the Ticketmaster system.

With news of Ticketmaster still upsetting Swifties, I'd like to recap where Ticketmaster went wrong, and then outline some best practices when it comes to building large-scale software systems that can tolerate extreme spikes in traffic. 

What went wrong with Ticketmaster France?

The recent Ticketmaster France incident made smaller headlines than the November 2022 meltdown, and the company has not released a blog post addressing what specifically went wrong. Ticketmaster is a fairly large system, so it is impossible from the outside to pinpoint the single hiccup, whether in a cloud provider or another dependency, that caused the delay for French Swifties.

That said, pausing the entire sale points to a dependency that is critical to the ticket checkout process, and one that could very easily cause cascading failures (more on that later). With that in mind, the dependency problem may have been related to the payment processing workflow from PayPal. It is common knowledge that both Ticketmaster and Live Nation rely on PayPal as their primary global payment processing platform.

But enough speculating: let's get to the concrete details of the Ticketmaster meltdown that put the company in the public eye and in the courts.


Unpacking the 2022 Ticketmaster fiasco

Following the initial botched sale of Eras Tour tickets in 2022, Ticketmaster released a short statement identifying the root cause of the issues that occurred. I’ll summarize their findings here:

Flaws in the Verified Fans system

Ticketmaster originally implemented the Verified Fans system to differentiate bots from actual humans in order to combat scalpers and ticket resellers. 

There were 3.5 million verified fans registered for the presale. Of those, 1.5 million received an invite passcode, and 2 million were waitlisted.

This would normally be a good solution for shortening wait times and making ticket sales smoother. However, Taylor Swift generated an "historically unprecedented demand" that their platform struggled to accommodate. 

  • Ticketmaster was prepared to handle the 1.5 million verified fans who were invited to purchase tickets, but was overwhelmed when over 14 million showed up.

  • Over 15% of interactions across the site experienced issues, including passcode validation errors.

  • Their system failed to reject new requests once system capacity was exceeded, so users continued to send requests that would ultimately overload and crash their servers.

Ultimately, this meant that some users couldn't check out even after their tickets were carted. In an effort to control the damage, Ticketmaster attempted to waitlist more customers and delay sales, resulting in longer queues. Many verified fans waited hours in these queues only to be met with error messages.

As the saying goes, an ounce of prevention is worth a pound of cure: placing a hard limit on the number of incoming requests could have prevented a great deal of confusion and frustration.
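To make that concrete, here is a minimal sketch of what such a hard limit could look like: a token-bucket rate limiter that sheds excess requests instead of letting them pile up. The capacity and refill numbers are purely illustrative assumptions, not figures from Ticketmaster.

```python
import time

class TokenBucket:
    """Reject requests outright once capacity is exhausted,
    instead of letting them queue up indefinitely."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity            # max burst of requests allowed
        self.tokens = float(capacity)
        self.refill_rate = refill_rate      # tokens (requests) added per second
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed the request instead of letting it overload the servers


# Hypothetical usage: allow ~500 checkout requests/second with bursts up to 1,000
limiter = TokenBucket(capacity=1_000, refill_rate=500)
if not limiter.allow():
    print("HTTP 429: Too many requests, please try again shortly")
```

Rejecting early with a clear "try again" signal is far cheaper than accepting a request the system can never finish.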

Cascading failures

While it was clear from the start that the Eras Tour would generate an enormous demand for tickets, the magnitude of that demand was not realized until it was too late.

Ticketmaster's first mistake was being underprepared for the sheer volume of customers that tried to purchase tickets. Capacity planning is an important aspect of System Design, and when a system doesn't have sufficient safeguards in place to throttle requests, things can quickly spiral out of control. 

Some significant contributing factors were identified in the post-mortem:

  • Low supply, high demand: Ticketmaster sold 2.4 million tickets compared to the 14 million customers and bots trying to buy them. Their post-mortem indicated that most fans invited to the pre-sale purchased three tickets on average.

  • Bots (DDoS attack): To make matters worse, Ticketmaster had to contend with ticket scalpers using bots without presale codes to grab tickets before verified fans had a chance, which undermined the intended effectiveness of the Verified Fans system.

  • “Retry hell”: A multiplier effect occurred as customers repeatedly retried their purchases while stuck in limbo, with no way of knowing whether their requests would ever succeed (see the retry sketch below).

  • Third-party dependencies: Ticketmaster relies on third-party payment processors such as PayPal, so when its system buckled, these services were unable to process payments or return information in a timely manner.

With these four factors in play, the number of requests ballooned to a staggering 3.5 billion — almost four times their previous peak.  
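The standard antidote to retry hell is client-side exponential backoff with jitter, so retries spread out over time instead of arriving in synchronized waves. The sketch below is generic; the function names are hypothetical and this is not Ticketmaster's client code.

```python
import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a flaky call with exponential backoff and full jitter,
    so thousands of clients don't all retry at the same instant."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the user
            # Full jitter: sleep a random amount, capped by the exponential schedule
            delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
            time.sleep(delay)


# Hypothetical usage:
# tickets = call_with_backoff(lambda: checkout_api.reserve(seat_id))
```

Pairing client-side backoff with server-side rate limiting (like the token bucket above) attacks the multiplier effect from both ends.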

The resulting effect is similar to what a DDoS attack would look like. While I'm certain that Ticketmaster will learn to prioritize more robust capacity planning measures going forward, it is a bit unexpected to see a company that should be wired for this kind of moment fail under pressure. 

The main takeaway from this case study is how critical it is for businesses to take a proactive approach to capacity planning. With the right contingency measures in place, similar scenarios can be avoided.

How should you architect a system to scale up for high demand?

Live queuing is a really hard problem to solve. If Ticketmaster's goal is to have hundreds of thousands or even millions of users live queue for a ticket drop event, it will take a great deal of processing power. Timestamps are not granular enough to queue any appreciable number of concurrent users. (There are better approaches to serving customers than live queuing, but we'll get to those a little later.)
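One common workaround is to hand out explicit queue positions from an atomic counter instead of relying on arrival timestamps. Here is a toy, in-process sketch of the idea; a real deployment would keep the counter in a shared store such as Redis or a database sequence, and every name below is made up for illustration.

```python
import itertools
import threading

class WaitingRoom:
    """Hand out strictly increasing queue positions, independent of
    timestamp resolution or arrival-time ties."""

    def __init__(self):
        self._counter = itertools.count(start=1)
        self._positions = {}          # user_id -> queue position
        self._lock = threading.Lock()

    def join(self, user_id: str) -> int:
        with self._lock:
            if user_id not in self._positions:
                self._positions[user_id] = next(self._counter)
            return self._positions[user_id]


room = WaitingRoom()
print(room.join("swiftie_1"))   # 1
print(room.join("swiftie_2"))   # 2
print(room.join("swiftie_1"))   # still 1: rejoining doesn't lose your place
```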

This scenario is not new to the world of distributed computing and even has a name you may have heard of before: "The Thundering Herd" problem. Large distributed systems like Facebook have dealt with far more extreme thundering herds than the Taylor Swift fans. What happened to Ticketmaster is not an unsolved problem.

Given the outcome, we can assume that Ticketmaster doesn't have a lot of elastic capacity. Elastic capacity refers to the availability of backup servers used to handle an increase in traffic. Normally, these extra servers are used for non-critical data or non-time-sensitive processes. When there's a spike in website visitors, these reserve servers can be accessed to help manage the extra load.

Just adding more computing power seems like a bit of a simplistic solution, however. So, let's touch on three ways that systems can be designed to scale when demand is high.

Caching

The most obvious way to deal with high traffic loads is to cache as much data as possible. Caching can be done at the server or client level and is especially useful for responses to frequently requested resources. If you can speed up the delivery, you can serve more users, all while utilizing less compute power. 
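As a rough illustration, even a small in-process TTL cache in front of a hot read path, say the seat map for a popular show, can collapse thousands of identical lookups into a single backend call. This is a generic sketch; fetch_seat_map and its return value are hypothetical.

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds: float):
    """Cache a function's result for ttl_seconds so repeated requests
    for the same hot resource don't each hit the database."""
    def decorator(fn):
        store = {}  # args -> (expires_at, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]             # served from cache
            value = fn(*args)             # cache miss: do the expensive work once
            store[args] = (now + ttl_seconds, value)
            return value
        return wrapper
    return decorator


@ttl_cache(ttl_seconds=2.0)
def fetch_seat_map(event_id: str) -> dict:
    # Hypothetical expensive call to the seating database
    return {"event": event_id, "available_seats": ["A1", "A2", "B7"]}
```

In practice a shared cache such as Redis or a CDN would sit in front of many app servers, but the principle is the same: serve hot, mostly static data without touching the database.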

Graceful degradation

This is a classic capacity planning consideration that I'll touch on very briefly. At its simplest, graceful degradation is essentially a way of gradually refusing requests. As load increases, the workflow would look something like this (a rough code sketch follows the list):

  • All systems working normally, no requests rejected.

  • Heavy load, introduce a random queuing pattern for requests.

  • Load critical, randomly reject requests.

  • System life support, reject all requests.
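Expressed as code, the tiers above boil down to a load-shedding policy keyed off current utilization. The thresholds and probabilities here are arbitrary placeholders, not values from any Ticketmaster post-mortem.

```python
import random

def admit_request(utilization: float) -> bool:
    """Decide whether to serve a request based on current load (0.0-1.0).
    Thresholds and probabilities are illustrative placeholders."""
    if utilization < 0.70:
        return True                      # all systems normal: accept everything
    if utilization < 0.90:
        return random.random() < 0.50    # heavy load: queue or defer about half
    if utilization < 0.98:
        return random.random() < 0.10    # load critical: reject most requests
    return False                         # life support: reject everything


# Hypothetical usage inside a request handler:
# if not admit_request(current_utilization()):
#     return http_503("We're at capacity, please retry shortly")
```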

Graceful degradation ensures that core system functions are available even when other, less critical areas fail. A great example of graceful degradation in the real world is the escalator: when an escalator stops moving, it becomes a set of stairs. It's more efficient when working as intended, but the end result is the same.

Build a waiting room and set a time limit to buy

Ticketmaster already has a waiting room it calls "Smart Queue," and they set a time limit to check out with tickets. These are well-known best practices for addressing bot attacks or users holding tickets without the intent of purchasing them. 

Ticketmaster doesn't explicitly enforce a limit on how many users can attempt to buy tickets once the queue finally lifts. The Ticketmaster system attempts to "hold" the tickets that are placed in a cart, and then gives the buyer a time limit to complete the transaction.

The problem arises when the tickets that are carted are not actually available. When this happens, an error is delivered at checkout and the user goes back to the Interactive Seat Map (ISM) to put another ticket in their cart. This can cause an enormous crush of users converging on whatever tickets the system is still listing for sale.
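A minimal sketch of the hold-with-a-time-limit idea: a seat is reserved for a fixed window, and if checkout doesn't complete in time the hold lapses and the seat returns to the pool. All class and method names here are hypothetical, and a production version would persist holds in a shared store rather than in memory.

```python
import time

HOLD_SECONDS = 600  # e.g., a 10-minute window to complete checkout

class SeatHolds:
    """Hold a carted seat for a fixed window; expired holds release the seat."""

    def __init__(self):
        self._holds = {}  # seat_id -> (user_id, expires_at)

    def try_hold(self, seat_id: str, user_id: str) -> bool:
        now = time.monotonic()
        current = self._holds.get(seat_id)
        if current and current[1] > now:
            return False                  # someone else holds this seat right now
        self._holds[seat_id] = (user_id, now + HOLD_SECONDS)
        return True

    def confirm_purchase(self, seat_id: str, user_id: str) -> bool:
        current = self._holds.get(seat_id)
        if not current or current[0] != user_id or current[1] <= time.monotonic():
            return False                  # hold expired or never existed: back to the seat map
        del self._holds[seat_id]          # seat is sold; remove the hold
        return True
```

The crucial detail is that confirm_purchase re-checks the hold, so an expired cart fails cleanly instead of selling the same seat twice.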


Did 3rd party dependencies doom Ticketmaster?

The short answer is that the best way for Ticketmaster to prevent this from happening again depends on how its system is designed internally. A system is only as strong as its weakest link, and I'm sure their engineers are working hard to flag and address bottlenecks.

Since most of the failures occurred after users carted tickets, it's possible that payment processing dependencies through PayPal, or card provider platforms like VISA and Mastercard, caused cascading failures throughout the system. 
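A standard way to keep a slow or failing third-party dependency from taking down the whole checkout flow is a circuit breaker: after enough consecutive failures, calls are short-circuited for a cool-down period and fail fast instead of tying up resources. The sketch below is generic; the payment gateway reference is an illustrative assumption, not Ticketmaster's actual integration.

```python
import time

class CircuitBreaker:
    """After failure_threshold consecutive errors, fail fast for
    cooldown_seconds instead of repeatedly calling a broken dependency."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.open_until = 0.0

    def call(self, fn, *args, **kwargs):
        if time.monotonic() < self.open_until:
            raise RuntimeError("Circuit open: payment provider temporarily unavailable")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.open_until = time.monotonic() + self.cooldown_seconds
                self.consecutive_failures = 0
            raise
        self.consecutive_failures = 0   # success resets the failure count
        return result


# Hypothetical usage:
# breaker = CircuitBreaker()
# breaker.call(payment_gateway.charge, order_id, amount)
```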

But is demand just too high?

That's part of the problem, yes. The unique conditions leading up to Taylor Swift's tour ensured a fanbase that was ravenous for the singer's next performance. Prior to the Eras Tour, Swift had not taken the stage since 2018, for the Reputation Tour. Her long absence from the stage, combined with the passionate fervor of post-pandemic concert-goers, created an unprecedented demand for tickets.

Ticketmaster estimates, "based on the volume of traffic to our site, Taylor would need to perform over 900 stadium shows (almost 20x the number of shows she is doing)." Even if Ticketmaster was able to pull off a flawless drop, many fans would still be left empty-handed.

Knowing all of this, plus the social media hype around her North American tour, it was not hard to see that the ticket sale for European Swifties would be a challenge.

So, what needs to change?

In general, it is a good practice to brace for 10 billion system calls by deploying all the mitigation tactics mentioned above (caching, elastic capacity, graceful degradation, etc.). This is much more traffic than the 3.5 billion requests that Ticketmaster reported, so it's possible that their system is held up by one tricky bottleneck that is causing cascading failures throughout the system.

Sadly, without a look under the hood at Ticketmaster, we may never know exactly how they can be better prepared.

The future of limited drop systems

Taking a step back from capacity planning for a moment, let's consider the presale process from the user's perspective.

Some users feel that Ticketmaster's Verified Fans system is simply too complicated. Many fans have reported spending hours in a queue only to contend with checkout errors when they finally reach the front of the line. This whole pre-sale process takes a lot of time – sometimes up to four or five hours. And this doesn't include the prep time it takes to register for the individual ticket drop as a Verified Fan and receive a pre-sale code.

For a regular fan with normal day-to-day obligations, this commitment can be untenable.

There are multiple layers of granularity that can be added to help alleviate stress on both software systems and consumers.

  • Lottery: Some users are randomly selected to have the opportunity to purchase the limited item. Those who aren't selected have to resort to the aftermarket, but can take solace in knowing that everyone had an equal chance (a toy sketch of this follows the list).

  • Tranches: Releasing items in limited batches is usually much better than releasing them all at once, especially for a software system. Ticketmaster has dabbled in this idea with multiple limited presales, but it could help even more if the strategy were applied to the general sale as well.

  • IP segmentation: Releases could even be staggered by IP region. Yes, users could potentially fool the system with a VPN, but it would still help to diversify the load.
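To make the lottery idea concrete: picking winners is just a uniform random sample over registered fans, which is cheap to compute and easy to explain. The numbers and function below are hypothetical, loosely inspired by the 2022 presale figures mentioned earlier.

```python
import random

def run_ticket_lottery(registered_fans: list[str], tickets_available: int,
                       tickets_per_winner: int = 3) -> list[str]:
    """Randomly select which registered fans get a purchase window.
    Everyone has an equal chance; the rest go to a waitlist or the aftermarket."""
    winners_needed = min(tickets_available // tickets_per_winner, len(registered_fans))
    return random.sample(registered_fans, winners_needed)


# Toy numbers, loosely inspired by the 2022 presale (3.5M registered, 2.4M tickets):
fans = [f"fan_{i}" for i in range(3_500)]
winners = run_ticket_lottery(fans, tickets_available=2_400)
print(len(winners))  # 800 winners at ~3 tickets each
```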

While I don't think Ticketmaster will completely go back to the drawing board for their pre-sale workflow, it is important to remember that capacity limitations and other System Design bottlenecks can sometimes be addressed by optimizing other areas.

This type of holistic problem-solving is a key aspect of System Design, and one that we talk about extensively in our course, Grokking Modern System Design Interview for Engineers & Managers.

This course is a great place to learn System Design fundamentals — for interviews and beyond. One of my personal favorite parts of the course is that you can practice your new skills by applying them to 13 real-world System Design problems, like Design YouTube or Design Uber. Explore Educative courses and projects today and get a free 7-day trial.

Thanks for reading, and let me know what you'd like to hear about next!

Happy Learning!