Large systems sometimes behave unexpectedly. On the surface, the implementation would seem flawless—potential bugs preempted and best practices followed to a tee. Now, if the system crashes, it is easy to detect the crash, and the problem can be isolated and fixed. However, sometimes the system does not crash. Instead, it runs but either gives incorrect results or negatively impacts the performance. This is a tricky scenario because the malfunction may remain undetected. In this case, various aspects of the system need to be monitored to detect problems and, ideally, preempt them.
In this blog, we will discuss one such problem: the loss of throughput in data centers caused by TCP incast.
Let’s go!
Many data center applications (e.g., web search, data analytics, recommender systems, and social networking) and frameworks (like MapReduce) follow a many-to-one communication pattern. In other words, they have a partition-aggregate workload, where a task is split and assigned to different machines and their responses are aggregated. Let's look at an example to better understand this point.
In applications such as web search, when a user request arrives at a server (aggregator), it is split into multiple smaller sub-requests and forwarded to different worker machines. The workers process the requests and return their results, which are combined at the aggregator and finally returned to the end user. The fan-in may be as high as 100 to 200 responses converging at the receiver.
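The partition-aggregate pattern can be sketched with a toy aggregator. This is a minimal illustration, not a real framework's API; the `worker` and `aggregate` names and the request format are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(sub_request):
    # Hypothetical worker: processes one sub-request (illustrative only).
    return f"result-for-{sub_request}"

def aggregate(request, num_workers=4):
    # Partition: split the request into one sub-request per worker.
    sub_requests = [f"{request}/part-{i}" for i in range(num_workers)]
    # Fan-out to the workers; their responses fan back in at this aggregator.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        responses = list(pool.map(worker, sub_requests))
    # Aggregate: combine the partial results into the final answer.
    return responses

print(aggregate("search-query", num_workers=4))
```

In a real deployment, the fan-out would go over the network to separate machines, and the fan-in of their simultaneous responses is exactly what creates the burst discussed next.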
But why do we need to split a task and aggregate responses?
This partition-aggregate approach provides many benefits. Scalability and the performance speed-ups achieved through parallelization are the most important benefits. However, these benefits come at a cost. Because of simultaneous responses, the network often sees large congestive traffic bursts. These bursts might lead to a throughput collapse due to what is called TCP incast.
Let’s see what this is all about. But first, let’s take a brief detour to understand how TCP handles network congestion and packet loss.
A TCP sender starts by sending just a few packets (let's assume three) so as not to overwhelm the underlying network. If those packets are acknowledged, the sender gradually increases its transmission window, probing for the available capacity; if packets are lost, it treats the loss as a sign of congestion and backs off.
So the questions are: how does a TCP sender detect packet loss, and how does it determine the severity of traffic congestion? There are two ways a TCP sender infers these: a retransmission time-out, where no acknowledgment arrives within the time-out period, and duplicate acknowledgments, where three duplicate ACKs for the same sequence number signal that one packet was lost while later packets got through, triggering a fast retransmit.
Fast retransmission is much quicker than waiting for a time-out, which is usually several RTTs long; it allows the sender to react and retransmit within one RTT. Another difference is that fast retransmission implies a level of network congestion less severe than a time-out. In fast retransmit, a packet is lost but multiple packets transmitted after it are received, unlike the time-out scenario, where no packets reach the receiver at all (otherwise, some acknowledgments would have been received). Because of this, on a time-out, the sender reduces its transmission window to one, whereas on three duplicate acknowledgments, the window is only halved because the congestion is less severe. Note that there are several TCP variants that handle congestion control differently from the approach described above, but the basic idea remains the same.
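The two reactions can be captured in a toy model. This is a deliberate simplification of TCP's behavior (real variants differ in details such as fast recovery and window growth), meant only to show the asymmetry between the two loss signals.

```python
def react_to_loss(cwnd, event):
    # Toy model of a TCP sender's reaction to loss signals.
    # Simplified: real TCP variants differ in the details.
    if event == "timeout":
        # Severe congestion inferred: collapse the window to one segment.
        return 1
    elif event == "triple_dup_ack":
        # Milder congestion: fast retransmit, and the window is halved.
        return max(cwnd // 2, 1)
    # No loss signal: window unchanged.
    return cwnd

print(react_to_loss(16, "triple_dup_ack"))  # 8
print(react_to_loss(16, "timeout"))         # 1
```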
Now that we have an idea of congestion control and packet loss recovery, let’s look into TCP incast.
In an incast scenario, the requests are sent simultaneously, so the responses get synchronized and arrive as a large traffic spike at the bottleneck link. Data center switches usually have shallow buffers, and the available buffer space is often smaller than the burst, so the initial packets of several TCP flows may get dropped. Because these losses cluster at the start of the flows, the acknowledgments reaching the senders may be insufficient to trigger quick recovery (fast retransmit) through three duplicate ACKs. This leads to TCP time-outs.
Due to operating system constraints, this time-out waiting period can't be less than a certain minimum threshold (RTOmin), usually around 10 ms. Now, compare this to RTTs in a data center, which are typically around a hundred microseconds, a hundredfold smaller than the RTOmin. As a result, the link capacity would go unused for a hundred or more RTTs, and since multiple flows might be affected at once, the actual impact would be much greater: throughput would be severely degraded. If the traffic were deadline sensitive, for example, the results of a web search, the timed-out flows would have to be ignored, causing incomplete results to be returned to the end user. Even in the absence of any deadlines, the delays would still negatively impact the end user's experience.
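A quick back-of-envelope calculation makes the mismatch concrete. The 100-microsecond RTT and 10 ms RTOmin are the illustrative figures used above, not measurements.

```python
# Back-of-envelope: how many RTTs of link capacity one time-out wastes.
# Both numbers are illustrative assumptions from the discussion above.
rtt_s = 100e-6      # assumed data center round-trip time: 100 microseconds
rto_min_s = 10e-3   # assumed minimum retransmission time-out: 10 ms

wasted_rtts = rto_min_s / rtt_s
print(wasted_rtts)  # 100.0 -> the link sits idle for ~100 RTTs per time-out
```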
💡 TCP has been studied and known to work well in the wide-area network. That doesn’t mean that TCP would always work. When moving to a different environment, such as a microsecond RTT network, we should carefully revisit all previous measurements and insights and ascertain if they still hold.
A seemingly simple option is to either increase the buffer sizes or decrease the fan-in. The former is expensive and also increases queuing delays, because data center traffic also includes long-running background flows (due to, for example, data migrations and updates) that would fill up the larger buffers. The latter trades off scalability and reliability for a lower fan-in, because the load would now be distributed among fewer worker nodes.
💡 First-order problem-solving might turn out to be a temporary fix and not a valid long-term solution. We always need to go deeper.
Several approaches have been proposed to deal with TCP incast. Broadly, these techniques either focus on recovering from incast quickly or try to avoid it altogether.
One option is to further reduce TCP's RTOmin. Since data center RTTs are in microseconds, ideally, we want TCP's minimum time-out waiting period to reflect that. If only it were that simple. This approach requires very high-resolution timers, which increase system load and adversely affect the performance of other processes running on the servers. So this is not feasible in practice.
💡 Some ideas may seem good in theory and simulations, but when we look at an actual implementation, we begin to realize its inherent weaknesses and constraints.
Some problems can be addressed with a randomization-based approach. Since the initial clustered packet losses may belong to multiple flows, the time-outs, and hence the retransmissions, would also get synchronized, causing another incast. Instead, each sender can select a random RTOmin value so that it may transmit while other senders are silent, preventing collisions. Senders can also retransmit packets probabilistically, without waiting for a time-out, so that there are always some packets in transit, reducing sender idle time and preventing a throughput collapse. In both cases, selecting the right probability is critical: a low value would not provide much benefit, while a high value would cause unnecessary retransmissions or multiple simultaneous transmissions that would further worsen the congestion.
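Both randomized ideas fit in a few lines. These are sketches of the concepts, not real TCP knobs; the function names, the 10 ms base value, and the jitter bound are all assumptions for illustration.

```python
import random

def desynchronized_rto(base_rto_s=0.010, jitter_s=0.010, rng=None):
    # Each sender adds a random offset to its minimum time-out so that
    # flows whose packets were dropped in the same burst do not all
    # retransmit at the same instant. (Sketch; values are assumed.)
    rng = rng or random.Random()
    return base_rto_s + rng.random() * jitter_s

def should_probe(p=0.05, rng=None):
    # Probabilistic early retransmission: with probability p, resend a
    # packet before the timer fires, keeping some packets in flight.
    # Too low a p helps little; too high a p adds to the congestion.
    rng = rng or random.Random()
    return rng.random() < p

rto = desynchronized_rto(rng=random.Random(1))
print(0.010 <= rto <= 0.020)  # True: the time-out lands in the jittered range
```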
💡 Not all solutions have to be deterministic. Sometimes, a randomization-based approach may be less complex with comparable, if not better, results.
Sometimes, we need some signaling between the senders, receivers, and the underlying network routers for better decision-making. Generating additional control traffic might not be an option because it would further add to the congestion problem. There is a simpler way: when a router needs to drop a packet, it can drop only the payload and let the packet header pass. The header is much smaller than the payload, so this reduces the congestion while, at the same time, the receiver (and through it, the sender) finds out that the payload was dropped due to congestion and can take appropriate action without waiting for the time-out.
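The payload-trimming idea can be sketched as a router's forwarding decision. The dict layout and field names here are stand-ins, not any real switch API.

```python
# Sketch of payload trimming at a congested router: instead of dropping
# the whole packet, keep the small header so the endpoints learn about
# the congestion quickly. The packet layout is an assumed stand-in.
def forward(packet, queue_full):
    if queue_full:
        # Trim: drop the payload, let the header through as a loss signal.
        return {"header": packet["header"], "payload": None, "trimmed": True}
    # Otherwise, forward the packet unchanged.
    return {**packet, "trimmed": False}

pkt = {"header": {"src": "worker-7", "seq": 42}, "payload": "x" * 1460}
print(forward(pkt, queue_full=True))
```

The trimmed header costs the switch almost no buffer space, yet it tells the receiver exactly which sequence number to ask for again, well before any timer fires.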
💡 When designing a solution, try to figure out an approach that requires you to gain the most while making minimum changes to the system. Such a solution would be less complex and easier to adopt.
Sometimes, proactively avoiding a problem is better than recovering from it. Recovering from a situation can incur some delays that have an adverse effect on delay-sensitive traffic and the end user experience. In this case, it is better to keep congestion low and prevent queue buildup and incast.
In application-level staggering, each sender tries to avoid incast by introducing random delays before starting transmission. This prevents the synchronization of flows, reducing the high fan-in traffic burst. However, there is a trade-off—chances of incast are reduced, but the average flow completion time increases.
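A small simulation shows the effect of staggering. The sender count, the 2 ms jitter bound, and the function name are assumptions chosen for illustration.

```python
import random

def arrival_times(num_senders, max_delay_s, seed=0):
    # Arrival time of each sender's first packet at the bottleneck,
    # with a random start delay of up to max_delay_s (assumed values).
    rng = random.Random(seed)
    return sorted(rng.random() * max_delay_s for _ in range(num_senders))

# Without staggering, all 100 responses arrive as one synchronized burst;
# with up to 2 ms of jitter, they spread out across the delay window.
synced = arrival_times(100, 0.0)
staggered = arrival_times(100, 0.002)
print(max(synced) - min(synced))  # 0.0 -> one big burst at the switch
```

The spread of the staggered arrivals is what relieves the shallow switch buffer; the price, as noted above, is that the last sub-response now finishes up to `max_delay_s` later.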
💡 System design usually includes trade-offs. In the above example, avoiding incast might be good for the network and overall system performance, but for individual flows, it increases completion times.
Some approaches to incast avoidance rely on global scheduling. Each sender asks a central server (arbiter) for a transmission time slot, and perhaps also for the network path its traffic should follow. Since the arbiter has a complete picture of the network and of each sender's traffic demands, it can orchestrate the traffic in a way that prevents queue buildup. However, such designs add extra delay (a round trip to the arbiter before sending), which might be a significant overhead for short flows, and they also increase the control traffic in the network. Moreover, depending on the load of arbitration requests and responses, the arbiter itself may become the bottleneck and suffer packet losses.
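A toy arbiter makes the idea concrete: by handing out non-overlapping slots, it serializes the senders' bursts at the bottleneck. The class name, the 1 ms slot length, and the request interface are assumptions, not any real scheduler's API.

```python
# Toy central arbiter: grants each requesting sender a distinct time slot
# so that their bursts never overlap at the bottleneck link.
class Arbiter:
    def __init__(self, slot_s=0.001):
        self.slot_s = slot_s        # assumed slot length: 1 ms
        self.next_free_s = 0.0      # next unallocated instant on the link

    def request_slot(self, sender_id):
        # Serialize senders: each gets the next free slot in turn.
        start = self.next_free_s
        self.next_free_s += self.slot_s
        return {"sender": sender_id,
                "start_s": start,
                "end_s": start + self.slot_s}

arb = Arbiter()
print(arb.request_slot("worker-1"))  # starts at 0.0
print(arb.request_slot("worker-2"))  # starts where worker-1's slot ends
```

Notice the hidden cost the insight below points at: every sender must wait for this exchange before transmitting, and the arbiter handles every request in the cluster.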
💡 Having the complete picture might seem attractive. However, it may incur significant overhead and hidden costs for individual components.
This blog looked at the loss of throughput in data centers running applications with high fan-in traffic. It discussed how the fan-in, TCP's loss recovery and congestion control algorithms, and operating system constraints lead to the throughput collapse called TCP incast. It also covered various solutions and highlighted insights for system design and problem-solving.