Large systems sometimes behave unexpectedly. On the surface, the implementation would seem flawless—potential bugs preempted and best practices followed to a tee. Now, if the system crashes, it is easy to detect the crash, and the problem can be isolated and fixed. However, sometimes the system does not crash. Instead, it runs but either gives incorrect results or negatively impacts the performance. This is a tricky scenario because the malfunction may remain undetected. In this case, various aspects of the system need to be monitored to detect problems and, ideally, preempt them.
In this blog, we will discuss one such problem: the loss of throughput in data centers caused by TCP incast.
Let’s go!
Many data center applications (e.g., web search, data analytics, recommender systems, and social networking) and frameworks (like MapReduce) follow a many-to-one communication pattern. In other words, they have a partition-aggregate workload, where a task is split and assigned to different machines and their responses are aggregated. Let's look at an example to better understand this point.
In applications such as web search, when a user request arrives at a server (aggregator), it is split into multiple smaller sub-requests and forwarded to different worker machines. The workers process the requests and return their results, which are combined at the aggregator and finally returned to the end user. The fan-in may be as high as 100 to 200 responses converging at the receiver.
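The partition-aggregate pattern can be sketched with a toy aggregator. This is a minimal illustration, not a real framework's API; the `worker` and `aggregate` names and the request format are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(sub_request):
    # Hypothetical worker: processes one sub-request (illustrative only).
    return f"result-for-{sub_request}"

def aggregate(request, num_workers=4):
    # Partition: split the request into one sub-request per worker.
    sub_requests = [f"{request}/part-{i}" for i in range(num_workers)]
    # Fan-out to the workers; their responses fan back in at this aggregator.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        responses = list(pool.map(worker, sub_requests))
    # Aggregate: combine the partial results into the final answer.
    return responses

print(aggregate("search-query", num_workers=4))
```

In a real deployment, the fan-out would go over the network to separate machines, and the fan-in of their simultaneous responses is exactly what creates the burst discussed next.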
But why do we need to split a task and aggregate responses?
This partition-aggregate approach provides many benefits. Scalability and the performance speed-ups achieved through parallelization are the most important benefits. However, these benefits come at a cost. Because of simultaneous responses, the network often sees large congestive traffic bursts. These bursts might lead to a throughput collapse due to what is called TCP incast.
Let’s see what this is all about. But first, let’s take a brief detour to understand how TCP handles network congestion and packet loss.
A TCP sender starts by sending just a few packets (let's assume three) so as not to overwhelm the underlying network. If those packets are acknowledged, the sender gradually increases its transmission window, probing for the available capacity; if packets are lost, it treats the loss as a sign of congestion and backs off.
So the questions are: how does a TCP sender detect packet loss, and how does it determine the severity of traffic congestion? There are two ways a TCP sender infers these: a retransmission time-out, where no acknowledgment arrives within the time-out period, and duplicate acknowledgments, where three duplicate ACKs for the same sequence number signal that one packet was lost while later packets got through, triggering a fast retransmit.
Fast retransmission is much quicker than waiting for a time-out, which is usually several RTTs long; it allows the sender to react and retransmit within one RTT. Another difference is that fast retransmission implies a level of network congestion less severe than a time-out. In fast retransmit, a packet is lost but multiple packets transmitted after it are received, unlike the time-out scenario, where no packets reach the receiver at all (otherwise, some acknowledgments would have been received). Because of this, on a time-out, the sender reduces its transmission window to one, whereas on three duplicate acknowledgments, the window is only halved because the congestion is less severe. Note that there are several TCP variants that handle congestion control differently from the approach described above, but the basic idea remains the same.
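The two reactions can be captured in a toy model. This is a deliberate simplification of TCP's behavior (real variants differ in details such as fast recovery and window growth), meant only to show the asymmetry between the two loss signals.

```python
def react_to_loss(cwnd, event):
    # Toy model of a TCP sender's reaction to loss signals.
    # Simplified: real TCP variants differ in the details.
    if event == "timeout":
        # Severe congestion inferred: collapse the window to one segment.
        return 1
    elif event == "triple_dup_ack":
        # Milder congestion: fast retransmit, and the window is halved.
        return max(cwnd // 2, 1)
    # No loss signal: window unchanged.
    return cwnd

print(react_to_loss(16, "triple_dup_ack"))  # 8
print(react_to_loss(16, "timeout"))         # 1
```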
Now that we have an idea of congestion control and packet loss recovery, let’s look into TCP incast.
In an incast scenario, the requests are sent simultaneously, so the responses get synchronized and arrive as a large traffic spike at the bottleneck link. Data center switches usually have shallow buffers, and the available buffer space is often smaller than the burst, so the initial packets of several TCP flows may get dropped. Because these losses cluster at the start of the flows, the acknowledgments reaching the senders may be insufficient to trigger quick recovery (fast retransmit) through three duplicate ACKs. This leads to TCP time-outs.
Due to operating system constraints, this time-out waiting period can't be less than a certain minimum threshold (RTOmin), usually around 10 ms. Now, compare this to RTTs in a data center, which are typically around a hundred microseconds, a hundredfold smaller than the RTOmin. As a result, the link capacity would go unused for a hundred or more RTTs, and since multiple flows might be affected at once, the actual impact would be much greater: throughput would be severely degraded. If the traffic were deadline sensitive, for example, the results of a web search, the timed-out flows would have to be ignored, causing incomplete results to be returned to the end user. Even in the absence of any deadlines, the delays would still negatively impact the end user's experience.
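A quick back-of-envelope calculation makes the mismatch concrete. The 100-microsecond RTT and 10 ms RTOmin are the illustrative figures used above, not measurements.

```python
# Back-of-envelope: how many RTTs of link capacity one time-out wastes.
# Both numbers are illustrative assumptions from the discussion above.
rtt_s = 100e-6      # assumed data center round-trip time: 100 microseconds
rto_min_s = 10e-3   # assumed minimum retransmission time-out: 10 ms

wasted_rtts = rto_min_s / rtt_s
print(wasted_rtts)  # 100.0 -> the link sits idle for ~100 RTTs per time-out
```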
💡 TCP has been studied and known to work well in the wide-area network. That doesn’t mean that TCP would always work. When moving to a different environment, such as a microsecond RTT network, we should carefully revisit all previous measurements and insights and ascertain if they still hold.
A seemingly simple option is to either increase the buffer sizes or decrease the fan-in. The former is expensive and also increases queuing delays, because data center traffic also includes long-running background flows (due to, for example, data migrations and updates) that would fill up the larger buffers. The latter trades off scalability and reliability for a lower fan-in, because the load would now be distributed among fewer worker nodes.
💡 First-order problem-solving might turn out to be a temporary fix and not a valid long-term solution. We always need to go deeper.
Several approaches have been proposed to deal with TCP incast. Broadly, these techniques either focus on recovering from incast quickly or try to avoid it altogether.
One option is to further reduce TCP's RTOmin. Since data center RTTs are in microseconds, ideally, we want TCP's minimum time-out waiting period to reflect that. If only it were that simple. This approach requires very high-resolution timers, which increase system load and adversely affect the performance of other processes running on the servers. So this is not feasible in practice.
💡 Some ideas may seem good in theory and simulations, but when we look at an actual implementation, we begin to realize its inherent weaknesses and constraints.
Some problems can be addressed with a randomization-based approach. Since the initial clustered packet losses may belong to multiple flows, the time-outs, and hence the retransmissions, would also get synchronized, causing another incast. Instead, each sender can select a random RTOmin value so that it may transmit while other senders are silent, preventing collisions. Senders can also retransmit packets probabilistically, without waiting for a time-out, so that there are always some packets in transit, reducing sender idle time and preventing a throughput collapse. In both cases, selecting the right probability is critical: a low value would not provide much benefit, while a high value would cause unnecessary retransmissions or multiple simultaneous transmissions that would further worsen the congestion.
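Both randomized ideas fit in a few lines. These are sketches of the concepts, not real TCP knobs; the function names, the 10 ms base value, and the jitter bound are all assumptions for illustration.

```python
import random

def desynchronized_rto(base_rto_s=0.010, jitter_s=0.010, rng=None):
    # Each sender adds a random offset to its minimum time-out so that
    # flows whose packets were dropped in the same burst do not all
    # retransmit at the same instant. (Sketch; values are assumed.)
    rng = rng or random.Random()
    return base_rto_s + rng.random() * jitter_s

def should_probe(p=0.05, rng=None):
    # Probabilistic early retransmission: with probability p, resend a
    # packet before the timer fires, keeping some packets in flight.
    # Too low a p helps little; too high a p adds to the congestion.
    rng = rng or random.Random()
    return rng.random() < p

rto = desynchronized_rto(rng=random.Random(1))
print(0.010 <= rto <= 0.020)  # True: the time-out lands in the jittered range
```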
💡 Not all solutions have to be deterministic. Sometimes, a randomization-based approach may be less complex with comparable, if not better, results.
Sometimes, we need some signaling between the senders, receivers, and the underlying network routers for better decision-making. Generating additional control traffic might not be an option because it would further add to the congestion problem. There is a simpler way: when a router needs to drop a packet, it can drop only the payload and let the packet header pass. The header is much smaller than the payload, so this reduces the congestion while, at the same time, the receiver (and through it, the sender) finds out that the payload was dropped due to congestion and can take appropriate action without waiting for the time-out.
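The payload-trimming idea can be sketched as a router's forwarding decision. The dict layout and field names here are stand-ins, not any real switch API.

```python
# Sketch of payload trimming at a congested router: instead of dropping
# the whole packet, keep the small header so the endpoints learn about
# the congestion quickly. The packet layout is an assumed stand-in.
def forward(packet, queue_full):
    if queue_full:
        # Trim: drop the payload, let the header through as a loss signal.
        return {"header": packet["header"], "payload": None, "trimmed": True}
    # Otherwise, forward the packet unchanged.
    return {**packet, "trimmed": False}

pkt = {"header": {"src": "worker-7", "seq": 42}, "payload": "x" * 1460}
print(forward(pkt, queue_full=True))
```

The trimmed header costs the switch almost no buffer space, yet it tells the receiver exactly which sequence number to ask for again, well before any timer fires.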
💡 When designing a solution, try to figure out an approach that requires you to gain the most while making minimum changes to the system. Such a solution would be less complex and easier to adopt.
Sometimes, proactively avoiding a problem is better than recovering from it. Recovering from a situation can incur some delays that have an adverse effect on delay-sensitive traffic and the end user experience. In this case, it is better to keep congestion low and prevent queue buildup and incast.
In application-level staggering, each sender tries to avoid incast by introducing random delays before starting transmission. This prevents the synchronization of flows, reducing the high fan-in traffic burst. However, there is a trade-off—chances of incast are reduced, but the average flow completion time increases.
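A small simulation shows the effect of staggering. The sender count, the 2 ms jitter bound, and the function name are assumptions chosen for illustration.

```python
import random

def arrival_times(num_senders, max_delay_s, seed=0):
    # Arrival time of each sender's first packet at the bottleneck,
    # with a random start delay of up to max_delay_s (assumed values).
    rng = random.Random(seed)
    return sorted(rng.random() * max_delay_s for _ in range(num_senders))

# Without staggering, all 100 responses arrive as one synchronized burst;
# with up to 2 ms of jitter, they spread out across the delay window.
synced = arrival_times(100, 0.0)
staggered = arrival_times(100, 0.002)
print(max(synced) - min(synced))  # 0.0 -> one big burst at the switch
```

The spread of the staggered arrivals is what relieves the shallow switch buffer; the price, as noted above, is that the last sub-response now finishes up to `max_delay_s` later.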
💡 System design usually includes trade-offs. In the above example, avoiding incast might be good for the network and overall system performance, but for individual flows, it increases completion times.
Some approaches to incast avoidance rely on global scheduling. Each sender asks a central server (arbiter) for a transmission time slot, and perhaps also for the network path its traffic should follow. Since the arbiter has a complete picture of the network and of each sender's traffic demands, it can orchestrate the traffic in a way that prevents queue buildup. However, such designs add extra delay (a round trip to the arbiter before sending), which might be a significant overhead for short flows, and they also increase the control traffic in the network. Moreover, depending on the load of arbitration requests and responses, the arbiter itself may become the bottleneck and suffer packet losses.
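A toy arbiter makes the idea concrete: by handing out non-overlapping slots, it serializes the senders' bursts at the bottleneck. The class name, the 1 ms slot length, and the request interface are assumptions, not any real scheduler's API.

```python
# Toy central arbiter: grants each requesting sender a distinct time slot
# so that their bursts never overlap at the bottleneck link.
class Arbiter:
    def __init__(self, slot_s=0.001):
        self.slot_s = slot_s        # assumed slot length: 1 ms
        self.next_free_s = 0.0      # next unallocated instant on the link

    def request_slot(self, sender_id):
        # Serialize senders: each gets the next free slot in turn.
        start = self.next_free_s
        self.next_free_s += self.slot_s
        return {"sender": sender_id,
                "start_s": start,
                "end_s": start + self.slot_s}

arb = Arbiter()
print(arb.request_slot("worker-1"))  # starts at 0.0
print(arb.request_slot("worker-2"))  # starts where worker-1's slot ends
```

Notice the hidden cost the insight below points at: every sender must wait for this exchange before transmitting, and the arbiter handles every request in the cluster.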
💡 Having the complete picture might seem attractive. However, it may incur significant overhead and hidden costs for individual components.
This blog looked at the loss of throughput in data centers running applications with high fan-in traffic. It discussed how the fan-in, TCP's loss recovery and congestion control algorithms, and operating system constraints lead to the throughput collapse called TCP incast. It also covered various solutions and highlighted insights for system design and problem-solving.