What is branch prediction?

Branch prediction is a technique to predict the outcome of a conditional operation. It's an essential part of modern CPUs. They contain a dedicated hardware unit called a branch prediction unit (BPU). BPU can predict which branch will be chosen with high confidence given the current encoded instruction.

Why is branch prediction needed?

Modern CPUs implement pipelining to ensure that no cycle goes wasted. Pipeline ensures that no portion of the CPU is sitting idle waiting for the information to be fetched or decoded, so the next instruction is fetched. In contrast, the current instruction is being decoded.

However, when a branch instruction comes to the CPU, it's not sure which instruction to fetch next as that depends on the current instruction’s result. This leads to cycles being wasted as the CPU waits for the next instruction to be fetched once the current instruction finishes its execution. To counter this problem, BPU predicts the branch, and the CPU fetches it while the current instruction is being decoded. This can lead to one of the two situations as follows:

  • Correct prediction: This will result in no wastage of cycles resulting in faster execution.

  • Incorrect prediction: This will cause the CPU to fetch the correct instruction and discard the already-fetched instruction, leading to the wastage of cycles.

Accurate prediction techniques need to be used to increase the efficiency of the CPUs. Branch prediction techniques can be divided into two main categories given as follows:

  • Static branch prediction

  • Dynamic branch prediction

Static branch prediction

The static branch prediction is the most straightforward branch prediction technique as it always predicts the same whenever it sees a specific branch. The possible decisions of a static branch predictor are as follows:

  • A branch is taken.

  • A branch is not taken.

  • The forward branch (if-else) is not taken, and the backward branch (loop) is taken.

These decisions remain the same and do not rely on the branch’s history.

Example

To understand it further, we consider a simple example where we populate an array of size 1010 with values between 00 and 9999 inclusive. Then we sort the array and then traverse through the array to see if an index contains a value greater than equal to 5050. The working of static branch prediction is shown below:

1 of 7

Note: On each iteration of the for loop, the branch prediction also takes place, but for simplicity, we ignore that in the example above.

We can see that predictions by the static branch prediction technique resulted in the correct branch half the time and wrong in the other half, making the accuracy 50%50\%, which is decent. Still, modern CPUs can not rely upon static branch prediction.

Dynamic branch prediction

Dynamic branch predictors rely on the past results of the conditional operation to predict whether to take the branch or not. There are many dynamic branch predicting techniques. However, the most popular ones are as follows:

  • 1-bit branch prediction

  • 2-bit branch prediction

The 1-bit prediction

The 1-bit dynamic branch prediction technique relies on the most recent result of the conditional operator to predict the next branch. To understand this further, we consider the following example:

1 of 10

We can see that the 1-bit dynamic branch predictor predicts the correct branch 88 out of 1010 times which is an accuracy of 80%80\%. This represents a very good improvement over the static predictor, which predicts with a 50%50\% accuracy.

However, the data in this example suits the 1-bit dynamic predictor, so this accuracy can decrease or increase depending upon the data. In most cases, a 1-bit dynamic predictor will still provide better accuracy than the static predictor.

The 2-bit prediction

The 2-bit predictor is another type of dynamic branch predictor. This considers the last two outputs of the conditional operator. It changes its output when its prediction is false twice. The 2-bit predictor has a 4-state Finite Automata (FA) compared to the 2-state FA of the 1-bit dynamic branch predictor.

A 4-state FA of the 2-bit dynamic branch predictor

This FA is complicated compared to the one we saw in the example above. Although the basics remain the same, if the conditional operator runs for the first time, the initial prediction is randomly generated. After that, the states are changed according to the latest output of the operator.

The question remains how will the predictions be made from the states of the FA shown above? Well, to understand, we have to look at the table below:

State

Value

Strongly likely taken

11

Likely taken

10

Likely not taken

01

Strongly not likely taken

01

The table above shows the states in 22 bits. These bits help us predict the branch. The most significant bit (left bit) is used to predict whether the branch will be taken or not. If the bit is 11, it indicates that the branch will be taken, else not.

The 2-bit predictor achieves better accuracy with nested loops than the 1-bit predictor. We consider another example to understand the working of a 2-bit dynamic branch predictor. In this example, we run the outer loop 100100 times, and the inner loop runs till the a is less than 1010 and is reset to 00 in the outer loop.

1 of 15

The above example shows that in the outer loop's first iteration, the 2-bit predictor predicted false 33 times. However, in all the upcoming iterations, it predicted false only 11 time. The true and false predictions are shown in a simplified table below:

a

1st iteration

2nd iteration

3rd iteration

100th iteration

0

false

true

true

true

1

false

true

true

true

2

true

true

true

true

3

true

true

true

true

4

true

true

true

true

5

true

true

true

true

6

true

true

true

true

7

true

true

true

true

8

true

true

true

true

9

true

true

true

true

10

false

false

false

false

We can see a trend in the above table. Using this, we can calculate the accuracy of the 2-bit predictor as shown below:

This shows a significant improvement over the 1-bit predictor with an accuracy of 80%80\% in the example above. However, if we apply a 1-bit predictor to this example, the accuracy drops to 72.72%72.72\%.

Example

In this example, we'll write the code of the first example.

Note: The loop on line 35 increases the execution time to be easily readable.

#include <algorithm>
#include <ctime>
#include <iostream>
void SortData(int data[], int size)
{
for(int i = 0; i < size;i++)
{
for(int j = 0; j < size;j++)
{
if(data[j] > data[i])
{
int temp = data[j];
data[j] = data[i];
data[i] = temp;
}
}
}
}
int main()
{
int size = 10;
int data[size]; // Declare an array of size 10
for (int i = 0; i < size; i++)
{
data[i] = std::rand() % 100;
}
clock_t start = clock();
// Sort the data
SortData(data, size); // Feel free to comment it
for(int i = 0; i < 100000000 ; i++)
{
for (int j = 0; j < size; j++)
{
if (data[j] >= 50){}
}
}
double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << '\n';
}

Explanation

The explanation of the code is as follows:

  • Line 28: We add random data between and including 00 and 9999 to the data.

  • Line 33: The SortData() is called to sort the data in ascending order.

  • Line 39: The branching condition is implemented.

Time complexity

The time complexity is O(n2)O(n^2) because of the sort function. However, if we do not sort the data, the time complexity reduces O(n)O(n). This would suggest that the code will execute faster without sorting, but in reality, the opposite happens. The code with higher time complexity executes faster. The next question that comes to mind is how and why. Well, this is because of branch prediction.

Analysis

To understand the discrepancy in the run time, we must understand the problem more clearly.

The sorted array increases the accuracy of the branch predictor to 80%80\% compared to 40%40\% of the unsorted array. Every time a prediction is false, the pipeline is disturbed, and new instructions are fetched, resulting in performance wastage. This is how the sorted array code is executed faster despite having a higher time complexity.

Note: Run the code after commenting the SortData function call to see the execution time of unsorted code and then compare.

Free Resources

Copyright ©2024 Educative, Inc. All rights reserved