What is auto-vectorization?

Auto-vectorization is a compiler optimization that automatically transforms certain kinds of code, typically loops, to take advantage of the Single Instruction, Multiple Data (SIMD) unit, a feature of modern processors.

Note: The SIMD unit is a component of the processor (CPU) specifically designed to perform the same operation on multiple data elements simultaneously using a single instruction.

By doing this, the program can run faster and more efficiently, which is essential for applications that require processing large amounts of data and are computationally expensive, such as video games or scientific simulations.

How auto-vectorization works

When we write software programs, we often need to perform the same operation on a large set of data. For example, we might have two lists of numbers and need to add them element by element. Without vectorization, the processor performs this operation one pair of numbers at a time, which can be slow if we have a lot of numbers to process.

One real-world example would be in image processing where merging two images together pixel by pixel could take a long time if done sequentially. We can achieve merging via the addition of image pixel values. By vectorizing the addition operation, multiple pixel values can be added together simultaneously, significantly reducing the processing time.
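As a minimal sketch of the image example, assume 8-bit grayscale pixels stored in flat arrays. The function name merge_pixels and the choice to clamp sums at 255 are assumptions made for illustration, not part of any particular image library. A simple loop of this shape is a typical candidate for auto-vectorization, although whether it is actually vectorized depends on the compiler, its flags, and the target processor.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: adds two 8-bit images pixel by pixel.
// Sums above 255 are clamped (saturated) so they do not wrap around.
void merge_pixels(const uint8_t *img1, const uint8_t *img2,
                  uint8_t *out, size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        // Widen to unsigned int so the sum cannot overflow before clamping
        unsigned int sum = (unsigned int)img1[i] + (unsigned int)img2[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

Every iteration reads and writes only index i, so no iteration depends on another. That independence is what allows several pixels to be processed by one instruction.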

Auto-vectorization modifies the code to perform the same operation on multiple data elements simultaneously. The compiler, a software program that translates human-readable code to machine code, performs auto-vectorization during the compilation process. It analyzes the code and considers the specific hardware and software configuration to determine whether the code can be vectorized and how best to do so. This optimization process involves modifying the code to take advantage of the SIMD unit in the most effective way possible.
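To illustrate the kind of analysis involved, here is a small sketch with two hypothetical functions (both names are made up for this example). In the first, every iteration is independent, so several iterations can be computed at once. In the second, each iteration reads the value written by the previous one. This is a loop-carried dependency, and it generally prevents a straightforward vectorization of the loop.

```c
#include <stddef.h>

// Independent iterations: out[i] depends only on a[i] and b[i].
// The restrict qualifiers promise the compiler that the arrays do not overlap.
void scale_add(const float *restrict a, const float *restrict b,
               float *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        out[i] = 2.0f * a[i] + b[i];
    }
}

// Loop-carried dependency: data[i] needs data[i - 1] from the previous
// iteration, so the iterations cannot simply run side by side.
void running_sum(float *data, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        data[i] = data[i] + data[i - 1];
    }
}
```

Both functions are valid C and produce correct results either way. The difference is only in how much freedom the compiler has to reorder the work.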

If the code is deemed suitable for vectorization, the compiler generates machine code that uses SIMD instructions to perform the computation. Because auto-vectorization happens at compile time, it is not directly visible while the program runs; the executable simply contains the vectorized instructions. However, compiler optimization reports and some profiling tools can show whether certain parts of the code were vectorized, which helps developers understand the impact of vectorization.
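One way to check this without a profiler is to ask the compiler for a vectorization report. The sketch below assumes GCC or Clang; the file name saxpy.c and the function are placeholders for illustration. The commands in the comment use the -fopt-info-vec (GCC) and -Rpass=loop-vectorize (Clang) options, and the exact wording of the report varies between compiler versions.

```c
#include <stddef.h>

// Compile without linking and request a report of vectorized loops:
//   gcc   -O3 -c -fopt-info-vec saxpy.c
//   clang -O3 -c -Rpass=loop-vectorize saxpy.c
// To list loops that were NOT vectorized instead:
//   gcc   -O3 -c -fopt-info-vec-missed saxpy.c
//   clang -O3 -c -Rpass-missed=loop-vectorize saxpy.c
void saxpy(float alpha, const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        // y = alpha * x + y, computed element by element
        y[i] = alpha * x[i] + y[i];
    }
}
```

If the loop is vectorized, the report should point at the source line of the for statement.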

Example

Let's look at an example in the C language. The following is a simplified illustration of the vectorization process:


Original code

#include <stdio.h>
// define N as the number of elements in each array
#define N 4
// add_arrays function: adds corresponding elements of arrays arr1 and arr2 and stores the result in sum
// the loop inside add_arrays is the part that will be vectorized
void add_arrays(float *arr1, float *arr2, float *sum)
{
    // loop over all N elements, adding one pair per iteration
    for (int i = 0; i < N; i++)
    {
        sum[i] = arr1[i] + arr2[i];
    }
}
int main()
{
    // initialize arrays arr1 and arr2 with some values
    float arr1[N] = {1.0, 3.0, 3.0, 4.0};
    float arr2[N] = {5.0, 4.0, 5.0, 4.0};
    float sum[N];
    // call the add function
    add_arrays(arr1, arr2, sum);
    // print the values in array sum
    for (int i = 0; i < N; i++)
    {
        printf("%.2f + %.2f = %.2f\n", arr1[i], arr2[i], sum[i]);
    }
    return 0;
}

Vectorized code

The compiler analyzes the loop to determine whether it is eligible for vectorization. In this case, the loop iterates over one-dimensional arrays, no iteration depends on another, and the addition can be performed using SIMD instructions. The compiler then transforms the loop so that each iteration processes several elements at once instead of one. The compiler emits machine instructions directly rather than C source, so the listing below is a hand-written equivalent that uses SSE intrinsics to illustrate the result.

#include <stdio.h>
#include <xmmintrin.h> // SSE intrinsics
#define N 4 // must be a multiple of 4 for the loop below
void add_arrays(float* arr1, float* arr2, float* sum)
{
    for (int i = 0; i < N; i += 4)
    {
        // declare 128-bit (16-byte) variables, each holding four floats
        __m128 a, b, c;
        // load four elements from each array
        a = _mm_loadu_ps(&arr1[i]);
        b = _mm_loadu_ps(&arr2[i]);
        // add the four pairs of elements
        c = _mm_add_ps(a, b);
        // store the four results into the sum array
        _mm_storeu_ps(&sum[i], c);
    }
}
int main()
{
    // initialize arrays arr1 and arr2 with some values
    float arr1[N] = {1.0, 3.0, 3.0, 4.0};
    float arr2[N] = {5.0, 4.0, 5.0, 4.0};
    float sum[N];
    // call the add function
    add_arrays(arr1, arr2, sum);
    // print each element of the arrays
    for (int i = 0; i < N; i++)
    {
        printf("%.2f + %.2f = %.2f\n", arr1[i], arr2[i], sum[i]);
    }
    return 0;
}

Explanation

  • Line 9: We declare three 128-bit variables using the __m128 data type. Each one holds four single-precision floating-point values.

  • Lines 11–12: The _mm_loadu_ps function loads four floating-point values from each input array into the __m128 variables.

  • Line 14: The element-wise addition from the original loop is replaced by the _mm_add_ps function, which maps to a Streaming SIMD Extensions (SSE) instruction that adds four pairs of floating-point values at once.

  • Line 16: The _mm_storeu_ps function stores the four resulting floating-point values into the sum array.

Note: SSE is a SIMD extension to the x86 instruction set. Together with its successors (SSE2 and later), it provides instructions for arithmetic and logical operations on floating-point and integer data. These instructions operate on multiple data elements in parallel, which can significantly improve the performance of computationally intensive applications.
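The vectorized listing above relies on the array length being a multiple of 4. For an arbitrary length, a vectorized loop typically needs a scalar remainder loop for the leftover elements. The following sketch shows that structure under a few assumptions: add_arrays_n is a hypothetical name, the target is x86 with SSE, and the compiler defines the __SSE__ macro (as GCC and Clang do). Where the macro is not defined, the code falls back to the scalar loop alone.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h> // SSE intrinsics (x86 only)
#endif

// Hypothetical generalization of add_arrays to any length n.
void add_arrays_n(const float *arr1, const float *arr2, float *sum, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    // Main loop: process four floats per iteration while at least four remain
    for (; i + 4 <= n; i += 4)
    {
        __m128 a = _mm_loadu_ps(&arr1[i]);
        __m128 b = _mm_loadu_ps(&arr2[i]);
        _mm_storeu_ps(&sum[i], _mm_add_ps(a, b));
    }
#endif
    // Remainder loop: handle the last n % 4 elements one at a time
    // (or all n elements when SSE is not available)
    for (; i < n; i++)
    {
        sum[i] = arr1[i] + arr2[i];
    }
}
```

With n = 6, for example, the main loop handles elements 0 to 3 in one step and the remainder loop handles elements 4 and 5.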

Conclusion

Auto-vectorization is a powerful software optimization technique that can significantly improve the performance of computationally intensive applications. By processing multiple data elements with each instruction, auto-vectorization can improve the program's throughput and reduce the number of instructions the processor has to execute. Used well, it lets programs make fuller use of the SIMD hardware in modern processors without requiring developers to write vector code by hand.
