
Implementing Parallel std::transform()

Learn how to implement a parallel std::transform() with a naive approach, evaluate its performance, and understand the shortcomings of the implementation.

Although std::transform() is algorithmically easy to implement, in practice even a rudimentary parallel version is more complex than it might appear at first sight.

The algorithm std::transform() calls a function for each element in a sequence and stores the result in another sequence. A possible implementation of a sequential version of std::transform() may look something like this:

template <class SrcIt, class DstIt, class Func>
auto transform(SrcIt first, SrcIt last, DstIt dst, Func func) {
  while (first != last) {
    *dst++ = func(*first++);
  }
}

The standard library version also returns the dst iterator, but we will ignore that in our examples. To understand the challenges with a parallel version of std::transform(), let’s begin with a naive approach.
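As a quick illustration (a hypothetical usage sketch, not part of the original example), our transform() could be called like this; the ::-qualification avoids ambiguity with std::transform(), which argument-dependent lookup would otherwise also find for std::vector iterators:

#include <cassert>
#include <vector>

int main() {
  auto src = std::vector<int>{1, 2, 3, 4};
  auto dst = std::vector<int>(src.size());
  // Square each element of src and write the result into dst
  ::transform(src.begin(), src.end(), dst.begin(), [](int v) { return v * v; });
  assert((dst == std::vector<int>{1, 4, 9, 16}));
}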

Naive implementation

A naive parallel implementation of std::transform() would probably look something like this:

  • Divide the elements into chunks corresponding to the number of cores in the computer

  • Process each chunk in a separate task

  • Wait for all tasks to finish

Using std::thread::hardware_concurrency() to determine the number of supported hardware threads, a possible implementation could look like this:

template <typename SrcIt, typename DstIt, typename Func>
auto par_transform_naive(SrcIt first, SrcIt last, DstIt dst, Func f) {
  auto n = static_cast<size_t>(std::distance(first, last));
  auto n_cores = size_t{std::thread::hardware_concurrency()};
  auto n_tasks = std::max(n_cores, size_t{1});
  auto chunk_sz = (n + n_tasks - 1) / n_tasks; // Ceiling division
  auto futures = std::vector<std::future<void>>{};
  // Process each chunk on a separate task
  for (auto i = 0ul; i < n_tasks; ++i) {
    auto start = chunk_sz * i;
    if (start < n) {
      auto stop = std::min(chunk_sz * (i + 1), n);
      auto fut = std::async(std::launch::async, [first, dst, start, stop, f]() {
        std::transform(first + start, first + stop, dst + start, f);
      });
      futures.emplace_back(std::move(fut));
    }
  }
  // Wait for each task to finish
  for (auto&& fut : futures) {
    fut.wait();
  }
}

Note that hardware_concurrency() might return 0 if the number of supported hardware threads is, for some reason, undetermined; the value is therefore clamped to be at least one.

A subtle difference between std::transform() and our parallel version is that they put different requirements on the iterators. std::transform() can operate on input and output iterators such as std::istream_iterator<> bound to std::cin. This is not possible with par_transform_naive(), since the iterators are copied and used from multiple tasks. As we will see, none of the parallel algorithms presented in this chapter can operate on input and output iterators. Instead, the parallel algorithms require at least forward iterators, which allow multi-pass traversal.
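To make the difference concrete, here is a small sketch (with a hypothetical read_and_double() function, not from the original text) showing what the sequential version can do and the parallel one cannot:

#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>

void read_and_double() {
  auto v = std::vector<int>{};
  // OK: std::transform() makes a single pass over the input stream
  std::transform(std::istream_iterator<int>{std::cin},
                 std::istream_iterator<int>{}, std::back_inserter(v),
                 [](int x) { return x * 2; });
  // Not OK: par_transform_naive() copies its iterators into several tasks
  // and computes first + start, which single-pass stream iterators cannot do
}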

Performance evaluation

Continuing with the naive implementation, let's evaluate its performance with a simple benchmark comparing it against the sequential version of std::transform() executing on a single CPU core.

In this test, we will measure the wall-clock time and the total time spent on the CPUs while varying the input size of the data.
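To clarify the distinction before turning to Google Benchmark, here is a rough, hypothetical sketch of how the two quantities differ; assume work() is some parallel function, and note that std::clock() reports process CPU time on most platforms (though not all):

#include <chrono>
#include <ctime>
#include <iostream>

void measure(void (*work)()) {
  const auto wall_start = std::chrono::steady_clock::now();
  const auto cpu_start = std::clock();
  work();
  // Total CPU time accumulated by all threads of the process
  const auto cpu_ticks = std::clock() - cpu_start;
  // Elapsed wall-clock time; with perfect scaling on n cores,
  // the CPU time is roughly n times the wall-clock time
  const auto wall_time = std::chrono::steady_clock::now() - wall_start;
  std::cout << "CPU: " << double(cpu_ticks) / CLOCKS_PER_SEC << "s, wall: "
            << std::chrono::duration<double>(wall_time).count() << "s\n";
}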

We will set up this benchmark using Google Benchmark. To avoid duplicating code, we’ll implement a function to set up a test fixture for our benchmark. The fixture needs a source range with some example values, a destination range for the result, and a transform function:

auto setup_fixture(size_t n) {
  auto src = std::vector<float>(n);
  std::iota(src.begin(), src.end(), 1.0f); // "src" goes from 1.0 to n
  auto dst = std::vector<float>(src.size());
  auto transform_func = [](float v) {
    auto sum = v;
    for (auto i = 0; i < 500; ++i) {
      sum += (i * i * i * sum);
    }
    return sum;
  };
  return std::tuple{src, dst, transform_func};
}

Now that we have our fixture set up, it's time to implement the actual benchmarks. There will be two versions: one for the sequential std::transform() and one for our parallel version, par_transform_naive():

void bm_sequential(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    std::transform(src.begin(), src.end(), dst.begin(), f);
  }
}

void bm_parallel(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    par_transform_naive(src.begin(), src.end(), dst.begin(), f);
  }
}

Only the code within the for-loops will be measured. By using state.range(0) as the input size, we can generate different values by appending a range of arguments to each benchmark. In fact, we need to specify a couple of settings for each benchmark, so we create a helper function that applies everything we need:

void CustomArguments(benchmark::internal::Benchmark* b) {
  b->Arg(50)->Arg(5'000)->Arg(50'000)->Arg(100'000)->Arg(10'000'000)
      ->MeasureProcessCPUTime()
      ->UseRealTime()
      ->Unit(benchmark::kMillisecond);
}
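The helper can then be applied when registering the benchmarks. A minimal sketch of the registration (the BENCHMARK macros and Apply() are standard Google Benchmark; the function names are ours from above):

BENCHMARK(bm_sequential)->Apply(CustomArguments);
BENCHMARK(bm_parallel)->Apply(CustomArguments);
BENCHMARK_MAIN(); // Expands to a main() that runs all registered benchmarks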

A few things to note about the custom arguments:

  • We pass the values 50, 5,000, 50,000, 100,000, and 10,000,000 as arguments to the benchmark. They are used as the input size when creating the vectors in the setup_fixture() ...