Implementing Parallel std::transform()
Learn about implementing parallel std::transform() with a naive approach, evaluate its performance, and understand the shortcomings of the implementation.
Although `std::transform()` is algorithmically easy to implement, in practice even a rudimentary parallel version is more complex than it might appear at first sight.

The algorithm `std::transform()` calls a function for each element in a sequence and stores the result in another sequence. A possible implementation of a sequential version of `std::transform()` may look something like this:
```cpp
template <class SrcIt, class DstIt, class Func>
auto transform(SrcIt first, SrcIt last, DstIt dst, Func func) {
  while (first != last) {
    *dst++ = func(*first++);
  }
}
```
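For example, doubling each element of a vector could look like this (a brief sketch for illustration; not part of the lesson's code):

```cpp
#include <algorithm>
#include <vector>

int main() {
  auto in = std::vector<int>{1, 2, 3};
  auto out = std::vector<int>(in.size());
  // Write the double of each element in "in" to "out"
  std::transform(in.begin(), in.end(), out.begin(),
                 [](int v) { return v * 2; });
  // out now contains {2, 4, 6}
}
```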
The standard library version also returns the `dst` iterator, but we will ignore that in our examples. To understand the challenges with a parallel version of `std::transform()`, let's begin with a naive approach.
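For reference, the signature-compatible version would simply return the iterator one past the last written element; a minimal sketch:

```cpp
template <class SrcIt, class DstIt, class Func>
auto transform(SrcIt first, SrcIt last, DstIt dst, Func func) {
  while (first != last) {
    *dst++ = func(*first++);
  }
  return dst;  // One past the last element written, like std::transform()
}
```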
Naive implementation
A naive parallel implementation of `std::transform()` would probably look something like this:
- Divide the elements into chunks corresponding to the number of cores in the computer.
- Process each chunk in a separate task.
- Wait for all tasks to finish.
Using `std::thread::hardware_concurrency()` to determine the number of supported hardware threads, a possible implementation could look like this:
```cpp
template <typename SrcIt, typename DstIt, typename Func>
auto par_transform_naive(SrcIt first, SrcIt last, DstIt dst, Func f) {
  auto n = static_cast<size_t>(std::distance(first, last));
  auto n_cores = size_t{std::thread::hardware_concurrency()};
  auto n_tasks = std::max(n_cores, size_t{1});
  auto chunk_sz = (n + n_tasks - 1) / n_tasks;
  auto futures = std::vector<std::future<void>>{};
  // Process each chunk on a separate task
  for (auto i = 0ul; i < n_tasks; ++i) {
    auto start = chunk_sz * i;
    if (start < n) {
      auto stop = std::min(chunk_sz * (i + 1), n);
      auto fut = std::async(std::launch::async, [first, dst, start, stop, f]() {
        std::transform(first + start, first + stop, dst + start, f);
      });
      futures.emplace_back(std::move(fut));
    }
  }
  // Wait for each task to finish
  for (auto&& fut : futures) {
    fut.wait();
  }
}
```
Note that `hardware_concurrency()` might return 0 if it, for some reason, is undetermined, which is why the number of tasks is clamped to be at least one.
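As a quick illustration of how the naive version can be called (a minimal sketch with made-up data, assuming `par_transform_naive()` from above is in scope):

```cpp
#include <cmath>
#include <vector>

int main() {
  auto src = std::vector<float>{1.0f, 4.0f, 9.0f, 16.0f};
  auto dst = std::vector<float>(src.size());
  // Compute the square root of each element, spread over several tasks
  par_transform_naive(src.begin(), src.end(), dst.begin(),
                      [](float v) { return std::sqrt(v); });
  // dst now holds {1.0f, 2.0f, 3.0f, 4.0f}
}
```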
A subtle difference between `std::transform()` and our parallel version is that they put different requirements on the iterators. `std::transform()` can operate on input and output iterators, such as `std::istream_iterator<>` bound to `std::cin`. This is not possible with `par_transform_naive()`, since the iterators are copied and used from multiple tasks. As we will see, none of the parallel algorithms presented in this chapter can operate on input and output iterators. Instead, the parallel algorithms require at least forward iterators, which allow multi-pass traversal.
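To make the difference concrete, here is a sketch of a call that is valid for the sequential algorithm but would not compile with our parallel version:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>

int main() {
  // OK: std::transform() accepts single-pass input/output iterators
  std::transform(std::istream_iterator<int>{std::cin},
                 std::istream_iterator<int>{},
                 std::ostream_iterator<int>{std::cout, "\n"},
                 [](int v) { return v * v; });
  // Not OK with par_transform_naive(): it evaluates first + start,
  // which requires random access, and the tasks advance copies of the
  // iterators independently, which single-pass iterators cannot support.
}
```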
Performance evaluation
Continuing with the naive implementation, let's measure its performance with a simple evaluation comparing it to the sequential version of `std::transform()` executing on a single CPU core.

In this test, we will measure the wall-clock time and the total time spent on the CPUs while varying the input size of the data.
We will set up this benchmark using Google Benchmark. To avoid duplicating code, we’ll implement a function to set up a test fixture for our benchmark. The fixture needs a source range with some example values, a destination range for the result, and a transform function:
```cpp
auto setup_fixture(size_t n) {
  auto src = std::vector<float>(n);
  std::iota(src.begin(), src.end(), 1.0f);  // "src" goes from 1.0 to n
  auto dst = std::vector<float>(src.size());
  auto transform_func = [](float v) {
    auto sum = v;
    for (auto i = 0; i < 500; ++i) {
      sum += (i * i * i * sum);
    }
    return sum;
  };
  return std::tuple{src, dst, transform_func};
}
```
Now that we have our fixture set up, it's time to implement the actual benchmarks. There will be two versions: one for the sequential `std::transform()` and one for our parallel version, `par_transform_naive()`:
```cpp
void bm_sequential(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    std::transform(src.begin(), src.end(), dst.begin(), f);
  }
}

void bm_parallel(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    par_transform_naive(src.begin(), src.end(), dst.begin(), f);
  }
}
```
Only the code within the `for`-loops will be measured. By using `state.range(0)` as the input size, we can generate benchmarks for different sizes by appending a range of arguments to each benchmark. In fact, we need to specify a couple of arguments for each benchmark, so we create a helper function that applies all the settings we need:
```cpp
void CustomArguments(benchmark::internal::Benchmark* b) {
  b->Arg(50)->Arg(5000)->Arg(50'000)->Arg(100'000)->Arg(10'000'000)
      ->MeasureProcessCPUTime()
      ->UseRealTime()
      ->Unit(benchmark::kMillisecond);
}
```
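The helper can then be applied when registering the benchmarks. A minimal sketch, assuming the standard Google Benchmark registration macros (the lesson may register them elsewhere):

```cpp
// Register both benchmarks and apply the custom settings to each
BENCHMARK(bm_sequential)->Apply(CustomArguments);
BENCHMARK(bm_parallel)->Apply(CustomArguments);
BENCHMARK_MAIN();
```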
A few things to note about the custom arguments:
- We pass the values 50, 5,000, 50,000, 100,000, and 10,000,000 as arguments to the benchmark. They are used as the input size when creating the vectors in the `setup_fixture()` ...