Implementing Parallel std::transform()
Learn about implementing parallel std::transform() with a naive approach, evaluate its performance, and understand the shortcomings of the implementation.
Although `std::transform()` is algorithmically easy to implement, in practice even a rudimentary parallel version is more complex than it might appear at first sight.

The algorithm `std::transform()` calls a function for each element in a sequence and stores the result in another sequence. A possible implementation of a sequential version of `std::transform()` may look something like this:
```cpp
template <class SrcIt, class DstIt, class Func>
auto transform(SrcIt first, SrcIt last, DstIt dst, Func func) {
  while (first != last) {
    *dst++ = func(*first++);
  }
}
```
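For example, doubling each element of a vector could look like this (a brief sketch for illustration; not part of the lesson's code):

```cpp
#include <algorithm>
#include <vector>

int main() {
  auto in = std::vector<int>{1, 2, 3};
  auto out = std::vector<int>(in.size());
  // Write the double of each element in "in" to "out"
  std::transform(in.begin(), in.end(), out.begin(),
                 [](int v) { return v * 2; });
  // out now contains {2, 4, 6}
}
```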
The standard library version also returns the `dst` iterator, but we will ignore that in our examples. To understand the challenges with a parallel version of `std::transform()`, let's begin with a naive approach.
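For reference, the signature-compatible version would simply return the iterator one past the last written element; a minimal sketch:

```cpp
template <class SrcIt, class DstIt, class Func>
auto transform(SrcIt first, SrcIt last, DstIt dst, Func func) {
  while (first != last) {
    *dst++ = func(*first++);
  }
  return dst;  // One past the last element written, like std::transform()
}
```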
Naive implementation
A naive parallel implementation of `std::transform()` would probably look something like this:
- Divide the elements into chunks corresponding to the number of cores in the computer.
- Process each chunk in a separate task.
- Wait for all tasks to finish.
Using `std::thread::hardware_concurrency()` to determine the number of supported hardware threads, a possible implementation could look like this:
```cpp
template <typename SrcIt, typename DstIt, typename Func>
auto par_transform_naive(SrcIt first, SrcIt last, DstIt dst, Func f) {
  auto n = static_cast<size_t>(std::distance(first, last));
  auto n_cores = size_t{std::thread::hardware_concurrency()};
  auto n_tasks = std::max(n_cores, size_t{1});
  auto chunk_sz = (n + n_tasks - 1) / n_tasks;
  auto futures = std::vector<std::future<void>>{};
  // Process each chunk on a separate task
  for (auto i = 0ul; i < n_tasks; ++i) {
    auto start = chunk_sz * i;
    if (start < n) {
      auto stop = std::min(chunk_sz * (i + 1), n);
      auto fut = std::async(std::launch::async, [first, dst, start, stop, f]() {
        std::transform(first + start, first + stop, dst + start, f);
      });
      futures.emplace_back(std::move(fut));
    }
  }
  // Wait for each task to finish
  for (auto&& fut : futures) {
    fut.wait();
  }
}
```
Note that `hardware_concurrency()` might return 0 if it, for some reason, is undetermined, which is why the number of tasks is clamped to be at least one.
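As a quick illustration of how the naive version can be called (a minimal sketch with made-up data, assuming `par_transform_naive()` from above is in scope):

```cpp
#include <cmath>
#include <vector>

int main() {
  auto src = std::vector<float>{1.0f, 4.0f, 9.0f, 16.0f};
  auto dst = std::vector<float>(src.size());
  // Compute the square root of each element, spread over several tasks
  par_transform_naive(src.begin(), src.end(), dst.begin(),
                      [](float v) { return std::sqrt(v); });
  // dst now holds {1.0f, 2.0f, 3.0f, 4.0f}
}
```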
A subtle difference between `std::transform()` and our parallel version is that they put different requirements on the iterators. `std::transform()` can operate on input and output iterators, such as `std::istream_iterator<>` bound to `std::cin`. This is not possible with `par_transform_naive()`, since the iterators are copied and used from multiple tasks. As we will see, none of the parallel algorithms presented in this chapter can operate on input and output iterators. Instead, the parallel algorithms require at least forward iterators, which allow multi-pass traversal.
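To make the difference concrete, here is a sketch of a call that is valid for the sequential algorithm but would not compile with our parallel version:

```cpp
#include <algorithm>
#include <iostream>
#include <iterator>

int main() {
  // OK: std::transform() accepts single-pass input/output iterators
  std::transform(std::istream_iterator<int>{std::cin},
                 std::istream_iterator<int>{},
                 std::ostream_iterator<int>{std::cout, "\n"},
                 [](int v) { return v * v; });
  // Not OK with par_transform_naive(): it evaluates first + start,
  // which requires random access, and the tasks advance copies of the
  // iterators independently, which single-pass iterators cannot support.
}
```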
Performance evaluation
Continuing with the naive implementation, let's measure its performance with a simple evaluation comparing it to the sequential version of `std::transform()` executing on a single CPU core.

In this test, we will measure the wall-clock time and the total time spent on the CPUs while varying the input size of the data.
We will set up this benchmark using Google Benchmark. To avoid duplicating code, we’ll implement a function to set up a test fixture for our benchmark. The fixture needs a source range with some example values, a destination range for the result, and a transform function:
```cpp
auto setup_fixture(size_t n) {
  auto src = std::vector<float>(n);
  std::iota(src.begin(), src.end(), 1.0f);  // "src" goes from 1.0 to n
  auto dst = std::vector<float>(src.size());
  auto transform_func = [](float v) {
    auto sum = v;
    for (auto i = 0; i < 500; ++i) {
      sum += (i * i * i * sum);
    }
    return sum;
  };
  return std::tuple{src, dst, transform_func};
}
```
Now that we have our fixture set up, it's time to implement the actual benchmarks. There will be two versions: one for the sequential `std::transform()` and one for our parallel version, `par_transform_naive()`:
```cpp
void bm_sequential(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    std::transform(src.begin(), src.end(), dst.begin(), f);
  }
}

void bm_parallel(benchmark::State& state) {
  auto [src, dst, f] = setup_fixture(state.range(0));
  for (auto _ : state) {
    par_transform_naive(src.begin(), src.end(), dst.begin(), f);
  }
}
```
Only the code within the `for`-loops will be measured. By using `state.range(0)` as the input size, we can generate benchmarks for different sizes by appending a range of arguments to each benchmark. In fact, we need to specify a couple of arguments for each benchmark, so we create a helper function that applies all the settings we need:
```cpp
void CustomArguments(benchmark::internal::Benchmark* b) {
  b->Arg(50)->Arg(5000)->Arg(50'000)->Arg(100'000)->Arg(10'000'000)
      ->MeasureProcessCPUTime()
      ->UseRealTime()
      ->Unit(benchmark::kMillisecond);
}
```
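The helper can then be applied when registering the benchmarks. A minimal sketch, assuming the standard Google Benchmark registration macros (the lesson may register them elsewhere):

```cpp
// Register both benchmarks and apply the custom settings to each
BENCHMARK(bm_sequential)->Apply(CustomArguments);
BENCHMARK(bm_parallel)->Apply(CustomArguments);
BENCHMARK_MAIN();
```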
A few things to note about the custom arguments:
- We pass the values 50, 5,000, 50,000, 100,000, and 10,000,000 as arguments to the benchmark. They are used as the input size when creating the vectors in the `setup_fixture()` ...