reduce with SIMD
I claim `std::accumulate` gets disproportionately more attention than `std::reduce` and `std::transform_reduce`, but that’s just a papercut not worth sharing. Let’s look at their performance differences instead.
For the purpose of this post, I assume the reader is familiar with floating point math, especially that operations that are normally associative and commutative (addition and multiplication) are not so with finite precision math. The C++ standard is justified in demanding that a certain order of operations be obeyed. I furthermore assume that within the scope of this post, this doesn’t matter. I.e. when dealing with `a + b + c` we can assume that all three are of similar size, and while `(a + b) + c` will not be the same as `a + (b + c)` due to numeric noise, we don’t care - there’s noise either way and we have to deal with that. This means we don’t care about the order of operations in this post; we want to grant the compiler or library the liberty to do what’s most efficient.
Also, a bit of terminology: a function that takes two arguments is called “binary” (one that takes one argument “unary”, one that takes no arguments “nullary”). This doesn’t really matter much, but don’t be confused when the word “binary” appears in this post and doesn’t refer to binary code or compiled binaries.
Let’s also get the gcc command line argument `-fassociative-math` out of the way: we don’t want to throw that at our entire code base and cause unexpected (hopefully test-covered) changes, we don’t want to refactor the build to encapsulate our function into a single translation unit just to apply the flag (which might cause trouble with templates), and we even assume our math operation is a bit on the tricky side and has nested floating point operations, not all of which should be treated as associative.
accumulate just a few numbers
Without checking, I claim Matt Godbolt covered accumulating integers in Advent of Compiler Optimisation.
EDIT: I should’ve checked because Matt actually discusses the change from int to float on day 21.
In this snippet I use `std::accumulate` to sum 16 integers.
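A minimal sketch of such a snippet (the function name `sum16` is mine), assuming a fixed-size `std::array`:

```cpp
#include <array>
#include <numeric>

// Sum 16 ints; the size is known at compile time, so the optimiser
// can unroll the loop and build a reduction tree.
int sum16(std::array<int, 16> const& s) {
    return std::accumulate(s.begin(), s.end(), 0);
}
```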
In the assembly we can see 4 additions. Without tracing too much of what the code does, this should be the minimal number of required operations: first add 8 and 8 integers, then 4 and 4, then 2 and 2, then 1 and 1. There are no more independent operations.
When just changing the entire thing to `float`, the situation is different.
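Again as a sketch, the same snippet with the element type swapped:

```cpp
#include <array>
#include <numeric>

// Same as before, but with float: now the standard pins down the
// order of the 16 additions.
float sum16(std::array<float, 16> const& s) {
    return std::accumulate(s.begin(), s.end(), 0.f);
}
```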
This is 16 additions for just 16 floats. The `vaddss` instructions are a bit misleading: despite the `v` prefix these are scalar (VEX-encoded) additions, so we still end up with the maximal number of operations. (Actually, we know at compile time that the input range isn’t empty and still add `0.f` to the sum; we could call `std::accumulate(s.begin() + 1, s.end(), s[0])` and save one addition.)
reduce
Sadly, when swapping `accumulate` for `reduce`, the assembly becomes more complex. I suspect the compiler / library doesn’t manage to actually profit from the compile-time-known length of the input range.
I tried to run this example on quickbench, but quickbench doesn’t seem to have boost installed.
The idea is to take 2048 random floats, add them up, and benchmark `std::accumulate`, `std::reduce` with the `seq` execution policy, and `std::reduce` with the `unseq` execution policy.[^1] Originally I wrote this for `std::vector`, though I now also ran it with `std::array`.
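For reference, a sketch of what such a benchmark looks like with Google Benchmark (my reconstruction; the actual code is in the repo, and the data setup is simplified here):

```cpp
#include <benchmark/benchmark.h>
#include <execution>
#include <numeric>
#include <random>
#include <vector>

// 2048 random floats, generated once per benchmark.
static std::vector<float> make_data() {
    std::mt19937 gen{42};
    std::uniform_real_distribution<float> dist{0.f, 1.f};
    std::vector<float> v(2048);
    for (auto& x : v) x = dist(gen);
    return v;
}

static void add_accumulate(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::accumulate(data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_accumulate)->UseRealTime()->ThreadRange(1, 4);

static void add_reduce(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::reduce(std::execution::seq,
                                data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_reduce)->UseRealTime()->ThreadRange(1, 4);

static void add_reduce_unseq(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::reduce(std::execution::unseq,
                                data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_reduce_unseq)->UseRealTime()->ThreadRange(1, 4);

BENCHMARK_MAIN();
```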
The output looks something like this for both array and vector:
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
add_accumulate/real_time/threads:1 2229 ns 2228 ns 312562
add_accumulate/real_time/threads:2 2234 ns 2232 ns 310696
add_accumulate/real_time/threads:4 2226 ns 2225 ns 313700
add_reduce_unseq/real_time/threads:1 140 ns 140 ns 5006703
add_reduce_unseq/real_time/threads:2 139 ns 139 ns 5033388
add_reduce_unseq/real_time/threads:4 139 ns 139 ns 5051816
add_reduce/real_time/threads:1 554 ns 554 ns 1239091
add_reduce/real_time/threads:2 566 ns 566 ns 1237850
add_reduce/real_time/threads:4 571 ns 571 ns 1214188
I.e. `std::reduce` with `std::execution::unseq` is (with my compiler, standard library, and hardware) roughly sixteen times faster than `std::accumulate`, and `std::execution::unseq` still gains a factor of 4 over `std::execution::seq`.
constraining a compiler
I like to see this as an effect of the constraints that we impose on the compiler. With `std::accumulate`, the order of operations is set and the compiler is not allowed to deviate (unless the optimiser can be sure that the order of operations won’t change the result - as is the case for integers). With `std::reduce`, the optimiser can re-arrange the additions. From the point of view of someone who has worked a lot with SIMD vectorization: clearly the optimiser will use the vector units to run as many additions in parallel as fit the unit. More generally speaking, I see this as “fewer constraints means more options for optimizations, means more optimal code”.
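To illustrate what “re-arranging the additions” buys, here is a hand-written sketch of the shape the optimiser is allowed to produce under `reduce`: several independent partial sums, combined only at the end (assuming, for brevity, a size divisible by 8):

```cpp
#include <cstddef>
#include <vector>

// Eight independent accumulators mirror eight SIMD lanes: the eight
// additions per iteration don't depend on each other, so they map
// onto a single vector add.
float sum_in_lanes(std::vector<float> const& v) {
    float lanes[8] = {};
    for (std::size_t i = 0; i < v.size(); i += 8)
        for (std::size_t l = 0; l < 8; ++l)
            lanes[l] += v[i + l];
    float sum = 0.f;
    for (float lane : lanes) sum += lane;  // final horizontal reduction
    return sum;
}
```

`std::accumulate` forbids exactly this transformation for floats, because it changes the grouping of the additions.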
Can we do better?
(One nice thing about having worked at CERN is that talking publicly about my work was what I got paid for - going to conferences, publishing papers. And our code was open source - GPL v2 - so I can just pick headers from my day job back then and blog about them.)
I opened today’s blog post with additions because that’s what Matt used in aoco. My original benchmark used “find the minimum in a range of numbers”.
Instead of using 2048 floats, we can use 256 packed floats with 8 numbers each. I used the SIMDWrapper from LHCbMath, though I forked it pretty long ago. Despite the long header, there’s not too much magic happening. For the purpose of this post, the interesting bits are:
- The header is an abstraction across multiple SIMD instruction sets, such that multiple target architectures can be targeted.
- Each `SIMDWrapper::$namespace::float_v` is a struct around `__m256`, `__m128`, `float`, or whatever is deemed the correct packed float for the namespace.
- The types come with all of the regular math operations that one expects of floats (operators `+`, `-`, … functions like `friend sqrt`, `friend min` for argument-dependent lookup, …).
- The types can be assigned from `__m256` and such.
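As a rough sketch of the shape of such a wrapper (not the actual LHCbMath header, just an illustration assuming AVX):

```cpp
#include <immintrin.h>

// Minimal illustration of a float_v-style wrapper; the real header
// covers more instruction sets and many more operations.
namespace avx256 {
struct float_v {
    __m256 data;

    float_v() = default;
    float_v(__m256 d) : data{d} {}              // assignable from __m256
    float_v(float f) : data{_mm256_set1_ps(f)} {}

    friend float_v operator+(float_v a, float_v b) {
        return _mm256_add_ps(a.data, b.data);
    }
    friend float_v min(float_v a, float_v b) {  // found via ADL
        return _mm256_min_ps(a.data, b.data);
    }
};
}  // namespace avx256
```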
With this we can face a wall of code. We generate 2048 random floats, either as an array of 2048 floats or as 256 `__m256`. I pack this into a separate translation unit so the compiler can’t outsmart me and determine the random numbers at compile time.
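A sketch of that translation unit (my reconstruction; the actual code is in the repo):

```cpp
// data.cpp - its own translation unit, so the optimiser never sees
// the actual values.
#include <immintrin.h>
#include <array>
#include <cstddef>
#include <random>

std::array<float, 2048> random_floats() {
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<float> dist{0.f, 1.f};
    std::array<float, 2048> out;
    for (auto& x : out) x = dist(gen);
    return out;
}

std::array<__m256, 256> random_packed() {
    auto const flat = random_floats();
    std::array<__m256, 256> out;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = _mm256_loadu_ps(flat.data() + 8 * i);
    return out;
}
```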
Now the different “find minimum” implementations.
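They look roughly like this (again my reconstruction; the `float` versions are shown, the `__m256` ones are the same code with `float_v` and the ADL `min`):

```cpp
#include <algorithm>
#include <array>
#include <execution>
#include <numeric>

float min_accumulate(std::array<float, 2048> const& v) {
    return std::accumulate(v.begin() + 1, v.end(), v[0],
                           [](float a, float b) { return std::min(a, b); });
}

float min_reduce(std::array<float, 2048> const& v) {
    return std::reduce(std::execution::seq, v.begin() + 1, v.end(), v[0],
                       [](float a, float b) { return std::min(a, b); });
}

float min_reduce_unseq(std::array<float, 2048> const& v) {
    return std::reduce(std::execution::unseq, v.begin() + 1, v.end(), v[0],
                       [](float a, float b) { return std::min(a, b); });
}
```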
They are just what we had before: `std::accumulate` and `std::reduce` with `seq` and `unseq`. For the fun of it, one more where the workflow is transposed: we first do an `hmin` to find the minimal number within each `__m256` and then determine the minimum of the resulting 256 floats.
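Sketched, assuming an `hmin` helper along the lines of what the SIMDWrapper provides:

```cpp
#include <immintrin.h>
#include <algorithm>
#include <array>
#include <limits>

// Horizontal min of one __m256: min the two 128-bit halves, then
// shuffle-and-min down to a single lane.
float hmin(__m256 v) {
    __m128 m = _mm_min_ps(_mm256_castps256_ps128(v),
                          _mm256_extractf128_ps(v, 1));
    m = _mm_min_ps(m, _mm_movehl_ps(m, m));
    m = _mm_min_ss(m, _mm_shuffle_ps(m, m, 1));
    return _mm_cvtss_f32(m);
}

float min_transposed(std::array<__m256, 256> const& v) {
    float result = std::numeric_limits<float>::max();
    for (auto const& pack : v)
        result = std::min(result, hmin(pack));
    return result;
}
```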
And one using `std::min_element`. I’m torn: quite often I find `find`/`min`/`max` functions that only give me a certain element by value, but not its position in the container, rather useless. Here the situation is reversed: `std::min_element` gives us the minimal element, but the compiler needlessly keeps a reference around instead of just the value.
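Sketched:

```cpp
#include <algorithm>
#include <array>

float min_by_element(std::array<float, 2048> const& v) {
    // min_element returns an iterator; we only want the value.
    return *std::min_element(v.begin(), v.end());
}
```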
The results are
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
float_accumulate/real_time 2253 ns 2252 ns 310702
float_reduce_unseq/real_time 406 ns 406 ns 1722730
float_reduce/real_time 567 ns 567 ns 1237053
m256_accumulate/real_time 242 ns 242 ns 2896859
m256_reduce_unseq/real_time 119 ns 119 ns 5862106
m256_reduce/real_time 58.8 ns 58.8 ns 11631224
transposed/real_time 242 ns 242 ns 2874079
min_element/real_time 2219 ns 2219 ns 316111
With `float`, the situation is similar-ish to the initial example (`accumulate` slowest, then `execution::seq`, then `execution::unseq`). Using `__m256`, the situation is different: `accumulate` is still slowest, but then `execution::seq` is faster than `execution::unseq`. I don’t like that … but it is as it is. It seems like the compiler and standard library are not magic “best possible code” machines, and relaxing the `seq` constraint to `unseq` incurs a slowdown. My explanation is that vectorization of multiple binary operations fails when the binary operation itself is non-vectorizable, e.g. because it already uses the vector unit. A then-futile optimisation that interleaves or re-arranges operations slows the computation down.
What is still worth pointing out:
- All `__m256` versions are faster than all `float` versions, even though the data layout should be exactly the same, and the `float` `reduce` options should’ve been able to yield the same code as any of the `__m256` versions.
- The `transposed` version is about as fast as the `__m256` `accumulate`.
The repo with the code is on codeberg.
potential followup
This should be a nice segue into horizontal vs. vertical vectorisation. It remains to be seen whether I take the time to work on a followup.
PS
- As you noticed, I didn’t look too much at the assembly. It truly seems to me like there’s something suboptimal in my compiler / standard library.
- If you look at the repo, you’ll notice I have old results from my previous laptop. Of course I should’ve written down which standard library version and compiler these were, and I should’ve given better names. And it does look like I had benchmarks where `accumulate` was the best.
[^1]: As pointed out by Billy O’Neal, `reduce` always assumes associativity of the binary operation. `unseq` and `seq` only differ in whether the binary operation can be interleaved with itself. See cppreference.com.