reduce with SIMD
I claim `std::accumulate` gets disproportionately more attention than `std::reduce` and `std::transform_reduce`, but that’s just a papercut not worth sharing. Let’s look at their performance differences instead.
For the purpose of this post, I assume the reader is familiar with floating point math, especially that operations that are normally associative and commutative (addition and multiplication) are not so with finite precision math. The C++ standard is justified in demanding that a certain order of operations be obeyed. I furthermore assume that within the scope of this post, this doesn’t matter. I.e. when dealing with `a + b + c` we can assume that all three are of similar size, and while `(a + b) + c` will not be the same as `a + (b + c)` due to numeric noise, we don’t care - there’s noise either way and we have to deal with that. This means we don’t care about the order of operations in this post; we want to grant the compiler or library the liberty to do what’s most efficient.
Also, a bit of terminology: a function that takes two arguments is called “binary” (one that takes one argument “unary”, one that takes no arguments “nullary”). This doesn’t really matter much, but don’t be confused when the word “binary” appears in this post and doesn’t refer to binary code or compiled binaries.
Let’s also get the gcc command line argument `-fassociative-math` out of the way: we don’t want to throw that at our entire code base and cause unexpected (hopefully test-covered) changes, we don’t want to refactor the build to encapsulate our function into a single translation unit just to apply the flag (which might cause trouble with templates), and we even assume our math operation is a bit on the tricky side and has nested floating point operations, not all of which should be treated as associative.
accumulate just a few numbers
Without checking, I claim Matt Godbolt covered accumulating integers in Advent of Compiler Optimisation.
EDIT: I should’ve checked because Matt actually discusses the change from int to float on day 21.
In this snippet I use `std::accumulate` to sum 16 integers.
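A minimal sketch of such a snippet (the function name `sum16` is mine), assuming a fixed-size `std::array`:

```cpp
#include <array>
#include <numeric>

// Sum 16 ints; the size is known at compile time, so the optimiser
// can unroll the loop and build a reduction tree.
int sum16(std::array<int, 16> const& s) {
    return std::accumulate(s.begin(), s.end(), 0);
}
```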
In the assembly we can see 4 additions. Without tracing too much of what the code does, this should be the minimal number of required operations: first add 8 and 8 integers, then 4 and 4, then 2 and 2, then 1 and 1. There are no more independent operations.
When just changing the entire thing to `float`, the situation is different.
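Again as a sketch, the same snippet with the element type swapped:

```cpp
#include <array>
#include <numeric>

// Same as before, but with float: now the standard pins down the
// order of the 16 additions.
float sum16(std::array<float, 16> const& s) {
    return std::accumulate(s.begin(), s.end(), 0.f);
}
```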
This is 16 additions for just 16 floats. The `vaddss` instructions are a bit misleading: despite the `v` prefix these are scalar (VEX-encoded) additions, so we still end up with the maximal number of operations. (Actually, we know at compile time that the input range isn’t empty and still add `0.f` to the sum; we could call `std::accumulate(s.begin() + 1, s.end(), s[0])` and save one addition.)
reduce
Sadly, when swapping `accumulate` for `reduce`, the assembly becomes more complex. I suspect the compiler / library doesn’t manage to actually profit from the compile-time-known length of the input range.
I tried to run this example on quickbench, but quickbench doesn’t seem to have boost installed.
The idea is to take 2048 random floats, add them up, and benchmark `std::accumulate`, `std::reduce` with the `seq` execution policy, and `std::reduce` with the `unseq` execution policy.[^1] Originally I wrote this for `std::vector`, though I now also ran it with `std::array`.
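For reference, a sketch of what such a benchmark looks like with Google Benchmark (my reconstruction; the actual code is in the repo, and the data setup is simplified here):

```cpp
#include <benchmark/benchmark.h>
#include <execution>
#include <numeric>
#include <random>
#include <vector>

// 2048 random floats, generated once per benchmark.
static std::vector<float> make_data() {
    std::mt19937 gen{42};
    std::uniform_real_distribution<float> dist{0.f, 1.f};
    std::vector<float> v(2048);
    for (auto& x : v) x = dist(gen);
    return v;
}

static void add_accumulate(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::accumulate(data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_accumulate)->UseRealTime()->ThreadRange(1, 4);

static void add_reduce(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::reduce(std::execution::seq,
                                data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_reduce)->UseRealTime()->ThreadRange(1, 4);

static void add_reduce_unseq(benchmark::State& state) {
    auto const data = make_data();
    for (auto _ : state) {
        float sum = std::reduce(std::execution::unseq,
                                data.begin(), data.end(), 0.f);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(add_reduce_unseq)->UseRealTime()->ThreadRange(1, 4);

BENCHMARK_MAIN();
```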
The output looks something like this for both array and vector:
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
add_accumulate/real_time/threads:1 2229 ns 2228 ns 312562
add_accumulate/real_time/threads:2 2234 ns 2232 ns 310696
add_accumulate/real_time/threads:4 2226 ns 2225 ns 313700
add_reduce_unseq/real_time/threads:1 140 ns 140 ns 5006703
add_reduce_unseq/real_time/threads:2 139 ns 139 ns 5033388
add_reduce_unseq/real_time/threads:4 139 ns 139 ns 5051816
add_reduce/real_time/threads:1 554 ns 554 ns 1239091
add_reduce/real_time/threads:2 566 ns 566 ns 1237850
add_reduce/real_time/threads:4 571 ns 571 ns 1214188
I.e. `std::reduce` with `std::execution::unseq` is (with my compiler, standard library, and hardware) roughly sixteen times faster than `std::accumulate`, and `std::execution::unseq` still gains a factor of 4 over `std::execution::seq`.
constraining a compiler
I like to see this as an effect of the constraints that we impose on the compiler. With `std::accumulate`, the order of operations is set and the compiler is not allowed to deviate (unless the optimiser can be sure that the order of operations won’t change the result - as is the case for integers). With `std::reduce`, the optimiser can re-arrange the additions. From the point of view of someone who has worked a lot with SIMD vectorization: clearly the optimiser will use the vector units to run as many additions in parallel as fit the unit. More generally speaking, I see this as “fewer constraints means more options for optimizations, means more optimal code”.
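To illustrate what “re-arranging the additions” buys, here is a hand-written sketch of the shape the optimiser is allowed to produce under `reduce`: several independent partial sums, combined only at the end (assuming, for brevity, a size divisible by 8):

```cpp
#include <cstddef>
#include <vector>

// Eight independent accumulators mirror eight SIMD lanes: the eight
// additions per iteration don't depend on each other, so they map
// onto a single vector add.
float sum_in_lanes(std::vector<float> const& v) {
    float lanes[8] = {};
    for (std::size_t i = 0; i < v.size(); i += 8)
        for (std::size_t l = 0; l < 8; ++l)
            lanes[l] += v[i + l];
    float sum = 0.f;
    for (float lane : lanes) sum += lane;  // final horizontal reduction
    return sum;
}
```

`std::accumulate` forbids exactly this transformation for floats, because it changes the grouping of the additions.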
Can we do better?
(One nice thing about having worked at CERN is that talking publicly about my work was what I got paid for - going to conferences, publishing papers. And our code was open source - GPL v2 - so I can just pick headers from my day job back then and blog about them.)
I opened today’s blog post with additions because that’s what Matt used in aoco. My original benchmark used “find the minimum in a range of numbers”.
Instead of using 2048 floats, we can use 256 packed floats with 8 numbers each. I used the SIMDWrapper from LHCbMath, though I forked it pretty long ago. Despite the long header, there’s not too much magic happening. For the purpose of this post, the interesting bits are:
- The header is an abstraction across multiple SIMD instruction sets, such that multiple target architectures can be targeted.
- Each `SIMDWrapper::$namespace::float_v` is a struct around `__m256`, `__m128`, `float`, or whatever is deemed the correct packed float for the namespace.
- The types come with all of the regular math operations that one expects of floats (operators `+`, `-`, … functions like `friend sqrt`, `friend min` for argument-dependent lookup, …).
- The types can be assigned from `__m256` and such.
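As a rough sketch of the shape of such a wrapper (not the actual LHCbMath header, just an illustration assuming AVX):

```cpp
#include <immintrin.h>

// Minimal illustration of a float_v-style wrapper; the real header
// covers more instruction sets and many more operations.
namespace avx256 {
struct float_v {
    __m256 data;

    float_v() = default;
    float_v(__m256 d) : data{d} {}              // assignable from __m256
    float_v(float f) : data{_mm256_set1_ps(f)} {}

    friend float_v operator+(float_v a, float_v b) {
        return _mm256_add_ps(a.data, b.data);
    }
    friend float_v min(float_v a, float_v b) {  // found via ADL
        return _mm256_min_ps(a.data, b.data);
    }
};
}  // namespace avx256
```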
With this we can face a wall of code. We generate 2048 random floats, either as an array of 2048 floats or as 256 `__m256`. I pack this into a separate translation unit so the compiler can’t outsmart me and determine the random numbers at compile time.
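A sketch of that translation unit (my reconstruction; the actual code is in the repo):

```cpp
// data.cpp - its own translation unit, so the optimiser never sees
// the actual values.
#include <immintrin.h>
#include <array>
#include <cstddef>
#include <random>

std::array<float, 2048> random_floats() {
    std::mt19937 gen{std::random_device{}()};
    std::uniform_real_distribution<float> dist{0.f, 1.f};
    std::array<float, 2048> out;
    for (auto& x : out) x = dist(gen);
    return out;
}

std::array<__m256, 256> random_packed() {
    auto const flat = random_floats();
    std::array<__m256, 256> out;
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = _mm256_loadu_ps(flat.data() + 8 * i);
    return out;
}
```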
Now the different “find minimum” implementations.
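They look roughly like this (again my reconstruction; the `float` versions are shown, the `__m256` ones are the same code with `float_v` and the ADL `min`):

```cpp
#include <algorithm>
#include <array>
#include <execution>
#include <numeric>

float min_accumulate(std::array<float, 2048> const& v) {
    return std::accumulate(v.begin() + 1, v.end(), v[0],
                           [](float a, float b) { return std::min(a, b); });
}

float min_reduce(std::array<float, 2048> const& v) {
    return std::reduce(std::execution::seq, v.begin() + 1, v.end(), v[0],
                       [](float a, float b) { return std::min(a, b); });
}

float min_reduce_unseq(std::array<float, 2048> const& v) {
    return std::reduce(std::execution::unseq, v.begin() + 1, v.end(), v[0],
                       [](float a, float b) { return std::min(a, b); });
}
```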
They are just what we had before: `std::accumulate` and `std::reduce` with `seq` and `unseq`. For the fun of it, one more where the workflow is transposed: we first do an `hmin` to find the minimal number within each `__m256` and then determine the minimum of the resulting 256 floats.
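Sketched, assuming an `hmin` helper along the lines of what the SIMDWrapper provides:

```cpp
#include <immintrin.h>
#include <algorithm>
#include <array>
#include <limits>

// Horizontal min of one __m256: min the two 128-bit halves, then
// shuffle-and-min down to a single lane.
float hmin(__m256 v) {
    __m128 m = _mm_min_ps(_mm256_castps256_ps128(v),
                          _mm256_extractf128_ps(v, 1));
    m = _mm_min_ps(m, _mm_movehl_ps(m, m));
    m = _mm_min_ss(m, _mm_shuffle_ps(m, m, 1));
    return _mm_cvtss_f32(m);
}

float min_transposed(std::array<__m256, 256> const& v) {
    float result = std::numeric_limits<float>::max();
    for (auto const& pack : v)
        result = std::min(result, hmin(pack));
    return result;
}
```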
And one using `std::min_element`. I’m torn: quite often I find `find`/`min`/`max` functions that only give me a certain element by value, but not its position in the container, rather useless. Here the situation is reversed: `std::min_element` gives us the minimal element, but the compiler needlessly keeps a reference around instead of just the value.
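Sketched:

```cpp
#include <algorithm>
#include <array>

float min_by_element(std::array<float, 2048> const& v) {
    // min_element returns an iterator; we only want the value.
    return *std::min_element(v.begin(), v.end());
}
```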
The results are
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
float_accumulate/real_time 2253 ns 2252 ns 310702
float_reduce_unseq/real_time 406 ns 406 ns 1722730
float_reduce/real_time 567 ns 567 ns 1237053
m256_accumulate/real_time 242 ns 242 ns 2896859
m256_reduce_unseq/real_time 119 ns 119 ns 5862106
m256_reduce/real_time 58.8 ns 58.8 ns 11631224
transposed/real_time 242 ns 242 ns 2874079
min_element/real_time 2219 ns 2219 ns 316111
With `float`, the situation is similar-ish to the initial example (`accumulate` slowest, then `execution::seq`, then `execution::unseq`). Using `__m256`, the situation is different: `accumulate` is still slowest, but then `execution::seq` is faster than `execution::unseq`. I don’t like that … but it is as it is. It seems like the compiler and standard library are not magic “best possible code” machines, and relaxing the `seq` constraint to `unseq` incurs a slowdown. My explanation is that vectorization of multiple binary operations fails when the binary operation itself is non-vectorizable, e.g. because it already uses the vector unit. A then-futile optimisation that interleaves or re-arranges operations slows the computation down.
What is still worth pointing out:
- All `__m256` versions are faster than all `float` versions, even though the data layout should be exactly the same, and the `float` `reduce` options should’ve been able to yield the same code as any of the `__m256` versions.
- The `transposed` version is about as fast as the `__m256` `accumulate`.
The repo with the code is on codeberg.
potential followup
This should be a nice segue into horizontal vs. vertical vectorisation. It remains to be seen whether I take the time to work on a followup.
PS
- As you noticed, I didn’t look too much at the assembly. It truly seems to me like there’s something suboptimal in my compiler / standard library.
- If you look at the repo, you’ll notice I have old results from my previous laptop. Of course I should’ve written down which standard library version and compiler these were, and I should’ve given better names. And it does look like I had benchmarks where `accumulate` was the best.
[^1]: As pointed out by Billy O’Neal, `reduce` always assumes associativity of the binary operation. `unseq` and `seq` only differ in whether the binary operation can be interleaved with itself. See cppreference.com.