SIMD with more than one argument
Let’s look at 2025’s Advent of Code Day 8 Puzzle. We have a list of points in 3D and need to compute all pairwise distances1.
Contrary to the AoC puzzle, I’ll start out in 1D with floating point coordinates. We need something like
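```cpp
// a sketch; dist and one_level_higher_in_the_call_stack stand in
// for whatever the real program does with each pair
#include <cstddef>
#include <span>

float dist(float, float);                        // defined further below
void one_level_higher_in_the_call_stack(float);  // consumes each distance

void all_pairwise_distances(std::span<const float> xs) {
    for (std::size_t one = 0; one < xs.size(); ++one) {
        for (std::size_t two = one + 1; two < xs.size(); ++two) {
            const float distance = dist(xs[one], xs[two]);
            one_level_higher_in_the_call_stack(distance);
        }
    }
}
```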
Before your brain wanders off:
- Range-based loops would probably be nicer, but I’m not aware of a standard library equivalent of combinations.
- For optimization purposes, it’s not obvious to me in which order the 2D space of `one` and `two` should be traversed. Let’s hope for now that the optimizer can do what it thinks is best.
- The signature of `one_level_higher_in_the_call_stack` and the handling of the `distance` return value also have a large impact on how optimal the code is, but the draft for this post is already way too long and too all-over-the-place.
- Actually, this is way more complicated than what I want to discuss now.
Let’s consider only the inner loop
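```cpp
// the inner loop from above: xs[one] is the same for all iterations,
// some_buffer is a stand-in for whatever receives the results
for (std::size_t two = one + 1; two < xs.size(); ++two) {
    some_buffer[two] = dist(xs[one], xs[two]);
}
```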
This is not the most common pattern of a vectorised function call. Of course, when the definition of `dist` and that loop are in the same translation unit, the optimiser will inline `dist` and then read chunks of 2, 4, 8 or more floats from `xs` and write chunks of these vector sizes to `some_buffer`. But if `dist` were too complex to inline or even in a different translation unit, it’s less obvious what will happen.
Consider my previous post about leveraging the auto-vectoriser with symbols in different translation units.
With a dist definition like this (compiler-explorer link)
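```cpp
// dist.cpp -- a squared 1D distance (a sketch, see footnote 1);
// the attribute asks gcc to emit vector variants next to the scalar symbol
__attribute__((const, simd("notinbranch")))
float dist(float x_1, float x_2) {
    const float d = x_1 - x_2;
    return d * d;
}
```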
the following symbols are generated.
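Reconstructed from the x86-64 vector ABI mangling rules (the exact set depends on compiler version and flags):

```
_Z4distff              scalar float dist(float, float)
_ZGVbN4vv__Z4distff    SSE,     4 floats per call
_ZGVcN8vv__Z4distff    AVX,     8 floats per call
_ZGVdN8vv__Z4distff    AVX2,    8 floats per call
_ZGVeN16vv__Z4distff   AVX512, 16 floats per call
```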
Here, both x_1 and x_2 are packed floats. They are good for a call v[i] = dist(x_1[i], x_2[i]) with different x_1 and x_2 input ranges.
We can build a call site like that (compiler-explorer link)
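```cpp
// caller.cpp -- only the declaration; the definition lives in dist.cpp;
// a sketch, compiled with optimizations (e.g. -O3)
#include <cstddef>

__attribute__((const, simd("notinbranch")))
float dist(float x_1, float x_2);

void call_site(float x_1, const float* x_2, float* v, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        v[i] = dist(x_1, x_2[i]);  // scalar first argument, packed second
}
```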
and the vectorized version gets called.
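Schematically, the hot loop then contains something like this (Intel syntax; registers and addressing are made up, the broadcast and the call are the point):

```asm
vbroadcastss ymm0, xmm0            ; broadcast the scalar x_1 into all 8 lanes
vmovups      ymm1, [rsi + rax]     ; load 8 packed x_2 values
call         _ZGVdN8vv__Z4distff   ; the two-packed-arguments variant
vmovups      [rdx + rax], ymm0     ; store the 8 results
```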
Neat! We need v[i] = dist(x_1, x_2[i]) at the call site (one scalar and one
packed argument), the implementation provided only a version with two packed
arguments, and the compiler used a broadcast on the call site to match the
interface.
And I think this even has a certain appeal. Chances are that any fun(scalar, vector[i]) function actually needs to broadcast the scalar internally
for the low-level binary (binary = taking two arguments) instructions. With the broadcast on the call site, we could avoid duplicated work if the scalar argument got passed to multiple functions:
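```cpp
// a sketch with hypothetical fun_a/fun_b that only come in a
// two-packed-arguments flavour
#include <immintrin.h>
#include <cstddef>

__m256 fun_a(__m256, __m256);
__m256 fun_b(__m256, __m256);

void call_both(float x, const float* ys, const float* zs,
               float* v, float* w, std::size_t n) {
    const __m256 x_packed = _mm256_set1_ps(x);  // broadcast once ...
    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        const __m256 y = _mm256_loadu_ps(ys + i);
        const __m256 z = _mm256_loadu_ps(zs + i);
        _mm256_storeu_ps(v + i, fun_a(x_packed, y));  // ... reuse it here
        _mm256_storeu_ps(w + i, fun_b(x_packed, z));  // ... and here
    }
}
```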
Admittedly, the broadcast is probably cheap. The more general point is that one can consider exposing the internally needed types in a function interface, such that the caller sees them and back-and-forth conversions can be avoided overall.
possible applications
Cool, so we have a function `float fun(float, float)`, and a vectorised version `__m256 fun(__m256, __m256)` can be used for the three use cases

- `v[i] = fun(x[i], y[i])`
- `v[i] = fun(x, y[i])`
- `v[i] = fun(x[i], y)`

by just broadcasting the scalar argument.
limitations
At the time of writing, I haven’t come up with an example, but I could imagine that a dedicated scalar+vector argument function could be implemented more optimally than a broadcast-scalar+vector implementation.
And here I went down a rabbit hole. With `__attribute__((const, simd("notinbranch")))`, I cannot specify which of the arguments should be
packed and which should be scalar. Is this a limitation of the gcc
attribute
OR is this a limitation of the x86 ABI that the gcc attribute exposes?
closer look at the ABI
The ABI documentation actually answers that question more helpfully than I expected! With examples - which highlight that I should’ve looked at OpenMP:
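```cpp
// same squared-distance sketch as before; uniform(x_1) marks the
// first argument as a scalar shared across all lanes
#pragma omp declare simd uniform(x_1)
float dist(float x_1, float x_2) {
    const float d = x_1 - x_2;
    return d * d;
}
```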
gives the following symbols
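Again reconstructed from the mangling rules (M = masked, N = unmasked):

```
_ZGVbM4uv__Z4distff    _ZGVbN4uv__Z4distff    SSE
_ZGVcM8uv__Z4distff    _ZGVcN8uv__Z4distff    AVX
_ZGVdM8uv__Z4distff    _ZGVdN8uv__Z4distff    AVX2
_ZGVeM16uv__Z4distff   _ZGVeN16uv__Z4distff   AVX512
```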
Compared to before, we have twice as many symbols, but only because the compiler generated masked and unmasked versions. The other difference is that we have `8uv` instead of `8vv` at the end of the vectorisation specification, meaning that the first argument is uniform.
Cool, should’ve discovered that earlier2.
More generalization
Going back, closer to the original AoC puzzle though, we have two argument ranges with different indices:
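```cpp
// quasi-code: both arguments are indexed out of the same range,
// but with different indices
for (std::size_t i = 0; i < xs.size(); ++i)
    for (std::size_t j = i + 1; j < xs.size(); ++j)
        v[i, j] = dist(xs[i], xs[j]);
```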
I know mdspan exists, but please pretend this was just quasi-code, the memory
layout of the return is a bit distracting. Let’s pretend also that the two
arguments might not come from the same array (after all, for sufficiently
different i and j, they are just different memory addresses):
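```cpp
// quasi-code: two independent input ranges
for (std::size_t i = 0; i < xs.size(); ++i)
    for (std::size_t j = 0; j < ys.size(); ++j)
        v[i, j] = dist(xs[i], ys[j]);
```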
In terms of linear algebra, this reminds me of the question “what is vector times vector?”. Whereas vector-transposed times vector is a scalar, vector times vector-transposed is a matrix, and in numerics software one often needs element-wise products, such that vector times vector is another vector.
Similarly, when I ask for a vectorised version of `float fun(float, float)`, I am thinking of the possibilities

- `v[i] = fun(x[i], y[i])` (what the gcc attribute generates)
- `v[i] = fun(x, y[i])`
- `v[i] = fun(x[i], y)`
- `v[i,j] = fun(x[i], y[j])`
Though the similarity isn’t great. The linear algebra scalar product needs another reduce somewhere.
It also came to my mind that in differential geometry and/or general relativity, a confusion between covariant and contravariant indices has similar ambiguity:
- Σ_i x[i] y[i] (contracting over one shared index), or
- x[i] y[j] (keeping both indices free).
EDIT: a section that fell to the cutting room floor
With the above ingredients we can provide these three functions to the vectoriser:
- `v[i] = fun(x[i], y[i])` (OpenMP pragma or gcc attribute)
- `v[i] = fun(x, y[i])` (OpenMP pragma or broadcast + gcc attribute)
- `v[i] = fun(x[i], y)` (OpenMP pragma or broadcast + gcc attribute)
And a v[i,j] = fun(x[i], y[j]) call could be done with nested loops
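```cpp
// a sketch: within the inner loop, x[i] is uniform and y[j] is packed,
// so the "uniform first argument" variant of fun applies
for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < m; ++j)
        v[i][j] = fun(x[i], y[j]);
```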
Now I’m making the assumption that the optimizer sees through our gymnastics,
recognizes “this is a 2D loop”, and reorders as it sees fit. For that, it
would be nice to know if either of the two half-vectorised fun
implementations is more or less efficient than the other to include that in the
optimizer’s cost function. There is no way the compiler of the fun definition
can pass that information to the compiler of the call site. The compiler of the
call site could do both versions and let LTO figure it out??? Or the programmer
could add that information to the declaration (would require the programmer to
know though).
In either case, the output format is probably the bigger concern.
END EDIT
JUMP: where I had encountered this issue before
This ambiguity has bitten us during my time at CERN.
We had a function that was vectorized. Not with any of the techniques above. The function was templated, where one of the arguments could be a scalar or packed float, depending on the instantiation, compiler flags, …
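```cpp
// strongly simplified sketch; FloatOrPacked is float or a packed type,
// and the framework supplies the mixed scalar/packed operator overloads
template <typename FloatOrPacked>
FloatOrPacked function_name(float x, FloatOrPacked y) {
    return x * y + y;  // placeholder for the real computation
}
```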
This was for the computation of v[i] = function_name(x, y[i]). Now I had made
the fateful change that would allow vectorisation in the other dimension, v[i] = function_name(x[i], y) (and it was intended only for that other dimension).
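```cpp
// after the change (again a sketch): both arguments are templated
// independently, so each can be a scalar or a packed type
template <typename FloatOrPacked1, typename FloatOrPacked2>
auto function_name(FloatOrPacked1 x, FloatOrPacked2 y) {
    return x * y + y;  // placeholder for the real computation
}
```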
And thanks to our SIMD framework, all combinations of scalar and vector arguments for basic math binary operations were implemented, so that the template definition could be instantiated for all three combinations:

- function_name(x, y[i])
- function_name(x[i], y)
- function_name(x[i], y[i])
Intended application
Strongly simplifying, at the place in the code where I was working, the reconstruction software had already found a bunch of points in 3D and a bunch of lines in 3D.3 And for some applications we needed “for this given line, compute the distance to all points, and return the minimal one. Do this for all lines. We want the minimal distance for every line”.4 (These distances are also called “impact parameters”).
Others needed “for this given point, compute the distances of all lines. And do that for all points. We want for each point the close-by lines”.
where things went wrong
Ofc, the call sites did implement 2D loops:
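```cpp
// schematic; process stands in for the min-reduction etc.
for (const auto& line : lines)
    for (const auto& point : points)
        process(distance(line, point));
```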
Though, let’s rewrite that:
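```cpp
// the same loops with the iterators spelled out
for (auto line = lines.begin(); line != lines.end(); ++line)
    for (auto point = points.begin(); point != points.end(); ++point)
        process(distance(*line, *point));
```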
The innocent issue is hidden in the fact that lines and points were written
for SIMD application, and begin() and end() gave iterators with stride
length 2, 4, 8, …, with x, y, z being __m256 (or similar) data types to
represent 2, 4, 8, … lines or points.
Before my change, points could only do stride length of one. After my change,
both lines and points had stride length of 2, 4, 8, … who really remembers,
let’s say stride length 8.
And as a result, we computed for line 0 the distances to points 0, 8, 16, 24, 32, 40, …. For line 1, we computed the distance to points 1, 9, 17, 25, 33, 41, ….
Instead of filling a matrix with all the line-point distances, we only filled a matrix composed of 8-by-8 diagonal matrices.
Unit tests to the rescue
While build-time tests completely missed that I fucked up the dimensions of the data, unit tests quickly spotted that the impact parameter computation was far off. (Looking at line 0, the minimal IP was now not the minimum over distances to points 0, 1, 2, 3, 4, … but only over 0, 8, 16, … and thus way larger.)
Conclusion
None so far. It seems to me that it was a flaw of the framework that it lets one
build the API of dist in a way that vectorises correctly over either argument
but incorrectly - unintendedly - over both. The fact that the arguments are
templated obfuscates the matter further. Of course we could’ve added a SFINAE
check that prevents double-vectorisation, but what I would’ve really wanted is
a way to say “these two use different indices”.
But even if we had had something for that, one hits the issue of output types.
When I want to implement v[i,j] = fun(x[i], y[j]), then the memory layout of
the output is far from trivial - whether it’s row-major or column-major, the
stride between columns or rows is unknown and the output memory has gaps.
And it all makes me beg for “can’t the compiler’s optimizer figure this out for me?”.
There already is an unseq version of some standard library functions,
including for_each. But I don’t see how to do nested loops / 2D loops with it
and let the compiler decide “you figure out which should be the inner loop”.
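The 1D case maps nicely onto such a policy, sketched here with transform instead of for_each:

```cpp
#include <algorithm>
#include <execution>
#include <span>

float dist(float, float);  // as before

void one_dimension(float x_1, std::span<const float> x_2, std::span<float> v) {
    // unsequenced policy: the implementation may vectorise this loop
    std::transform(std::execution::unseq, x_2.begin(), x_2.end(), v.begin(),
                   [x_1](float b) { return dist(x_1, b); });
}
// but there is no obvious way to hand over two nested loops and say
// "you pick which one becomes the inner, vectorised one"
```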
Not mentioned
Above, I skip over the part where we go from a single float argument to a line in 3D. We did quite some work in that part, and I should say “they” instead of “we”, because truly the memory management for AVX-512 by Arthur Hennequin was pretty impressive, and then there were the PrZip SIMD iterators (not even sure by whom). I also think Manuel’s SOA Container library could’ve been mentioned. Oh, and by now there’s SIMD support in the C++ standard (or is it only a proposal?), I haven’t looked at it.
More importantly, one may ask if it’s actually worth implementing something similar to that:
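```cpp
// hypothetical dist3d, vectorised across point pairs (one pair per lane)
#pragma omp declare simd
float dist3d(float x1, float y1, float z1,
             float x2, float y2, float z2);
```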
Because it might actually turn out that already a scalar function
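```cpp
// a sketch with a hypothetical Point type; squared distance as in footnote 1
struct Point { float x, y, z; };

float dist3d(Point a, Point b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}
```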
would make use of vector instructions, by putting x, y, z into a packed float.
comments
discuss this post on reddit if you like
EDIT history
- I re-added a paragraph on the cost function for optimizing 2D loops.
- azswcowboy spotted that I had pasted the wrong link to combinations.
1. Or at least all pairwise distances squared. It’s sufficient to compute Δx^2+Δy^2+Δz^2; no need to take the square root. Maybe one doesn’t even need to compute all distances squared. I haven’t tried to solve it yet and haven’t checked if there’s a way to be more efficient and avoid many computations. ↩︎
2. Turns out, one can even add two of these pragmas to a function, `#pragma omp declare simd uniform(x_1)` and `#pragma omp declare simd uniform(x_2)`, and it will generate `v[i] = fun(x_1, x_2[i])` and `v[i] = fun(x_1[i], x_2)`. ↩︎
3. We’re also talking about points and lines with uncertain coordinates, such that the geometric calculations might or might not have to take the uncertainties into account and compute a result with uncertainties, and between all the input quantities there might be correlated uncertainties … Needless to say, this sounds like it’s doable with highschool math, but it isn’t, and it requires a decent bit of code. ↩︎
4. For the expert: we’re looking for B decay products; these are identified - not perfectly but pretty well - by having a non-zero impact parameter (IP) with respect to all primary vertices. So starting from a sample of tracks, the ones with large IP are the interesting ones. ↩︎