Yet another blog restart

31 Dec 2025

SIMD with more than one argument

Let’s look at 2025’s Advent of Code Day 8 puzzle. We have a list of points in 3D and need to compute all pairwise distances[1].

Unlike the AoC puzzle, I’ll start out in 1D with floating-point coordinates. We need something like

float dist(float x1, float x2) {
  return /* some math here */;
}

auto one_level_higher_in_the_call_stack(std::span<float> xs) {
  for (std::size_t one = 0; one < xs.size(); one++) {
    for (std::size_t two = 0; two < one; two++) {
      auto distance = dist(xs[one], xs[two]);
      /* do something with the return value */
    }
  }
  return /* something useful */;
}

Before your brain wanders off, let’s consider only the inner loop:

auto x_one = /* maybe from the outer loop, maybe somewhere else */;
for ( /* doesn't really matter */ ) {
  some_buffer[i] = dist(x_one, xs[i]);
}

This is not the most common pattern for a vectorised function call. Of course, when the definition of dist and that loop are in the same translation unit, the optimiser will inline dist and then read chunks of 2, 4, 8 or more floats from xs and write chunks of the same vector size to some_buffer. But if dist is too complex to inline, or even lives in a different translation unit, it’s less obvious what will happen.

Consider my previous post about leveraging the auto-vectoriser with symbols in different translation units.

With a dist definition like this (compiler-explorer link)

__attribute__((const, simd("notinbranch")))
float dist(float x_1, float x_2) {
  // some math here
}

we get the following symbols:

_Z4distff
_ZGVbN4vv__Z4distff
_ZGVcN8vv__Z4distff
_ZGVdN8vv__Z4distff
_ZGVeN16vv__Z4distff

Here, both x_1 and x_2 are packed floats. They are good for a call v[i] = dist(x_1[i], x_2[i]) with different x_1 and x_2 input ranges. We can build a call site like that (compiler-explorer link)

for (std::size_t i = 0; i < input.size(); ++i) {
  output[i] = dist(x_1, input[i]);
}

and the vectorized version gets called.

        vbroadcastss    ymm2, xmm0
        push    QWORD PTR [r10-8]
        vmovaps ymm0, ymm2
        push    rbp
        mov     rbp, rsp
        push    r12
        mov     r12, rsi
        push    r10
        push    rbx
        mov     rbx, rdi
        sub     rsp, 56
        vmovaps YMMWORD PTR [rbp-80], ymm2
        vmovups ymm1, YMMWORD PTR [rsi]
        call    _ZGVdN8vv__Z4distff

Neat! We need v[i] = dist(x_1, x_2[i]) at the call site (one scalar and one packed argument), the implementation provided only a version with two packed arguments, and the compiler used a broadcast on the call site to match the interface.

And I think this even has a certain appeal. Chances are that any fun(scalar, vector[i]) function actually needs to broadcast the scalar internally for the low-level binary (binary = taking two arguments) instructions. With the broadcast at the call site, we can avoid duplicated work if the scalar argument is passed to multiple functions:

// don't
__m256 x = /* ... */;
float y = /* ... */;
__m256 v = some_fun_1(y, x); // calls _mm256_broadcast_ss(&y) internally
__m256 z = some_fun_2(y, v); // calls _mm256_broadcast_ss(&y) internally, again
// total of two broadcasts
// instead do
__m256 x = /* ... */;
float y = /* ... */;
__m256 y_transformed = _mm256_broadcast_ss(&y);
__m256 v = some_fun_1(y_transformed, x); // no call to _mm256_broadcast_ss internally
__m256 z = some_fun_2(y_transformed, v); // no call to _mm256_broadcast_ss internally
// only one broadcast

Admittedly, the broadcast is probably cheap. The more general point is that one can expose the internally needed types in a function’s interface, so that the caller sees them and back-and-forth conversions can be avoided overall.

possible applications

Cool, so we have a function float fun(float, float), and a vectorised version __m256 fun(__m256, __m256) can be used for the three use cases

v[i] = fun(x[i], y[i])
v[i] = fun(x, y[i])
v[i] = fun(x[i], y)

limitations

At the time of writing, I haven’t come up with an example, but I could imagine that a dedicated scalar+vector-argument function could be implemented more efficiently than a broadcast-scalar+vector implementation.

And here I went down a rabbit hole. With __attribute__((const, simd("notinbranch"))), I cannot specify which of the arguments should be packed and which should be scalar. Is this a limitation of the gcc attribute, or a limitation of the x86 vector ABI that the gcc attribute exposes?

closer look at the ABI

The ABI documentation actually answers that question more helpfully than I expected! With examples, which highlight that I should’ve looked at OpenMP:

#pragma omp declare simd uniform(x_1)
float dist(float x_1, float x_2) {
  return x_1+x_2;
}

gives the following symbols

_Z4distff
_ZGVbN4uv__Z4distff
_ZGVbM4uv__Z4distff
_ZGVcN8uv__Z4distff
_ZGVcM8uv__Z4distff
_ZGVdN8uv__Z4distff
_ZGVdM8uv__Z4distff
_ZGVeN16uv__Z4distff
_ZGVeM16uv__Z4distff

Compared to before, we have twice as many symbols, but only because the compiler generated masked and unmasked versions. The other difference is that we have 8uv instead of 8vv at the end of the vectorisation specification, meaning that the first argument is uniform.

Cool, should’ve discovered that earlier[2].

More generalization

Going back, closer to the original AoC puzzle though, we have two argument ranges with different indices:

v[i,j] = dist(x[i], x[j]);

I know mdspan exists, but please treat this as quasi-code; the memory layout of the return value is a bit distracting. Let’s pretend also that the two arguments might not come from the same array (after all, for sufficiently different i and j, they are just different memory addresses):

v[i,j] = dist(x[i], y[j]);

In terms of linear algebra, this reminds me of the question “what is vector times vector?”. Vector-transposed times vector is a scalar, vector times vector-transposed is a matrix, and in numerics software one often needs element-wise products, such that vector times vector is another vector.

Similarly, when I ask for a vectorised version of float fun(float, float), I am thinking of the possibilities

v[i] = fun(x[i], y[i])
v[i] = fun(x, y[i]) or v[i] = fun(x[i], y)
v[i,j] = fun(x[i], y[j])

Though the similarity isn’t great. The linear algebra scalar product needs another reduce somewhere.

It also came to my mind that in differential geometry and/or general relativity, confusing covariant and contravariant indices leads to a similar ambiguity.

EDIT: a section that fell to the cutting room floor

With the above ingredients we can provide these three functions to the vectoriser:

__m256 fun(__m256, __m256) // vv: both packed
__m256 fun(float, __m256)  // uv: first argument uniform
__m256 fun(__m256, float)  // vu: second argument uniform

And a v[i,j] = fun(x[i], y[j]) call could be done with nested loops

// either
for (std::size_t i = 0; i < xs.size(); ++i) {
  auto x = xs[i];
  for (std::size_t j = 0; j < ys.size(); ++j) {
    v[i,j] = fun(x, ys[j]); // y passed as packed float
  }
}

// or
for (std::size_t j = 0; j < ys.size(); ++j) {
  auto y = ys[j];
  for (std::size_t i = 0; i < xs.size(); ++i) {
    v[i,j] = fun(xs[i], y); // x passed as packed float
  }
}

Now I’m making the assumption that the optimizer sees through our gymnastics, recognizes “this is a 2D loop”, and reorders as it sees fit. For that, it would be nice to know whether either of the two half-vectorised fun implementations is more or less efficient than the other, to include that in the optimizer’s cost function. There is no way the compiler of the fun definition can pass that information to the compiler of the call site. The compiler of the call site could generate both versions and let LTO figure it out? Or the programmer could add that information to the declaration (which would require the programmer to know it, though).

In either case, the output format is probably the bigger concern.

END EDIT

JUMP: where I had encountered this issue before

This ambiguity bit us during my time at CERN.

We had a function that was vectorized - not with any of the techniques above. The function was templated, where one of the arguments could be a scalar or packed float, depending on the instantiation, compiler flags, …

template <typename maybe_packed_floats>
maybe_packed_floats function_name(scalar_float, maybe_packed_floats);

This was for the computation of v[i] = function_name(x, y[i]). Then I made the fateful change that would allow vectorisation in the other dimension (intended only for the other dimension): v[i] = function_name(x[i], y).

template <typename maybe_packed_x, typename maybe_packed_y>
??? function_name(maybe_packed_x, maybe_packed_y);

And thanks to our SIMD framework, all combinations of scalar and vector arguments for basic math binary operations were implemented, such that the template definition could be instantiated for these three combinations:

function_name(scalar, packed)
function_name(packed, scalar)
function_name(packed, packed)

Intended application

Strongly simplifying: at the place in the code where I was working, the reconstruction software had already found a bunch of points in 3D and a bunch of lines in 3D.[3] And for some applications we needed “for this given line, compute the distance to all points, and return the minimal one. Do this for all lines. We want the minimal distance for every line.”[4] (These distances are also called “impact parameters”.)

Others needed “for this given point, compute the distances to all lines. And do that for all points. We want, for each point, the close-by lines”.

where things went wrong

Of course, the call sites did implement 2D loops:

for (auto line_like : lines) {
  auto suitable_output = std::ranges::min_element(
      points, /* default less than */ {},
      [&line_like](auto point_like) { return dist(line_like, point_like); });
  // do something with the output
}

Though, let’s rewrite that:

for (std::size_t lines_index = 0; lines_index < lines.size(); ++lines_index) {
  auto line_like = lines[lines_index];
  auto distances_to_points = CONTAINER(points.size());
  for (std::size_t points_index = 0; points_index < points.size(); ++points_index) {
    distances_to_points[points_index] = dist(line_like, points[points_index]);
  }

  // This `min` isn't what I'm talking about, but for our application here,
  // the signature is `__m256 min(__m256[N])`.
  min_impact_parameter[lines_index] = min(distances_to_points);
}

The innocent-looking issue hides in the fact that lines and points were written for SIMD use: begin() and end() gave iterators with stride lengths of 2, 4, 8, …, with x, y, z being __m256 (or similar) data types representing 2, 4, 8, … lines or points.

Before my change, points could only do a stride length of one. After my change, both lines and points had stride lengths of 2, 4, 8, … who really remembers, let’s say stride length 8.

And as a result, we computed for line 0 the distances to points 0, 8, 16, 24, 32, 40, …. For line 1, we computed the distance to points 1, 9, 17, 25, 33, 41, ….

Instead of filling a matrix with all the line-point distances, we only filled a matrix composed of 8-by-8 diagonal matrices.

Unit tests to the rescue

While build-time tests completely missed that I had fucked up the dimensions of the data, unit tests quickly spotted that the impact parameter computation was far off. (Looking at line 0, the minimal IP was now not the minimum over distances to points 0, 1, 2, 3, 4, … but only over 0, 8, 16, … and thus way larger.)

Conclusion

None so far. It seems to me that it was a flaw of the framework that it lets one build the API of dist in a way that vectorises correctly over either argument but incorrectly - unintendedly - over both. The fact that the arguments are templated obfuscates the matter further. Of course we could’ve added a SFINAE check that prevents double-vectorisation, but what I would’ve really wanted is a way to say “these two use different indices”.

But even if we had had something for that, one hits the issue of output types. When I want to implement v[i,j] = fun(x[i], y[j]), the memory layout of the output is far from trivial: whether it’s row-major or column-major, the stride between columns or rows is unknown, and the output memory has gaps.

And it all makes me beg for “can’t the compiler’s optimizer figure this out for me?”.

There already is an unseq version of some standard library functions, including for_each. But I don’t see how to do nested loops / 2D loops with it and let the compiler decide “you figure out which should be the inner loop”.

Not mentioned

Above, I skip over the part where we go from a single float argument to a line in 3D. We did quite some work in that part - and I should say “they” instead of “we”, because truly the memory management for AVX-512 by Arthur Hennequin was pretty impressive, and then there were the PrZip SIMD iterators (not even sure by whom). Manuel’s SOA Container library could’ve been mentioned, too. Oh, and by now there’s SIMD support in the C++ standard (or is it only a proposal?); I haven’t looked at it.

More importantly, one may ask if it’s actually worth implementing something like this:

struct Point {
  __m256 x;
  __m256 y;
  __m256 z;
};
struct Line {
  Point a_point_on_line;
  __m256 direction_x;
  __m256 direction_y;
  __m256 direction_z;
};

auto some_geometry(Line, Point) {
}

Because it might actually turn out that already a scalar function

struct Point {
  float x;
  float y;
  float z;
};
struct Line {
  Point a_point_on_line;
  float direction_x;
  float direction_y;
  float direction_z;
};

auto some_geometry(Line, Point) {
}

would make use of vector instructions, by putting x, y, z into a packed float.

comments

discuss this post on reddit if you like

EDIT history


  1. or at least all pairwise squared distances. It’s sufficient to compute Δx^2+Δy^2+Δz^2; no need to take the square root. Maybe one doesn’t even need to compute all squared distances - I haven’t tried to solve the puzzle yet and haven’t checked whether there’s a way to be more efficient and avoid many computations. ↩︎

  2. Turns out, one can even add two of these pragmas to a function, #pragma omp declare simd uniform(x_1) and #pragma omp declare simd uniform(x_2), and it will generate variants for both v[i] = fun(x1, x2[i]) and v[i] = fun(x1[i], x2). ↩︎

  3. We’re also talking about points and lines with uncertain coordinates, such that the geometric calculations might or might not have to take the uncertainties into account and compute a result with uncertainties - and between all the input quantities there might be correlated uncertainties … Needless to say, while this sounds like it’s doable with high-school math, it isn’t, and it requires a decent bit of code. ↩︎

  4. For the experts: we’re looking for B decay products; these are identified - not perfectly, but pretty well - by having a non-zero impact parameter (IP) with respect to all primary vertices. So starting from a sample of tracks, the ones with large IP are the interesting ones. ↩︎