Yet another blog restart

05 Jul 2024

Code I wouldn't put into production anymore - manual auto vectorization

Let's be clear: this is no work of my own. It is based on what Florian Lemaitre once explained to me and committed to the LHCb software.1 It was something I had been wondering about for a while and for which I hadn't found resources on the internet.

Why not put it in production anymore?

Thinking of the code that I write for production these days, I'm afraid it's way more important to be able to debug and maintain the code base than to squeeze out the last CPU cycle. Also, from a developer-time point of view, the opportunity cost of writing such blocks is way too high to justify (imo).

In Florian's case, we couldn't simply have traded developer pay for buying more CPUs: the code was rolled out on many systems running continuously with a large code stack, so speeding up a few crucial functions that get called from all over the code base would've made sense. And I'd say having inhomogeneous readability standards across the code base was also justified: a small set of hot, unreadable, very thoroughly tested code, while most of the higher-level code stays readable with lower maintenance cost.

In other words: I don't mean to say such code must not go into production; I'm saying it should be well justified. “I want to use this for fun in Advent of Code” or “Here it really pays off” are good justifications. But if you're just ramping up a code optimization effort, you probably have a better return on investment (of your time) in other places, with lower maintenance cost (and fewer debug nightmares).

Back to the example

The code has by now been removed from the LHCb repositories, though it still exists in the git history. The removal was done with the comment that superseded, outdated, or deprecated code was being removed, even though my impression was that the code had never been adopted in the first place. Sadly, that is a fate that probably too much code faces: a fixed-term contributor implements a functionality / provides an API and deploys it (with tests and documentation), but with the advocate gone, nobody ever calls the function.

But anyway, how can one use the vector unit of a CPU in C++ code? I want to highlight two options:

Let the compiler do everything

Auto vectorizers are good and keep getting better, so maybe the compiler can do all the vectorization for us automatically.

#include <cstddef>
#include <vector>

int some_scalar_function(int);

// Only scalar operations appear in the source; any vectorization is
// entirely up to the compiler's auto vectorizer.
void do_something(std::vector<int>& input, std::vector<int>& output) {
    for (std::size_t i = 0; i < input.size(); i++) {
        output[i] = some_scalar_function(input[i]);
    }
}

In the source code we're only exposed to scalar functions and remain completely oblivious to the vectorization that will happen during compilation.

A downside: we're fully limited to the compiler's capabilities, and if we have evidence (from inspecting the assembly? a printout?) that the compiler does not manage to vectorize some_scalar_function, we have to accept scalar processing even if we could write the vectorized code with intrinsics ourselves.

NB: The issue with the compiler's vectorization might just be that the vector instructions only provide a certain numeric precision while the scalar version provides higher precision, and thus the compiler decides to stick with the scalar version.
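
A concrete case of numerics getting in the way (my own illustration, not from the original post): vectorizing the reduction below would reorder the float additions and change the rounding, so without -ffast-math / -fassociative-math the compiler keeps it scalar.

#include <cstddef>

// Stays scalar unless the compiler is allowed to reassociate float
// additions (e.g. with -ffast-math): vectorizing would reorder the sum
// and change the rounded result.
float sum(const float* data, std::size_t n) {
  float s = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    s += data[i];
  }
  return s;
}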

Also, be aware that any programmer with access to time travel might as well use a future compiler version with better optimization capabilities and rely on auto-vectorization for everything. Any programmer without access to time travel will get criticized for not using the auto-vectorizer if their code from today survives to be read in the future.

Write everything in intrinsics

#include <cstddef>
#include <immintrin.h>
#include <vector>

__m256 some_function(__m256);

// Vector types are visible everywhere; some_function is expected to be
// implemented with intrinsics.
void do_something(std::vector<__m256>& input, std::vector<__m256>& output) {
    for (std::size_t i = 0; i < input.size(); i++) {
        output[i] = some_function(input[i]);
    }
}

Implementing some_function with intrinsics is great because you can go to the limit of what the programming team is capable of and hand-code everything that the compiler can't do.

It's usually harder to maintain than scalar code, harder to port to other platforms, and has basically no transition path to future architectures with larger vector registers.

When used in larger projects, one also faces the issue that either vector types and functions are used everywhere, or one needs to convert back and forth between scalar and vector types. (If I recall correctly, these data types don't actually get converted by any generated code: C++ merely attaches types to a block of memory, and one may not simply read a bunch of floats as __m256. When "converting" between them, one only calls functions to make the compiler happy and have valid C++; the CPU doesn't have to do any extra work.)
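
To make that boundary concrete, here is a minimal sketch (mine, assuming AVX and an input size that is a multiple of 8) of the glue code I mean: the loads and stores are the same memory accesses a scalar loop would perform anyway, and the "conversion" into __m256 itself costs nothing extra.

#include <cstddef>
#include <immintrin.h>
#include <vector>

__m256 some_function(__m256);

// Glue between the scalar world (std::vector<float>) and the vector world
// (__m256). Assumes input.size() is a multiple of 8 and output is as large.
void do_something(const std::vector<float>& input, std::vector<float>& output) {
  for (std::size_t i = 0; i < input.size(); i += 8) {
    __m256 in = _mm256_loadu_ps(&input[i]);           // load 8 floats into one __m256
    _mm256_storeu_ps(&output[i], some_function(in));  // store 8 floats back
  }
}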

So what do I want?

What I want is ultimately the ability to provide the implementation of a function with intrinsics as __m256 f(__m256), and tell the auto vectorizer that it shall use it for a scalar for-loop calling output[i] = f(input[i]). In other words, I aim to provide an implementation __m256 f(__m256) which shall be called like a float f(float) function.

Call site

In my example f is uncreatively named wubbel and the call site is

  for (std::size_t i = 0; i < SIZE; ++i) {
    t[i] = wubbel(v[i]);
  }

with the function declaration

__attribute__((const, simd("notinbranch")))
float wubbel(float x) noexcept asm("_Z6wubbelf");

The symbol _Z6wubbelf is just the regular mangled assembly name of a function called wubbel taking a single float argument: _Z starts the mangled name, 6wubbel is the length-prefixed function name, and the trailing f encodes the float parameter.

In the assembly of the call site, one can see that wubbel gets called through multiple different symbols, such as

        vmovups ymm0, YMMWORD PTR [r12]
        add     r12, 32
        add     r15, 32
        call    _ZGVcN8v_wubbel(float)
        vmovups YMMWORD PTR [r15-32], ymm0

or

        vmovups xmm0, XMMWORD PTR [r14+rax*4]
        lea     rdx, [r13+0+rax*4]
        mov     QWORD PTR [rbp-72], rax
        mov     QWORD PTR [rbp-80], rdx
        call    _ZGVbN4v_wubbel(float)

These symbols are explained on sourceware.org, though I'll try to repeat the bits that I understood: the _ZGV prefix marks a vector variant of a function; the following letter encodes the instruction set of the variant (b for the 128-bit SSE registers, c for the 256-bit AVX registers); N means the unmasked variant, i.e. "notinbranch"; the digit is the number of lanes processed per call; v says the argument is passed in a vector register; and after the final underscore comes the (mangled) name of the original scalar function.

So in short, the simd("notinbranch") attribute told the compiler: "this function is available in multiple SIMD versions to link against".

Implementation site

On the implementation side, if auto vectorization of the function body is possible, the simd attribute can also make the compiler generate code for these symbols.
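
A minimal sketch of what I mean (my own example, assuming GCC, not code from the post): if the scalar definition itself carries the attribute, the compiler emits the scalar symbol plus the _ZGV... vector clones generated from the body, provided it manages to vectorize it. Compiling such a snippet with -S is also a convenient way to look up the clone symbol names.

// Hypothetical body, just for illustration: the simd attribute on the
// definition makes GCC emit vector clones (_ZGVbN4v..., _ZGVcN8v..., ...)
// in addition to the scalar symbol.
__attribute__((const, simd("notinbranch")))
float wubbel(float x) noexcept {
  return x * x + 1.0f;
}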

More importantly, one can also override the compiler-generated symbols for a function with custom ones:

#include <immintrin.h>

// The asm label attaches the symbol of the 4-wide vector variant of
// wubbel(float) to this __m128 overload.
__attribute__((const)) __m128 wubbel(__m128 input) noexcept asm("_ZGVbN4v__Z6wubbelf");
__attribute__((const)) __m128 wubbel(__m128 input) noexcept {
  return _mm_hadd_ps(_mm_rsqrt_ps(input), input);
}

In this case, this allows us to implement a function that takes a packed __m128 of floats and returns one, while actually carrying the symbol of the vectorized variant of a function that takes a single float.

NB: The regular mangled symbol for a function that takes and returns __m128 is _Z6wubbelDv4_f.
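
The call-site assembly above also references an 8-wide AVX variant (_ZGVcN8v_...), so one would want to provide that symbol as well. A hedged sketch of what that could look like (not taken from the original code; the intrinsics simply mirror the __m128 version):

#include <immintrin.h>

// Attach the 8-wide AVX vector-variant symbol of wubbel(float) to a
// __m256 overload, analogous to the __m128 version above.
__attribute__((const)) __m256 wubbel(__m256 input) noexcept asm("_ZGVcN8v__Z6wubbelf");
__attribute__((const)) __m256 wubbel(__m256 input) noexcept {
  return _mm256_hadd_ps(_mm256_rsqrt_ps(input), input);
}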

Of course, writing custom symbols for your functions is something that I consider error-prone and would prefer to leave to the compiler. And even assuming we don't introduce errors in the mangled symbol, we rather blindly assume that the data layout of an __m128 argument and that of four packed floats is the same.

Is all of that needed exactly like that?

No. It appears the scalar function declaration doesn't need the user-provided assembly symbol. We also don't need to "guess" mangled names: we could let the compiler generate the symbols by compiling a trivial function with -S and looking them up, and Florian's code contained a small preprocessor macro to generate and test symbol names. But ultimately, what I posted above is what I managed to copy from Florian and got working; I had to find a tradeoff between leaving Chesterton's fence standing and dropping overhead that I didn't need.

Closing remarks

My small test repo is here. There, I actually implement different functions for the different vector widths to see from runtime printout which functions were called.

I also used it in Advent of Code 2022, though completely unnecessarily.


  1. It's funny how I see 1/sqrt(x) getting optimized so often, as if all of human engineering were described by this one function. I've never seen another function get optimized in this way, though at least someone once told me they could write the same for a cube root instead of a square root. ↩︎