I need to multiply a 2×2 matrix with a 2D vector. Both use 32-bit floats. I’m hoping to do this using SSE (any version, really) for speed optimization purposes, as I’m going to be using it for realtime audio processing.
So the formula I would need is the following:
left = L*A + R*B
right = L*C + R*D
I was thinking of reading the whole matrix from memory into a 128-bit SIMD register (4 × 32-bit floats) if it makes sense. But if it’s a better idea to process this in smaller pieces, then that’s fine too.
The L & R variables will be in their own floats when the processing begins, so they would need to be moved into the SIMD register/variable, and when the calculation is done, moved back into regular variables.
The IDEs I’m hoping to get it compiled in are Xcode and Visual Studio, so I guess that’ll be Clang and Microsoft’s own compiler that this would need to run properly on.
All help is welcome. Thank you in advance!
I already tried reading SSE instruction sets, but there seems to be so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.
ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:
-
The L & R data comes from their own arrays of data. I have pointers to each of the two arrays (L & R) and then go through them at the same time. So the left/right audio channel data is not interleaved but have their own pointers. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.
-
Some really good points have been made in the comments about modern compilers being able to optimize the code really well. This is especially relevant because multiplication is quite fast while shuffling data inside SIMD registers has a cost: using more multiplications might still be faster than shuffling the data multiple times. I didn’t realise that modern compilers can be that good these days. I’ll experiment on Godbolt with std::array and see what kind of results I get for my particular case.
-
The data needs to be in 32 bit floats, as that is used all over the application. So 16 bit doesn’t work for my case.
MORE INFORMATION BASED ON MY TESTS:
I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don’t get optimal code:
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;
    result[0] = v[0]*m[0] + v[1]*m[1];
    result[1] = v[0]*m[2] + v[1]*m[3];
    return result;
}
But if I do the following, I do get quite nice code:
using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;
    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];
    return result;
}
Meaning that if I transpose the 2D matrix, the compiler seems to output pretty good results as is. I believe I should go with this method since the compiler seems to be able to handle the code nicely.
2 Answers
You are better off leaving the assembly generation to your compiler. It is doing a great job from what I could gather.
Additionally, GCC and Clang (not sure about the others) have attributes that allow you to compile a function for several different architectures, one of which is picked at runtime.
For example, consider the following code:
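The example code block did not survive extraction; a sketch of what such multi-versioned code could look like, using GCC/Clang’s `target_clones` attribute — the listed targets and the function body (the transposed-matrix version from the question) are my assumptions:

```cpp
#include <array>

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;  // stored transposed: {m00, m10, m01, m11}

// Hypothetical sketch: the compiler emits one clone per listed target
// and an ifunc resolver picks the best supported one at program startup.
__attribute__((target_clones("avx2", "default")))
Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;
    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];
    return result;
}
```

Note that `target_clones` relies on ELF ifunc support, so it works on Linux with GCC or recent Clang but not with MSVC; on that compiler you’d dispatch manually (e.g. via `__cpuid` and a function pointer).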
Assume that you compile it with optimizations enabled. For AVX targets it creates perfectly optimized AVX SIMD code, while the default implementation uses SSE instructions only.
You might want to check out this post.
Godbolt link: https://godbolt.org/z/fcKvchvcb
It’s much better to let compilers vectorize over the whole arrays of LLLL and RRRR samples, not one left, right sample pair at a time. With the same mixing matrix for a whole array of audio samples, you get nice asm with no shuffles. Borrowing code from Nole’s answer just to illustrate the auto-vectorization (you might want to simplify the target attributes):
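The borrowed code block itself was lost in extraction; a minimal reconstruction of the whole-array version being described, assuming in-place mixing of the two separate channel arrays (the function and parameter names are mine; the `target` attribute matches the one quoted below):

```cpp
#include <cstddef>

// Hedged reconstruction of the lost block: apply one 2x2 mixing matrix
// to whole L and R arrays in place.  The loop is simple enough for
// GCC/Clang to auto-vectorize: each coefficient gets broadcast to its
// own vector, so the loop body becomes broadcasts, mulps and FMAs with
// no shuffles.
__attribute__((target("avx,fma")))
void MixBlock(float* L, float* R, std::size_t n,
              float A, float B, float C, float D)
{
    for (std::size_t i = 0; i < n; ++i) {
        float l = L[i];
        float r = R[i];
        L[i] = l*A + r*B;   // new left sample
        R[i] = l*C + r*D;   // new right sample
    }
}
```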
GCC -O3 (without any target options) compiles this to nice AVX1 + FMA code, as per the __attribute__((target("avx,fma"))) (similar to -march=x86-64-v3). (Godbolt)

Note how there are zero shuffle instructions, because the matrix coefficients each get broadcast to a separate vector, so one vmulps can do R*B for 8 R samples in parallel, and so on.

Unfortunately, GCC and Clang both use indexed addressing modes, so the memory-source vmulps and vfma instructions un-laminate into 2 uops for the back-end on Intel CPUs. And the stores can’t use the port-7 AGU on HSW/SKL.
-march=skylake or any other specific Intel SnB-family uarch doesn’t fix that for either of them. Clang unrolls by default, so the extra pointer increments to avoid indexed addressing modes would be amortized. (It would actually just be 1 extra add instruction, since we’re modifying L and R in-place. You could of course change the function to copy-and-mix.)

If the data is hot in L1d cache, it’ll bottleneck on the front-end rather than on load + FP throughput, but it still comes relatively close to 2 loads and 2 FMAs per clock.
Hmm, GCC is saving instructions but costing extra loads by loading the same L and R data twice, as memory-source operands for vmulps and vfmadd...ps. With -march=skylake it doesn’t do that, instead using a separate vmovups (which has no problem with an indexed addressing mode, but the later store still does). I haven’t looked at tuning choices from other GCC versions.
This is 10 uops, so it can issue at 2 cycles per iteration on Ice Lake, 2.5 c on Skylake. On Ice Lake, it will sustain 2x 256-bit mul/FMA per clock cycle.
On Skylake, it doesn’t bottleneck on AGU throughput since it’s 4 uops for ports 2,3 every 2.5 cycles. So that’s fine. No need for indexed addressing modes.