
I need to calculate a 2D matrix multiplied by a 2D vector. Both use 32-bit floats. I’m hoping to do this using SSE (any version, really) for speed optimization purposes, as I’m going to be using it for real-time audio processing.

So the formula I would need is the following:

left  = L*A + R*B
right = L*C + R*D

I was thinking of reading the whole matrix from memory as one 128-bit SIMD value (4 × 32-bit floats) if that makes sense. But if it’s a better idea to process this in smaller pieces, then that’s fine too.

L & R variables will be in their own floats when the processing begin, so they would need to be moved into the SIMD register/variable and when the calculation is done, moved back into regular variables.

The IDEs I’m hoping to compile this in are Xcode and Visual Studio, so the code needs to build and run properly under both Clang and Microsoft’s own compiler.

All help is welcome. Thank you in advance!

I already tried reading the SSE instruction set references, but there is so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.
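To show roughly what I mean, here is my untested sketch using SSE intrinsics (hypothetical code; I’m not at all sure these are the right instructions, which is exactly the problem):

#include <xmmintrin.h>  // SSE1 intrinsics

// Untested sketch: matrix stored in row-major order as [A, B, C, D].
// Computes left = L*A + R*B and right = L*C + R*D for one sample pair.
void MixSample(const float matrix[4], float& l, float& r)
{
    __m128 m    = _mm_loadu_ps(matrix);          // [A, B, C, D]
    __m128 lr   = _mm_set_ps(r, l, r, l);        // [L, R, L, R] (set_ps takes args high-to-low)
    __m128 prod = _mm_mul_ps(m, lr);             // [L*A, R*B, L*C, R*D]
    // Pull the even and odd lanes apart, then add them pairwise:
    __m128 evens = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 2, 2, 0)); // [L*A, L*C, ..]
    __m128 odds  = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(3, 3, 3, 1)); // [R*B, R*D, ..]
    __m128 sums  = _mm_add_ps(evens, odds);      // [left, right, ..]
    l = _mm_cvtss_f32(sums);                                                // lane 0
    r = _mm_cvtss_f32(_mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 1, 1, 1))); // lane 1
}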

ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:

  1. The L & R data comes from their own arrays of data. I have pointers to each of the two arrays (L & R) and then go through them at the same time. So the left/right audio channel data is not interleaved but have their own pointers. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.

  2. Some really good points have been made in the comments about modern compilers being able to optimize the code really well. In particular, since multiplication is quite fast while shuffling data inside SIMD registers has a cost, using a few extra multiplications can still be faster than shuffling the data multiple times. I didn’t realise modern compilers could be that good these days. I’ll experiment on Godbolt with std::array and see what kind of results I get for my particular case.

  3. The data needs to be in 32-bit floats, as that is what’s used all over the application, so 16-bit doesn’t work for my case.

MORE INFORMATION BASED ON MY TESTS:

I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don’t get optimal code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[1];
    result[1] = v[0]*m[2] + v[1]*m[3];

    return result;
}

But if I do the following, I do get quite nice code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];

    return result;
}

Meaning that if I transpose the 2D matrix (store it column-major), the compiler seems to output pretty good results as is. I believe I should go with this method, since the compiler seems to be able to handle the code nicely.
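For my formula, that means building the matrix transposed from the coefficients. A small (hypothetical) helper makes the A/B/C/D mapping explicit:

// Hypothetical helper: store the 2x2 matrix transposed (column-major),
// so that Multiply2D(m, {L, R}) returns {L*A + R*B, L*C + R*D}.
constexpr Mat2 MakeMixMatrix(float A, float B, float C, float D)
{
    return Mat2{A, C, B, D};  // m[0]=A, m[1]=C, m[2]=B, m[3]=D
}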

2 Answers


  1. You are better off leaving the assembly generation to your compiler. It is doing a great job from what I could gather.

    Additionally, GCC and Clang (not sure about other compilers) have target attributes that let you compile a function for several different architectures, with the best one picked at runtime.

    For example, consider the following code:

    using Vector = std::array<float, 2>;
    
    using Matrix = std::array<float, 4>;
    
    namespace detail {
    Vector multiply(const Matrix& m, const Vector& v) {
        Vector r;
        r[0] = v[0] * m[0] + v[1] * m[2];
        r[1] = v[0] * m[1] + v[1] * m[3];
        return r;
    }
    }  // namespace detail
    
    __attribute__((target("default"))) 
    Vector multiply(const Matrix& m, const Vector& v) {
        return detail::multiply(m, v);
    }
    
    __attribute__((target("avx"))) 
    Vector multiply(const Matrix& m, const Vector& v) {
        return detail::multiply(m, v);
    }
    

    Assume that you compile it with

    g++ -O3 -march=x86-64 main.cpp -o main
    

    For AVX, it creates perfectly optimized AVX SIMD code:

    multiply(Matrix const&, Vector const&) [clone .avx]:      # @multiply(Matrix const&, Vector const&) [clone .avx]
            vmovsd  (%rdi), %xmm0                   # xmm0 = mem[0],zero
            vmovsd  8(%rdi), %xmm1                  # xmm1 = mem[0],zero
            vbroadcastss    4(%rsi), %xmm2
            vmulps  %xmm1, %xmm2, %xmm1
            vbroadcastss    (%rsi), %xmm2
            vmulps  %xmm0, %xmm2, %xmm0
            vaddps  %xmm1, %xmm0, %xmm0
            retq
    

    While the default implementation uses SSE instructions only:

    multiply(Matrix const&, Vector const&):          # @multiply(Matrix const&, Vector const&)
            movsd   (%rdi), %xmm1                   # xmm1 = mem[0],zero
            movsd   8(%rdi), %xmm2                  # xmm2 = mem[0],zero
            movss   (%rsi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
            movss   4(%rsi), %xmm3                  # xmm3 = mem[0],zero,zero,zero
            shufps  $0, %xmm3, %xmm3                # xmm3 = xmm3[0,0,0,0]
            mulps   %xmm2, %xmm3
            shufps  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]
            mulps   %xmm1, %xmm0
            addps   %xmm3, %xmm0
            retq
    

    You might want to check out this post

    Godbolt link: https://godbolt.org/z/fcKvchvcb
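    For completeness, a hypothetical caller sketch (the coefficient values below are made up): the multi-versioned multiply is called like any ordinary function, and the runtime resolver picks the best clone for the host CPU.

    #include <cstdio>

    int main() {
        Matrix m{0.5f, 0.5f, 0.5f, -0.5f};  // example coefficients, transposed layout
        Vector v{0.25f, -0.75f};            // {L, R}
        Vector r = multiply(m, v);          // dispatches to the AVX clone when supported
        std::printf("left=%f right=%f\n", r[0], r[1]);
    }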

  2. It’s much better to let compilers vectorize over the whole arrays of LLLL and RRRR samples, rather than one left/right sample pair at a time.

    With the same mixing matrix for a whole array of audio samples, you get nice asm with no shuffles. Borrowing code from Nole’s answer just to illustrate the auto-vectorization (you might want to simplify the struct wrappers for your own code):

    struct Vector {
        std::array<float,2> coef;
    };
    
    struct Matrix {
        std::array<float,4> coef;
    };
    
    static Vector multiply(const Matrix& m, const Vector& v) {
        Vector r;
        r.coef[0] = v.coef[0]*m.coef[0] + v.coef[1]*m.coef[2];
        r.coef[1] = v.coef[0]*m.coef[1] + v.coef[1]*m.coef[3];
        return r;
    }
    
    
    // The per-element functions need to inline into the loop,
    // so target attributes need to match or be a superset.  
    // Or better, just don't use target options on the per-sample function
    __attribute__ ((target ("avx,fma")))
    void intermix(float *__restrict left, float *__restrict right, const Matrix &m)
    {
        for (int i=0 ; i<10240 ; i++){
            Vector v = {left[i], right[i]};
            v = multiply(m, v);
            left[i] = v.coef[0];
            right[i] = v.coef[1];
        }
    }
    

    GCC -O3 (without any target options) compiles this to nice AVX1 + FMA code, as per the __attribute__((target("avx,fma"))) (similar to -march=x86-64-v3). (Godbolt)

    # GCC (trunk) -O3
    intermix(float*, float*, Matrix const&):
            vbroadcastss    (%rdx), %ymm5
            vbroadcastss    8(%rdx), %ymm4
            xorl    %eax, %eax
            vbroadcastss    4(%rdx), %ymm3
            vbroadcastss    12(%rdx), %ymm2       # broadcast each matrix element separately
    .L2:
            vmulps  (%rsi,%rax), %ymm4, %ymm1     # a whole vector of 8 R*B
            vmulps  (%rsi,%rax), %ymm2, %ymm0
            vfmadd231ps     (%rdi,%rax), %ymm5, %ymm1
            vfmadd231ps     (%rdi,%rax), %ymm3, %ymm0
            vmovups %ymm1, (%rdi,%rax)
            vmovups %ymm0, (%rsi,%rax)
            addq    $32, %rax
            cmpq    $40960, %rax
            jne     .L2
            vzeroupper
            ret
    main:
            movl    $13, %eax
            ret
    

    Note how there are zero shuffle instructions because the matrix coefficients each get broadcast to a separate vector, so one vmulps can do R*B for 8 R samples in parallel, and so on.

    Unfortunately GCC and Clang both use indexed addressing modes, so the memory-source vmulps and vfmadd instructions un-laminate into 2 uops for the back-end on Intel CPUs. And the stores can’t use the port 7 AGU on HSW/SKL. -march=skylake or any other specific Intel SnB-family uarch doesn’t fix that for either of them. Clang unrolls by default, so the extra pointer increments to avoid indexed addressing modes would be amortized. (It would actually just be 1 extra add instruction, since we’re modifying L and R in-place. You could of course change the function to copy-and-mix.)
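    For illustration, a copy-and-mix variant (my sketch, untested) would just take separate input and output arrays:

    // Hypothetical copy-and-mix version: reads L/R from the inputs and writes the
    // mixed result to separate outputs instead of updating in place.
    __attribute__((target("avx,fma")))
    void intermix_copy(const float *__restrict inL, const float *__restrict inR,
                       float *__restrict outL, float *__restrict outR, const Matrix &m)
    {
        for (int i = 0; i < 10240; i++) {
            Vector v = {inL[i], inR[i]};
            v = multiply(m, v);
            outL[i] = v.coef[0];
            outR[i] = v.coef[1];
        }
    }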

    If the data is hot in L1d cache, it’ll bottleneck on the front-end rather than load+FP throughput, but it still comes relatively close to 2 loads and 2 FMAs per clock.

    Hmm, GCC is saving instructions but costing extra loads by loading the same L and R data twice, as memory-source operands for vmulps and vfmadd...ps. With -march=skylake it doesn’t do that, instead using a separate vmovups (which has no problem with an indexed addressing mode, but the later store still does.)

    I haven’t looked at tuning choices from other GCC versions.

    # GCC (trunk) -O3 -march=skylake   (which implies -mtune=skylake)
    .L2:
            vmovups (%rsi,%rax), %ymm1
            vmovups (%rdi,%rax), %ymm0     # separate loads
            vmulps  %ymm1, %ymm5, %ymm2    # FP instructions using only registers
            vmulps  %ymm1, %ymm3, %ymm1
            vfmadd231ps     %ymm0, %ymm6, %ymm2
            vfmadd132ps     %ymm4, %ymm1, %ymm0
            vmovups %ymm2, (%rdi,%rax)
            vmovups %ymm0, (%rsi,%rax)
            addq    $32, %rax
            cmpq    $40960, %rax
            jne     .L2
    

    This is 10 uops, so can issue 2 cycles per iteration on Ice Lake, 2.5c on Skylake. On Ice Lake, it will sustain 2x 256-bit mul/FMA per clock cycle.

    On Skylake, it doesn’t bottleneck on AGU throughput since it’s 4 uops for ports 2,3 every 2.5 cycles. So that’s fine. No need for indexed addressing modes.
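    If you did want to hand-write the loop with intrinsics, a hypothetical version of the same broadcast strategy (my sketch, untested, AVX1 + FMA) could look like this:

    #include <immintrin.h>

    // Hypothetical hand-written version of the broadcast strategy: splat each
    // matrix coefficient into its own vector once, outside the loop, so the
    // loop body has no shuffles, just loads, mul/FMA, and stores.
    __attribute__((target("avx,fma")))
    void intermix_manual(float *__restrict left, float *__restrict right, const Matrix &m)
    {
        __m256 m0 = _mm256_broadcast_ss(&m.coef[0]);
        __m256 m1 = _mm256_broadcast_ss(&m.coef[1]);
        __m256 m2 = _mm256_broadcast_ss(&m.coef[2]);
        __m256 m3 = _mm256_broadcast_ss(&m.coef[3]);
        for (int i = 0; i < 10240; i += 8) {
            __m256 l = _mm256_loadu_ps(left + i);   // 8 L samples
            __m256 r = _mm256_loadu_ps(right + i);  // 8 R samples
            __m256 newl = _mm256_fmadd_ps(l, m0, _mm256_mul_ps(r, m2)); // L*m0 + R*m2
            __m256 newr = _mm256_fmadd_ps(l, m1, _mm256_mul_ps(r, m3)); // L*m1 + R*m3
            _mm256_storeu_ps(left + i, newl);
            _mm256_storeu_ps(right + i, newr);
        }
    }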
