
I need to calculate a 2D matrix multiplied by a 2D vector. Both use 32-bit floats. I’m hoping to do this using SSE (any version, really) for speed optimization purposes, as I’m going to be using it for real-time audio processing.

So the formula I would need is the following:

left  = L*A + R*B
right = L*C + R*D

I was thinking of reading the whole matrix from memory as one 128-bit SIMD value (4 × 32-bit floats) if that makes sense. But if it’s a better idea to process this in smaller pieces, then that’s fine too.

L & R variables will be in their own floats when the processing begin, so they would need to be moved into the SIMD register/variable and when the calculation is done, moved back into regular variables.

The IDEs I’m hoping to compile this in are Xcode and Visual Studio, so the code needs to build and run properly under both Clang and Microsoft’s own compiler.

All help is welcome. Thank you in advance!

I already tried reading the SSE instruction set references, but there is so much content in there that it would take a very long time to find the suitable instructions and then the corresponding intrinsics to get anything working.
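To show roughly what I mean, here is my untested sketch using SSE intrinsics (hypothetical code; I’m not at all sure these are the right instructions, which is exactly the problem):

#include <xmmintrin.h>  // SSE1 intrinsics

// Untested sketch: matrix stored in row-major order as [A, B, C, D].
// Computes left = L*A + R*B and right = L*C + R*D for one sample pair.
void MixSample(const float matrix[4], float& l, float& r)
{
    __m128 m    = _mm_loadu_ps(matrix);          // [A, B, C, D]
    __m128 lr   = _mm_set_ps(r, l, r, l);        // [L, R, L, R] (set_ps takes args high-to-low)
    __m128 prod = _mm_mul_ps(m, lr);             // [L*A, R*B, L*C, R*D]
    // Pull the even and odd lanes apart, then add them pairwise:
    __m128 evens = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 2, 2, 0)); // [L*A, L*C, ..]
    __m128 odds  = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(3, 3, 3, 1)); // [R*B, R*D, ..]
    __m128 sums  = _mm_add_ps(evens, odds);      // [left, right, ..]
    l = _mm_cvtss_f32(sums);                                                // lane 0
    r = _mm_cvtss_f32(_mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 1, 1, 1))); // lane 1
}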

ADDITIONAL INFORMATION BASED ON YOUR QUESTIONS:

  1. The L & R data comes from their own arrays of data. I have pointers to each of the two arrays (L & R) and then go through them at the same time. So the left/right audio channel data is not interleaved but have their own pointers. In other words, the data is arranged like: LLLLLLLLL RRRRRRRRRR.

  2. Some really good points have been made in the comments about modern compilers being able to optimize the code really well. In particular, since multiplication is quite fast while shuffling data inside SIMD registers has a cost, using a few extra multiplications can still be faster than shuffling the data multiple times. I didn’t realise modern compilers could be that good these days. I’ll experiment on Godbolt with std::array and see what kind of results I get for my particular case.

  3. The data needs to be in 32-bit floats, as that is what’s used all over the application, so 16-bit doesn’t work for my case.

MORE INFORMATION BASED ON MY TESTS:

I used Godbolt.org to test how the compiler optimizes my code. What I found is that if I do the following, I don’t get optimal code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[1];
    result[1] = v[0]*m[2] + v[1]*m[3];

    return result;
}

But if I do the following, I do get quite nice code:

using Vec2 = std::array<float, 2>;
using Mat2 = std::array<float, 4>;

Vec2 Multiply2D(const Mat2& m, const Vec2& v)
{
    Vec2 result;

    result[0] = v[0]*m[0] + v[1]*m[2];
    result[1] = v[0]*m[1] + v[1]*m[3];

    return result;
}

Meaning that if I transpose the 2D matrix (store it column-major), the compiler seems to output pretty good results as is. I believe I should go with this method, since the compiler seems to be able to handle the code nicely.
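For my formula, that means building the matrix transposed from the coefficients. A small (hypothetical) helper makes the A/B/C/D mapping explicit:

// Hypothetical helper: store the 2x2 matrix transposed (column-major),
// so that Multiply2D(m, {L, R}) returns {L*A + R*B, L*C + R*D}.
constexpr Mat2 MakeMixMatrix(float A, float B, float C, float D)
{
    return Mat2{A, C, B, D};  // m[0]=A, m[1]=C, m[2]=B, m[3]=D
}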

2 Answers


  1. You are better off leaving the assembly generation to your compiler. It is doing a great job from what I could gather.

    Additionally, GCC and Clang (not sure about other compilers) have target attributes that let you compile a function for several different architectures, with the best one picked at runtime.

    For example, consider the following code:

    using Vector = std::array<float, 2>;
    
    using Matrix = std::array<float, 4>;
    
    namespace detail {
    Vector multiply(const Matrix& m, const Vector& v) {
        Vector r;
        r[0] = v[0] * m[0] + v[1] * m[2];
        r[1] = v[0] * m[1] + v[1] * m[3];
        return r;
    }
    }  // namespace detail
    
    __attribute__((target("default"))) 
    Vector multiply(const Matrix& m, const Vector& v) {
        return detail::multiply(m, v);
    }
    
    __attribute__((target("avx"))) 
    Vector multiply(const Matrix& m, const Vector& v) {
        return detail::multiply(m, v);
    }
    

    Assume that you compile it with

    g++ -O3 -march=x86-64 main.cpp -o main
    

    For AVX, it creates perfectly optimized AVX SIMD code:

    multiply(Matrix const&, Vector const&) [clone .avx]:      # @multiply(Matrix const&, Vector const&) [clone .avx]
            vmovsd  (%rdi), %xmm0                   # xmm0 = mem[0],zero
            vmovsd  8(%rdi), %xmm1                  # xmm1 = mem[0],zero
            vbroadcastss    4(%rsi), %xmm2
            vmulps  %xmm1, %xmm2, %xmm1
            vbroadcastss    (%rsi), %xmm2
            vmulps  %xmm0, %xmm2, %xmm0
            vaddps  %xmm1, %xmm0, %xmm0
            retq
    

    While the default implementation uses SSE instructions only:

    multiply(Matrix const&, Vector const&):          # @multiply(Matrix const&, Vector const&)
            movsd   (%rdi), %xmm1                   # xmm1 = mem[0],zero
            movsd   8(%rdi), %xmm2                  # xmm2 = mem[0],zero
            movss   (%rsi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
            movss   4(%rsi), %xmm3                  # xmm3 = mem[0],zero,zero,zero
            shufps  $0, %xmm3, %xmm3                # xmm3 = xmm3[0,0,0,0]
            mulps   %xmm2, %xmm3
            shufps  $0, %xmm0, %xmm0                # xmm0 = xmm0[0,0,0,0]
            mulps   %xmm1, %xmm0
            addps   %xmm3, %xmm0
            retq
    

    You might want to check out this post

    Godbolt link: https://godbolt.org/z/fcKvchvcb
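    For completeness, a hypothetical caller sketch (the coefficient values below are made up): the multi-versioned multiply is called like any ordinary function, and the runtime resolver picks the best clone for the host CPU.

    #include <cstdio>

    int main() {
        Matrix m{0.5f, 0.5f, 0.5f, -0.5f};  // example coefficients, transposed layout
        Vector v{0.25f, -0.75f};            // {L, R}
        Vector r = multiply(m, v);          // dispatches to the AVX clone when supported
        std::printf("left=%f right=%f\n", r[0], r[1]);
    }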

  2. It’s much better to let compilers vectorize over the whole arrays of LLLL and RRRR samples, rather than one left/right sample pair at a time.

    With the same mixing matrix for a whole array of audio samples, you get nice asm with no shuffles. Borrowing code from Nole’s answer just to illustrate the auto-vectorization (you might want to simplify the struct wrappers for your own code):

    struct Vector {
        std::array<float,2> coef;
    };
    
    struct Matrix {
        std::array<float,4> coef;
    };
    
    static Vector multiply(const Matrix& m, const Vector& v) {
        Vector r;
        r.coef[0] = v.coef[0]*m.coef[0] + v.coef[1]*m.coef[2];
        r.coef[1] = v.coef[0]*m.coef[1] + v.coef[1]*m.coef[3];
        return r;
    }
    
    
    // The per-element functions need to inline into the loop,
    // so target attributes need to match or be a superset.  
    // Or better, just don't use target options on the per-sample function
    __attribute__ ((target ("avx,fma")))
    void intermix(float *__restrict left, float *__restrict right, const Matrix &m)
    {
        for (int i=0 ; i<10240 ; i++){
            Vector v = {left[i], right[i]};
            v = multiply(m, v);
            left[i] = v.coef[0];
            right[i] = v.coef[1];
        }
    }
    

    GCC -O3 (without any target options) compiles this to nice AVX1 + FMA code, as per the __attribute__((target("avx,fma"))) (similar to -march=x86-64-v3). (Godbolt)

    # GCC (trunk) -O3
    intermix(float*, float*, Matrix const&):
            vbroadcastss    (%rdx), %ymm5
            vbroadcastss    8(%rdx), %ymm4
            xorl    %eax, %eax
            vbroadcastss    4(%rdx), %ymm3
            vbroadcastss    12(%rdx), %ymm2       # broadcast each matrix element separately
    .L2:
            vmulps  (%rsi,%rax), %ymm4, %ymm1     # a whole vector of 8 R*B
            vmulps  (%rsi,%rax), %ymm2, %ymm0
            vfmadd231ps     (%rdi,%rax), %ymm5, %ymm1
            vfmadd231ps     (%rdi,%rax), %ymm3, %ymm0
            vmovups %ymm1, (%rdi,%rax)
            vmovups %ymm0, (%rsi,%rax)
            addq    $32, %rax
            cmpq    $40960, %rax
            jne     .L2
            vzeroupper
            ret
    main:
            movl    $13, %eax
            ret
    

    Note how there are zero shuffle instructions because the matrix coefficients each get broadcast to a separate vector, so one vmulps can do R*B for 8 R samples in parallel, and so on.

    Unfortunately GCC and Clang both use indexed addressing modes, so the memory-source vmulps and vfmadd instructions un-laminate into 2 uops for the back-end on Intel CPUs. And the stores can’t use the port 7 AGU on HSW/SKL. -march=skylake or any other specific Intel SnB-family uarch doesn’t fix that for either of them. Clang unrolls by default, so the extra pointer increments to avoid indexed addressing modes would be amortized. (It would actually just be 1 extra add instruction, since we’re modifying L and R in-place. You could of course change the function to copy-and-mix.)
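    For illustration, a copy-and-mix variant (my sketch, untested) would just take separate input and output arrays:

    // Hypothetical copy-and-mix version: reads L/R from the inputs and writes the
    // mixed result to separate outputs instead of updating in place.
    __attribute__((target("avx,fma")))
    void intermix_copy(const float *__restrict inL, const float *__restrict inR,
                       float *__restrict outL, float *__restrict outR, const Matrix &m)
    {
        for (int i = 0; i < 10240; i++) {
            Vector v = {inL[i], inR[i]};
            v = multiply(m, v);
            outL[i] = v.coef[0];
            outR[i] = v.coef[1];
        }
    }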

    If the data is hot in L1d cache, it’ll bottleneck on the front-end rather than load+FP throughput, but it still comes relatively close to 2 loads and 2 FMAs per clock.

    Hmm, GCC is saving instructions but costing extra loads by loading the same L and R data twice, as memory-source operands for vmulps and vfmadd...ps. With -march=skylake it doesn’t do that, instead using a separate vmovups (which has no problem with an indexed addressing mode, but the later store still does.)

    I haven’t looked at tuning choices from other GCC versions.

    # GCC (trunk) -O3 -march=skylake   (which implies -mtune=skylake)
    .L2:
            vmovups (%rsi,%rax), %ymm1
            vmovups (%rdi,%rax), %ymm0     # separate loads
            vmulps  %ymm1, %ymm5, %ymm2    # FP instructions using only registers
            vmulps  %ymm1, %ymm3, %ymm1
            vfmadd231ps     %ymm0, %ymm6, %ymm2
            vfmadd132ps     %ymm4, %ymm1, %ymm0
            vmovups %ymm2, (%rdi,%rax)
            vmovups %ymm0, (%rsi,%rax)
            addq    $32, %rax
            cmpq    $40960, %rax
            jne     .L2
    

    This is 10 uops, so can issue 2 cycles per iteration on Ice Lake, 2.5c on Skylake. On Ice Lake, it will sustain 2x 256-bit mul/FMA per clock cycle.

    On Skylake, it doesn’t bottleneck on AGU throughput since it’s 4 uops for ports 2,3 every 2.5 cycles. So that’s fine. No need for indexed addressing modes.
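    If you did want to hand-write the loop with intrinsics, a hypothetical version of the same broadcast strategy (my sketch, untested, AVX1 + FMA) could look like this:

    #include <immintrin.h>

    // Hypothetical hand-written version of the broadcast strategy: splat each
    // matrix coefficient into its own vector once, outside the loop, so the
    // loop body has no shuffles, just loads, mul/FMA, and stores.
    __attribute__((target("avx,fma")))
    void intermix_manual(float *__restrict left, float *__restrict right, const Matrix &m)
    {
        __m256 m0 = _mm256_broadcast_ss(&m.coef[0]);
        __m256 m1 = _mm256_broadcast_ss(&m.coef[1]);
        __m256 m2 = _mm256_broadcast_ss(&m.coef[2]);
        __m256 m3 = _mm256_broadcast_ss(&m.coef[3]);
        for (int i = 0; i < 10240; i += 8) {
            __m256 l = _mm256_loadu_ps(left + i);   // 8 L samples
            __m256 r = _mm256_loadu_ps(right + i);  // 8 R samples
            __m256 newl = _mm256_fmadd_ps(l, m0, _mm256_mul_ps(r, m2)); // L*m0 + R*m2
            __m256 newr = _mm256_fmadd_ps(l, m1, _mm256_mul_ps(r, m3)); // L*m1 + R*m3
            _mm256_storeu_ps(left + i, newl);
            _mm256_storeu_ps(right + i, newr);
        }
    }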
