
The task is to sum the products of multiplying each float in array A with the corresponding element in array B. The arrays can have tens of thousands of elements, and the routine must run on the order of 100,000 times per second to handle a real-time data stream, so performance is key.

I’ve coded it with plain scalar math and again with AVX-512. The AVX-512 version is about 10.6x faster, which surprises me: I expected roughly 16x, since each instruction does 16 operations. Furthermore, while the loop has various overhead (loop variable, increments, the branch back to the top, etc.), the vectorized version executes only 1/16th as many of those as the naive version.

I’m compiling in Visual Studio 2022 Community, in release mode, and running on an i7-11700F.

Here’s the code. I step through the two arrays 16 elements at a time, multiply the corresponding elements, and keep 16 running sums. At the very end of the calculation I use _mm512_reduce_add_ps() to sum those 16 sums.

vector<__m512>      a512In;      // input samples, viewed 16 floats at a time
vector<__m512>      a512IRCurr;  // impulse response, viewed 16 floats at a time
__m512 fOut = _mm512_set1_ps( 0.0f );   // 16 running sums

for ( iSample = 0; iSample < iIterations; iSample++ )
    fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
                                               a512IRCurr[ iSample ] ) );
// ...then, after the loop, _mm512_reduce_add_ps( fOut ) sums the 16 lanes.

I see vmovups doesn’t assume its operand is aligned, and wonder if that’s the problem. However, I also read that the unaligned versions have run at the same speed as the aligned versions for many generations, though with a troubling note that latency may still differ: https://community.intel.com/t5/Intel-ISA-Extensions/what-are-the-performance-implications-of-using-vmovups-and/m-p/1143448 While I’m comfortable with machine language of the 6502 variety, I don’t know modern Intel.

I also wonder whether _mm512_add_ps is the right instruction for a = a + b constructs, or whether there’s a faster a += b style instruction.

    for ( iSample = 0; iSample < iIterations; iSample++ )
00007FF6677B2958  movsxd      r8,edi  
00007FF6677B295B  test        edi,edi  
00007FF6677B295D  jle         Circle2AVR512::ProcessInput+0AEh (07FF6677B298Eh)  
        fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
00007FF6677B295F  movsxd      r9,eax  
00007FF6677B2962  mov         rdx,r11  
00007FF6677B2965  shl         r9,6  
00007FF6677B2969  mov         ecx,edi  
00007FF6677B296B  sub         r9,r11  
00007FF6677B296E  add         r9,qword ptr [r10]  
00007FF6677B2971  vmovups     zmm0,zmmword ptr [r9+rdx]  
00007FF6677B2978  vmulps      zmm1,zmm0,zmmword ptr [rdx]  
00007FF6677B297E  lea         rdx,[rdx+40h]  
00007FF6677B2982  vaddps      zmm2,zmm2,zmm1  
00007FF6677B2988  sub         r8,1  
00007FF6677B298C  jne         Circle2AVR512::ProcessInput+91h (07FF6677B2971h)  
                                                   a512IRCurr[ iSample ] ) );


Answers


  1. Chosen as BEST ANSWER

    The TLDR: it's memory speed. My L1 cache supports 16-18x speedup, L2 about 10-12x, and L3 about 4.3x, when each is compared to a naive single-data-per-instruction C++ implementation. Once data no longer fits in L1, my CPU cannot fully utilize the AVX512 instructions' parallelism. That's the narrow answer to my narrow question.

    [Graph: Speed Relative to Circle2]

    The more in-depth answer: I'm sheepish to take credit for answering when I've gotten so many ideas from comments on the post, but here are my findings.

    The program processes input data sample by sample. The input is stored in a circular queue, and for every new sample the function multiplies the last N samples of input by the corresponding values in a second array, called an impulse response. To be clear, I multiply the newest input sample by cell 0 of the impulse response, the second-newest by cell 1, and so on. The newest input sample, which I multiply by cell 0 this time, will be multiplied by cell 1 when I'm called again for the next input sample, and so on. All of these products are summed, and that sum is the output of the function.

    I made a dozen variations of this calculation, and use as the baseline the one I call "Circle2" (a type of circular queue that doesn't check for wraparound on every operation; instead it is composed of two loops whose starting points and iteration counts are calculated before entering them; a scalar sketch follows below). Circle2AVX512 keeps essentially the same data structures but steps by multiples of 16 and casts float*'s to __m512*. Accesses to the input queue happen to still be aligned, but accesses to the impulse response are mostly unaligned. Circle2AVX512Aligned keeps 16 copies of the impulse response, one at each possible vector alignment; accesses are again always aligned on the circular queue, and now also aligned on the impulse response, by selecting the copy of the impulse response with the desired alignment.
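
    For illustration, here is a minimal scalar sketch of the two-loop circular-queue idea (the names and exact layout are hypothetical, not the actual Circle2 code): afIn is the circular input buffer of N floats, iNewest is the index of the newest sample, and afIR is the impulse response.

        float Circle2Dot( const float* afIn, int N, int iNewest, const float* afIR )
        {
            float fSum = 0.0f;
            // Segment 1: from the newest sample back down to the start of the buffer.
            int iCount1 = iNewest + 1;
            for ( int i = 0; i < iCount1; i++ )
                fSum += afIn[ iNewest - i ] * afIR[ i ];
            // Segment 2: wrap to the end of the buffer for the older samples.
            for ( int i = iCount1; i < N; i++ )
                fSum += afIn[ N - 1 - ( i - iCount1 ) ] * afIR[ i ];
            return fSum;
        }

    Both loops are contiguous, so no per-element wraparound test is needed; the AVX-512 variants described above step by multiples of 16 instead of 1.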

    My CPU, the i7-11700F, has 32 KB of L1D (D = data) cache and 256 KB of L2 cache per core; a 16 MB L3 cache then serves the whole chip. (I was web-surfing, YouTubing and word processing at the time, so not all of this was necessarily available to the test.)

    Looking at Circle2AVX512, my single-threaded app gets a 16-18x speedup over a naively-programmed though optimized C++ implementation, as long as the data (input buffer plus impulse response) is big enough to amortize the fixed overhead but small enough to fit in L1: 32 KB total, or in other words 16 KB for each of the two structures, i.e. 4,000 samples each in 4-byte floats.

    Then performance fell to a lower plateau of 10-12x improvement over the naive single-data math, until the two structures grew past 256 KB, at which point L2 was no longer sufficient.

    Beyond that it maintains a steady 8.4x to 9.0x speedup over the naive code, at least up to 6.4 MB (two 3.2 MB data structures).

    A future direction of research would be to see how it performs once the data no longer fits in the L3 cache either.

    Note that the access pattern is the classic worst case for this kind of cache: linear access over a working set larger than the cache, mockingly dropping data just before you're about to finally get back to it. However, as a direction for further work, the software could also take advantage of the classic best case: extreme locality of reference. If the software were presented with multiple input samples in a group, it could be highly optimized by working through the same ranges of both the input queue and the impulse response before moving on, achieving extremely high locality of reference; a rough sketch follows below. Calculating the first sample of the group would still have to reload all the data into cache, but the rest of the group could free-ride off that loaded cache.
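
    A rough sketch of that grouping idea (hypothetical names, circular wraparound ignored for clarity): compute a block of B outputs per call, so each impulse-response value loaded into cache is reused B times before moving on.

        // afIn[t] is the input sample at time t, afIR the impulse response of
        // length N; computes B consecutive outputs starting at time t0 (t0 >= N-1).
        void BlockedDot( const float* afIn, const float* afIR, int N,
                         int t0, int B, float* afOut )
        {
            for ( int o = 0; o < B; o++ )
                afOut[ o ] = 0.0f;
            for ( int i = 0; i < N; i++ )          // one pass over the impulse response
                for ( int o = 0; o < B; o++ )      // reuse afIR[i] for all B outputs
                    afOut[ o ] += afIn[ t0 + o - i ] * afIR[ i ];
        }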

    Frankly I don't understand why the Circle2AVX512Aligned algorithm performs as well as we see here. Instead of one copy of the impulse response, it has 16 completely separate copies, so it should run out of L1 cache at about 3.75k per data structure (again, the input queue is the same size in bytes as the impulse response; there are simply 16 impulse responses instead of one). And indeed, that's where we see it switch from significantly out-performing the non-aligned version to underperforming it. And yet I'm still confused why the aligned version is so much faster, on a modern CPU that sources tell me should access unaligned memory just as fast, and also why the aligned version doesn't fall further behind than it does. My only guess is that the input queue stays nailed to the cache, so at least we have that going for us.

    Likewise, Circle2AVX512Aligned no longer fits into L2 cache at 256k/17, when the individual structures are about 9.17k each. OK, a gap does open up at that point, but not as big as I'd have expected.

    Finally, Circle2AVX512Aligned no longer fits into L3 cache when the individual structures reach 16M/17 = 963k each, or so. And this time the degradation is much clearer.


  2. Background facts: a mul+add can be done as one FMA (Fused Multiply-Add), e.g. _mm512_fmadd_ps.

    MSVC defaults to #pragma STDC FP_CONTRACT off, i.e. it does not contract a*b+c into fma(a,b,c). And it doesn’t recognize the pragma; instead you’d need command-line options like -O2 -arch:AVX512 /fp:contract. (-arch:AVX2 /fp:contract also enables 256-bit FMA, despite FMA3 and AVX2 being separate extensions.)

    Also, /fp:fast allows other FP optimizations, like treating FP math as associative, so the compiler can auto-vectorize a dot product; for example, on Godbolt it unrolls with two vector accumulators to hide some FP latency, unlike your manually-vectorized loop, which will bottleneck at one vector per 4 clocks (VADDPS latency) unless cache/memory bandwidth is a worse bottleneck.

    But you probably compiled without those options, so the scalar C++ loop compiled to scalar asm using vmulss and vaddss, i.e. a scalar equivalent of how your manually-vectorized loop compiled. (MSVC normally doesn’t optimize intrinsics, but it actually will contract _mm256_add_ps(_mm256_mul_ps(a,b), c) into _mm256_fmadd_ps(a,b,c). I’m pretty sure it won’t unroll a manually vectorized loop into multiple __m512 or __m256 vector accumulators, though, even with /fp:fast.)
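
    As a sketch, reusing the question's variable names (declarations omitted as in the question), the manually-vectorized loop written with an explicit FMA intrinsic, so it no longer depends on the compiler contracting the mul+add:

        __m512 fOut = _mm512_setzero_ps();
        for ( iSample = 0; iSample < iIterations; iSample++ )
            fOut = _mm512_fmadd_ps( a512In[ iPos++ ], a512IRCurr[ iSample ], fOut );
        float fSum = _mm512_reduce_add_ps( fOut );   // horizontal sum of the 16 lanes

    This still has one serial dependency chain through fOut, so by itself it mostly saves uops; the latency bottleneck is addressed further down with multiple accumulators.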

    Rocket Lake can do 2x 256-bit FP math operations per clock, or 1x 512-bit. It can do 2x 512-bit loads per clock cycle if they’re aligned and hit in L1d cache (32KiB per core). FP ADD/MUL/FMA each have 4 cycle latency (https://uops.info/ and https://agner.org/optimize/)

    The main bottlenecks for a dot product are:

    • Cache or DRAM bandwidth: 2 loads per clock if hot in L1d cache, less if data is coming from L2 or farther away. See also Intel’s optimization manual re: sustained bandwidth from various levels of cache. You can use perf counters to see misses, using Linux perf stat or whatever equivalent on other OSes, perhaps VTune.

      Also somewhat less if some of the loads are misaligned and cross cache-line boundaries (or worse, page boundaries). e.g. a split load has to access L1d cache twice, so it basically costs double. Alignment-not-required loads like _mm512_loadu_ps have no extra cost if the data is actually aligned at run-time.

      But if data isn’t already hot in L1d, there will be some idle L1d cycles anyway as L2 can’t deliver 2 lines per cycle, in fact can’t sustain 1 per cycle. So misaligned arrays only slow down a loop by maybe 2% for 256-bit vectors when data’s coming from DRAM or maybe L3.

      But for 512-bit vectors where every misaligned load must split across cache lines, it’s about 15% lower DRAM bandwidth on Skylake-AVX512, probably running out of line-split buffers and not being able to keep as many loads in flight to max out the memory pipeline. Rocket Lake "client" chips might be different, being newer and having lower-latency uncore since they’re "client" not "server" chips.

    • FP latency of 4 cycles from starting a vaddps or vfma...ps until the result is ready for the next operation. This forms a dependency chain which out-of-order exec can’t hide, so you’re limited to one vector (or scalar) per 4 clocks, unless you use multiple independent __m512 sum0, sum1, sum2, sum3 variables and add them together at the end (see the sketch after this list). See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for more details and benchmark results with 256-bit vectors on Skylake, where 8 or even 12 vector accumulators helped get close to CPU throughput limits.

      With multiple short dot-products, like at the left side of your graphs, out-of-order execution may be overlapping work across iterations of the outer repeat loop, finding instruction-level parallelism between FP adds which doesn’t exist within one longer dot product. This may be part of how you’re getting speedups greater than 16x. Using FMAs instead of separate mul+add would increase the amount of benefit possible, both from increased peak FLOPs and from having fewer uops for the same amount of work making it easier for out-of-order exec to see farther.

      If your dot products were better optimized, there wouldn’t be so much left to gain, but your naive scalar version bottlenecks on one add per 4 cycles, as well as only doing one element per add instruction.

    • FMA or MUL/ADD throughput: well actually, each FMA (or separate MUL/ADD pair) requires 2 loads, so a well-optimized dot product will bottleneck on 2 loads per clock and thus one FMA per clock.

      For Rocket Lake with 512-bit vectors, 1 FMA per clock is actually the max, unlike with 256-bit or narrower where the FP ALUs on port 0 and 1 work separately to start 2x MUL/ADD/SUB/FMA per clock cycle. (With 4 cycle latency, fully pipelined, so up to 8 operations in flight). So once you unroll with multiple accumulators, there is actually throughput to be gained from replacing MUL+ADD with FMA, at least when data is hot in L1d cache.

      On a chip like Cascade Lake or Ice Lake-Xeon with a 2nd 512-bit FMA unit, or with scalar, 128, or 256-bit vectors, in theory there’d be little to gain, only a reduced number of uops to keep up with the loads, where both could be maxed out running vmulps zmm and vaddps zmm. But that would make more heat which could limit turbo clocks.

      On later chips like Alder Lake / Sapphire Rapids which can do 3 loads per clock but still only 2 FP math ops per clock, those 3 loads can feed an average of 1.5 FMAs per clock, and using separate mul+add would be a real bottleneck even with perfect scheduling of uops. Of course assuming you unrolled enough to hide FP latency.
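
    Here is the kind of unroll the latency bullet refers to: a sketch (not the question's exact code) with 4 independent accumulators so 4 FMA chains are in flight; 8 to 12 accumulators would be better when data is hot in L1d. It reuses the question's variable names and assumes iIterations is a multiple of 4.

        __m512 sum0 = _mm512_setzero_ps(), sum1 = _mm512_setzero_ps();
        __m512 sum2 = _mm512_setzero_ps(), sum3 = _mm512_setzero_ps();
        for ( int i = 0; i < iIterations; i += 4, iPos += 4 )
        {
            sum0 = _mm512_fmadd_ps( a512In[ iPos + 0 ], a512IRCurr[ i + 0 ], sum0 );
            sum1 = _mm512_fmadd_ps( a512In[ iPos + 1 ], a512IRCurr[ i + 1 ], sum1 );
            sum2 = _mm512_fmadd_ps( a512In[ iPos + 2 ], a512IRCurr[ i + 2 ], sum2 );
            sum3 = _mm512_fmadd_ps( a512In[ iPos + 3 ], a512IRCurr[ i + 3 ], sum3 );
        }
        float fSum = _mm512_reduce_add_ps( _mm512_add_ps( _mm512_add_ps( sum0, sum1 ),
                                                          _mm512_add_ps( sum2, sum3 ) ) );

    With FMA and several independent chains, the loop can approach the 2-loads-per-clock / 1-FMA-per-clock limit instead of one vector per 4 cycles.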


    "(1) Array a512IRCurr is not modified for the life of the run; (5) array a512In is a circular queue with only 1 float written per run."

    (5) If you’re only writing one float per run, (a) isn’t it inconvenient to have a vector<__m512> instead of a vector<float, custom_aligned_allocator>?

    (Also, @chtz suggests that if you need multiple dot-products over this data, you should use an FFT to do the convolution.)

    "I have 16 copies of a512IRCurr, each one shifted by one float, on the (apparently mistaken!?) impression that these accesses HAD to be aligned."

    Indeed that’s terrible, except when your data is really small: if all the copies fit in L1d cache, you get an easy win from aligned loads.

    Loading misaligned data is worth the cost when your arrays are large enough not to fit in L1d cache, as your graph shows. Or perhaps do aligned loads and then extract the unaligned window of data you want with valignd: with one FMA per clock, port 5 is free to run a different 512-bit uop every cycle, such as a shuffle. Except valignd only works with an immediate shift count, making it painful; you’d need 16 versions of the loop. Instead, you might use vpermt2ps (_mm512_permutex2var_ps) with a shuffle-control vector to select the data you want, as sketched below. (The control vector itself can be loaded with one unaligned load from an array of indices 0..31, perhaps with vpmovzxbd so it’s only a 16-byte load and won’t cross a cache-line boundary.) vpermt2ps zmm is single-uop on Intel and on Zen 4. https://uops.info/
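
    A sketch of that aligned-load + shuffle idea (the helper name is made up; AVX-512F only): load two adjacent aligned vectors and extract an arbitrary 16-float window with one vpermt2ps instead of doing a cache-line-splitting unaligned load.

        #include <immintrin.h>
        #include <cstdint>

        // Lane indices 0..31 as bytes; a 16-byte load starting at 'k' gives
        // indices k..k+15. alignas(64) keeps that load inside one cache line.
        alignas(64) static const int8_t aiIdx[32] = {
             0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
            16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31 };

        // Return the 16 floats starting k floats (0..15) into the aligned pair v0:v1.
        static inline __m512 WindowPs( __m512 v0, __m512 v1, int k )
        {
            __m128i bytes = _mm_loadu_si128( (const __m128i*)( aiIdx + k ) );
            __m512i idx   = _mm512_cvtepu8_epi32( bytes );       // vpmovzxbd
            return _mm512_permutex2var_ps( v0, idx, v1 );        // vpermt2ps
        }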

    "I declared the vector to be of __m512 then made float* af = (float*) &... to allow easy access to individual elements."

    float* af = (float*) &... is strict-aliasing UB, unless you use a GNU C extension like typedef float aliasing_float __attribute__((may_alias)). MSVC always works like gcc/clang -fno-strict-aliasing so your code is actually safe there, but not portable.

    See GCC AVX __m256i cast to int array leads to wrong values for a real-life case of breakage. It might actually be safe to point a float* at __m512 data, because GNU C defines __m512 as a vector of float elements, but I wouldn’t count on it being fully safe. I’d still go with an aligned vector or array of float, even though it’s somewhat of a pain to get C++ to do over-aligned allocations for a std::vector: you need a custom allocator as the 2nd template parameter, and the standard library doesn’t come with one (a sketch follows below).
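
    For example, a minimal aligned-allocator sketch (C++17 for the aligned operator new; the AlignedAllocator and AlignedFloatVec names are made up for this example) giving a std::vector<float> whose data can be used with aligned 512-bit loads:

        #include <cstddef>
        #include <new>
        #include <vector>

        template <class T, std::size_t Align>
        struct AlignedAllocator {
            using value_type = T;
            AlignedAllocator() = default;
            template <class U> AlignedAllocator( const AlignedAllocator<U, Align>& ) {}
            template <class U> struct rebind { using other = AlignedAllocator<U, Align>; };
            T* allocate( std::size_t n ) {
                // C++17 aligned allocation
                return static_cast<T*>( ::operator new( n * sizeof(T), std::align_val_t( Align ) ) );
            }
            void deallocate( T* p, std::size_t ) noexcept {
                ::operator delete( p, std::align_val_t( Align ) );
            }
            template <class U> bool operator==( const AlignedAllocator<U, Align>& ) const { return true; }
            template <class U> bool operator!=( const AlignedAllocator<U, Align>& ) const { return false; }
        };

        using AlignedFloatVec = std::vector<float, AlignedAllocator<float, 64>>;
        // e.g.  AlignedFloatVec afIR( nTaps );               // afIR.data() is 64-byte aligned
        //       __m512 v = _mm512_load_ps( afIR.data() + 16*i );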

    "I updated the graph in my answer adding a hand-unrolled loop (just 8 multiplies and adds in a row). I’m surprised at how fast it is, and astonished it’s faster than AVX for up to 100 floats. I’m equally astonished that aligned AVX exceeds 16x faster; how the heck is that possible?"

    Hard to say without seeing the asm, or at least the C++ for it and the compilation options. If you used multiple sum variables (accumulators), perhaps it even got auto-vectorized by the compiler using 256-bit vectors?

    If not, 1 scalar FMA (or mul/add) per clock (2 loads / clock, half of max FMA throughput) is half the speed of 1 vector of 8 FMAs per 4 clocks (FMA latency), or 1/4 the speed of 1 vector of 16 FMAs per 4 clocks. But perhaps the compiler does a better job with cleanup than you did for a horizontal sum? You didn’t show that part.

    Of course to run fast for small to medium arrays (many vectors but fitting in L1d), you want 8 to 12 __m256 or __m512 vector accumulators. But that means the size threshold for entering the vectorized loop is higher, and leaves more possible cleanup work.

    If small sizes matter, having a 1-vector cleanup loop as well as a scalar cleanup loop is good.
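
    Hedged sketch of that cleanup structure (hypothetical names; afIn/afIR as flat float arrays; acc is the already-combined vector accumulator carried out of the unrolled main loop, which stopped at index iMain):

        int i = iMain;                              // where the unrolled loop stopped
        for ( ; i + 16 <= iTotal; i += 16 )         // 1-vector-at-a-time cleanup
            acc = _mm512_fmadd_ps( _mm512_loadu_ps( afIn + i ),
                                   _mm512_loadu_ps( afIR + i ), acc );
        float fSum = _mm512_reduce_add_ps( acc );   // horizontal sum of the 16 lanes
        for ( ; i < iTotal; i++ )                   // scalar tail
            fSum += afIn[ i ] * afIR[ i ];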
