The task is to sum the products of multiplying each float in array A by the corresponding element in array B. The arrays could have tens of thousands of elements, and the routine must run on the order of 100,000 times per second to handle a real-time data stream, so performance is key.
I’ve coded it using regular math and again with AVX512. The AVX512 version is about 10.6x faster, which is a surprise, as I expected something like 16x given that each instruction does 16 operations. Furthermore, while the loop has various overhead (looping variable, increments, the branch back if continuing, etc.), it does 1/16th as much of that as the naive version.
I’m compiling in Visual Studio 2022 Community, in release mode, and running on an i7-11700F.
Here are the lines of code. I basically step through the two arrays 16 elements at a time, multiply the respective elements, and keep 16 running sums. At the very end of the calculation I use _mm512_reduce_add_ps() to sum those 16 sums.
vector<__m512> a512In;
vector<__m512> a512IRCurr;
__m512 fOut = _mm512_set1_ps( 0.0 );
for ( iSample = 0; iSample < iIterations; iSample++ )
    fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
                                               a512IRCurr[ iSample ] ) );
I see vmovups doesn’t assume the target is aligned, and wonder if that’s the problem. However, I also see that the unaligned versions have been the same speed as the aligned versions for many CPU generations, though there is a troubling note that latency may still differ: https://community.intel.com/t5/Intel-ISA-Extensions/what-are-the-performance-implications-of-using-vmovups-and/m-p/1143448 While I’m comfortable with machine language of the 6502 variety, I don’t know modern Intel.
I also wonder if _mm512_add_ps is the right instruction for a = a + b constructs, or whether there’s a faster a += b type instruction.
for ( iSample = 0; iSample < iIterations; iSample++ )
00007FF6677B2958 movsxd r8,edi
00007FF6677B295B test edi,edi
00007FF6677B295D jle Circle2AVR512::ProcessInput+0AEh (07FF6677B298Eh)
fOut = _mm512_add_ps( fOut, _mm512_mul_ps( a512In[ iPos++ ],
00007FF6677B295F movsxd r9,eax
00007FF6677B2962 mov rdx,r11
00007FF6677B2965 shl r9,6
00007FF6677B2969 mov ecx,edi
00007FF6677B296B sub r9,r11
00007FF6677B296E add r9,qword ptr [r10]
00007FF6677B2971 vmovups zmm0,zmmword ptr [r9+rdx]
00007FF6677B2978 vmulps zmm1,zmm0,zmmword ptr [rdx]
00007FF6677B297E lea rdx,[rdx+40h]
00007FF6677B2982 vaddps zmm2,zmm2,zmm1
00007FF6677B2988 sub r8,1
00007FF6677B298C jne Circle2AVR512::ProcessInput+91h (07FF6677B2971h)
a512IRCurr[ iSample ] ) );
2 Answers
The TLDR: it's memory speed. My L1 cache supports 16-18x speedup, L2 about 10-12x, and L3 about 4.3x, when each is compared to a naive single-data-per-instruction C++ implementation. Once data no longer fits in L1, my CPU cannot fully utilize the AVX512 instructions' parallelism. That's the narrow answer to my narrow question.
The more in-depth answer: I'm sheepish to take credit for answering when I've gotten so many ideas from comments on the post, but here are my findings.
The program processes input data sample by sample. The input is stored in a circular queue, and for every sample the program multiplies each of the last N input samples by the corresponding value in a second array called an impulse response. To be clear, I multiply the newest input sample by cell 0 of the impulse response, the second-newest by cell 1, and so on. The newest input sample, which I multiply by cell 0 this time, will be multiplied by cell 1 instead when I'm called again for the next input sample, and so on. All of these products are summed, and that sum is the output of the function.
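In scalar terms, each call is just a dot product of the newest N input samples against the impulse response; a minimal sketch (illustrative names, not my actual code):

float ProcessOneSample( const float* inputNewestFirst, const float* impulse, int N )
{
    // inputNewestFirst[ 0 ] is the newest sample, inputNewestFirst[ 1 ] the second-newest, etc.
    float sum = 0.0f;
    for ( int i = 0; i < N; i++ )
        sum += inputNewestFirst[ i ] * impulse[ i ];
    return sum;   // one output sample
}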
I made a dozen variations on this dot-product calculation, and use as the baseline one I called "Circle2" (a type of circular queue that doesn't check for wraparound with every operation; rather, it is composed of two loops whose starting point and iteration count are calculated before entering the loops). Circle2AVX512 basically keeps the same data structures but steps by multiples of 16 and casts float*'s to __m512*. Accesses to the input queue happen to still be aligned, but accesses to the impulse response are mostly unaligned. Circle2AVX512Aligned keeps 16 copies of the impulse response, one at each possible alignment. Accesses are again always aligned on the circular queue, but now also aligned on the impulse response, by selecting the copy of the impulse response with the desired alignment.

My CPU, the i7-11700F, has 32 KB of L1D (D = data) cache and 256 KB of L2 cache per core. Then 16 MB of L3 cache serves the whole chip. (I was web-surfing, youtubing and word processing, so not all of this was necessarily available to the test.)
Looking at Circle2AVX512: my single-threaded app was able to get a 16-18x speedup over a naively-programmed (though optimized) C++ implementation, as long as the data (input buffer plus impulse response) was big enough to overcome the fixed overhead but small enough to fit in L1: 32 KB total, or in other words 16 KB for each of the two structures, i.e. about 4,000 samples each in 4-byte floats.
Then performance fell to a lower plateau of a 10-12x improvement over the naive single-data-per-instruction math, until the two structures grew past 256 KB, at which point L2 was no longer sufficient.
At that point it maintains a steady 8.4x to 9.0x speedup over the naive code, at least up to 6.4 MB (two 3.2 MB data structures).
A future direction of research would be to see how it performs once the data no longer fits in L3 cache.
Note that the memory access pattern is the classic worst case for an LRU cache: linear access over data you will eventually circle back to, mockingly dropping each line from the cache just before you finally get back to it. However, as a direction for further work, the software could instead take advantage of the classic best case for an LRU cache: extreme locality of reference. If the software were presented with multiple input samples in a group, it could work through the same ranges of both the input queue and the impulse response for all of them before moving on, achieving very high locality of reference. Calculating the first sample of the group would still have to reload all the data into cache, but the remaining samples in the group could free-ride off that loaded cache.
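A hedged sketch of that grouping idea (illustrative names and layout, not the code measured above): compute M consecutive outputs together, so each stretch of the impulse response is touched once per group instead of once per sample.

void ProcessBlock( const float* input,    // input[ t ] = sample at time t; requires t0 >= N - 1
                   const float* impulse,  // impulse response of length N
                   float* out, int t0, int M, int N )
{
    for ( int j = 0; j < M; j++ )
        out[ j ] = 0.0f;
    for ( int i = 0; i < N; i++ )          // walk the impulse response once...
        for ( int j = 0; j < M; j++ )      // ...and apply tap i to all M outputs while it's hot
            out[ j ] += input[ t0 + j - i ] * impulse[ i ];
}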
Frankly I don't understand why the Circle2AVX512Aligned algorithm performs as well as we see here. Instead of one copy of the impulse response, it has 16 completely separate copies, so it should run out of L1 cache at about 3.75 KB per data structure (again, the input queue is the same size in bytes as the impulse response; there are simply 16 impulse responses instead of one). And indeed, that's where we see it switch from significantly out-performing the non-aligned version to under-performing it. Yet I'm still confused why, on this modern CPU that sources tell me can access unaligned memory just as fast, the aligned version is so much faster. And I'm also confused why the aligned version doesn't get far slower than it does. My only guess is that the input queue stays nailed to the cache, so at least we have that going for us.
Likewise, Circle2AVX512Aligned no longer fits into L2 cache once the individual structures reach about 256 KB / 17 ≈ 15 KB each. OK, a gap does open up at that point, but not as big as I'd have expected.
Finally, Circle2AVX512Aligned no longer fits into L3 cache once the individual structures reach about 16 MB / 17 ≈ 963 KB each. This time the degradation is much clearer.
Background facts: mul+add can be done as one FMA (Fused Multiply-Add), e.g. _mm512_fmadd_ps.
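For example, here's a minimal sketch of the question's inner loop with the mul+add contracted into an FMA by hand (same variable names as the question; assumes the same setup, and is not the OP's actual code):

__m512 fOut = _mm512_setzero_ps();
for ( iSample = 0; iSample < iIterations; iSample++ )
    fOut = _mm512_fmadd_ps( a512In[ iPos++ ], a512IRCurr[ iSample ], fOut );   // one FMA instead of mul + add
float fSum = _mm512_reduce_add_ps( fOut );   // horizontal sum of the 16 lanes, once at the end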
MSVC defaults to #pragma STDC FP_CONTRACT off, not contracting a*b+c into fma(a,b,c). And it doesn’t recognize pragmas for it; instead you’d need command-line options like -O2 -arch:AVX512 /fp:contract. Or -arch:AVX2 /fp:contract also enables 256-bit FMA, despite FMA3 and AVX2 being separate extensions.
Also, /fp:fast allows other FP optimizations, like treating FP math as associative, so it could auto-vectorize a dot product; for example, on Godbolt it unrolls with two vector accumulators to hide some FP latency, unlike your manually-vectorized loop that will bottleneck at one vector per 4 clocks (VADDPS latency), unless cache/memory bandwidth is a worse bottleneck.

But you probably compiled without those options, so the scalar C++ loop compiled to scalar asm using vmulss and vaddss: a scalar asm equivalent of how your manually-vectorized loop compiled. (MSVC normally doesn’t optimize intrinsics, but it actually will contract _mm256_add_ps(_mm256_mul_ps(a,b), c) into _mm256_fmadd_ps(a,b,c). But I’m pretty sure it won’t unroll a manually-vectorized loop with multiple __m512 or __m256 vector accumulators, even with /fp:fast.)

Rocket Lake can do 2x 256-bit FP math operations per clock, or 1x 512-bit. It can do 2x 512-bit loads per clock cycle if they’re aligned and hit in L1d cache (32 KiB per core). FP ADD/MUL/FMA each have 4-cycle latency (https://uops.info/ and https://agner.org/optimize/).
The main bottlenecks for a dot product are:
Cache or DRAM bandwidth: 2 loads per clock if hot in L1d cache, less if data is coming from L2 or farther away. See also Intel’s optimization manual re: sustained bandwidth from various levels of cache. You can use perf counters to see misses, with Linux perf stat or whatever the equivalent is on other OSes, perhaps VTune.

Also somewhat less if some of the loads are misaligned and cross cache-line boundaries (or worse, page boundaries); e.g. a split load has to access L1d cache twice, so it basically costs double. Alignment-not-required loads like _mm512_loadu_ps have no extra cost if the data is actually aligned at run-time.

But if data isn’t already hot in L1d, there will be some idle L1d cycles anyway, as L2 can’t deliver 2 lines per cycle, and in fact can’t sustain 1 per cycle. So misaligned arrays only slow down a loop by maybe 2% for 256-bit vectors when data’s coming from DRAM or maybe L3.
But for 512-bit vectors, where every misaligned load must split across cache lines, it’s about 15% lower DRAM bandwidth on Skylake-AVX512, probably from running out of line-split buffers and not being able to keep as many loads in flight to max out the memory pipeline. Rocket Lake might be different, being newer and having a lower-latency uncore since it’s a "client" chip rather than a "server" chip.
FP latency of 4 cycles from starting a vaddps or vfma...ps until the result is ready for the next operation. This forms a dependency chain which out-of-order exec can’t hide, so you’re limited to one vector (or scalar) per 4 clocks, unless you use multiple independent __m512 sum0, sum1, sum2, sum3 variables and add them together at the end. See "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)" for more details and benchmark results with 256-bit vectors on Skylake, where 8 or even 12 vector accumulators helped get close to CPU throughput limits.

With multiple short dot products, like at the left side of your graphs, out-of-order execution may be overlapping work across iterations of the outer repeat loop, finding instruction-level parallelism between FP adds which doesn’t exist within one longer dot product. This may be part of how you’re getting speedups greater than 16x. Using FMAs instead of separate mul+add would increase the possible benefit, both from increased peak FLOPs and from having fewer uops for the same amount of work, making it easier for out-of-order exec to see farther ahead.
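As a concrete illustration of that unrolling, here's a hedged sketch of a plain dot product with four independent accumulators (the function name is invented here, and it assumes n is a multiple of 64 floats; it is not the question's circular-queue code):

#include <immintrin.h>
#include <cstddef>

float dot_avx512_4acc( const float* a, const float* b, std::size_t n )
{
    __m512 sum0 = _mm512_setzero_ps(), sum1 = _mm512_setzero_ps();
    __m512 sum2 = _mm512_setzero_ps(), sum3 = _mm512_setzero_ps();
    for ( std::size_t i = 0; i < n; i += 64 )   // 4 vectors of 16 floats per iteration
    {
        sum0 = _mm512_fmadd_ps( _mm512_loadu_ps( a + i      ), _mm512_loadu_ps( b + i      ), sum0 );
        sum1 = _mm512_fmadd_ps( _mm512_loadu_ps( a + i + 16 ), _mm512_loadu_ps( b + i + 16 ), sum1 );
        sum2 = _mm512_fmadd_ps( _mm512_loadu_ps( a + i + 32 ), _mm512_loadu_ps( b + i + 32 ), sum2 );
        sum3 = _mm512_fmadd_ps( _mm512_loadu_ps( a + i + 48 ), _mm512_loadu_ps( b + i + 48 ), sum3 );
    }
    // Combine the four independent dependency chains, then do one horizontal sum.
    __m512 sum = _mm512_add_ps( _mm512_add_ps( sum0, sum1 ), _mm512_add_ps( sum2, sum3 ) );
    return _mm512_reduce_add_ps( sum );
}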
If your dot products were better optimized, there wouldn’t be so much left to gain, but your naive scalar version bottlenecks on one add per 4 cycles, as well as only doing one element per add instruction.
FMA or MUL/ADD throughput: well actually, each FMA (or separate MUL/ADD pair) requires 2 loads, so a well-optimized dot product will bottleneck on 2 loads per clock and thus one FMA per clock.
For Rocket Lake with 512-bit vectors, 1 FMA per clock is actually the max, unlike with 256-bit or narrower where the FP ALUs on port 0 and 1 work separately to start 2x MUL/ADD/SUB/FMA per clock cycle. (With 4 cycle latency, fully pipelined, so up to 8 operations in flight). So once you unroll with multiple accumulators, there is actually throughput to be gained from replacing MUL+ADD with FMA, at least when data is hot in L1d cache.
On a chip like Cascade Lake or Ice Lake-Xeon with a 2nd 512-bit FMA unit, or with scalar, 128-bit, or 256-bit vectors, in theory there’d be little to gain, only a reduced number of uops to keep up with the loads, since both ports could be maxed out running vmulps zmm and vaddps zmm. But that would make more heat, which could limit turbo clocks.

On later chips like Alder Lake / Sapphire Rapids, which can do 3 loads per clock but still only 2 FP math ops per clock, those 3 loads can feed an average of 1.5 FMAs per clock, so using separate mul+add would be a real bottleneck even with perfect scheduling of uops (assuming, of course, that you unrolled enough to hide FP latency).
(5) If you’re only writing one float per run, (a) isn’t it inconvenient to have a vector<__m512> instead of a vector<float, custom_aligned_allocator>?
(Also, @chtz suggests that if you need multiple dot-products over this data, you should use an FFT to do the convolution.)
Indeed, keeping 16 copies of the impulse response is terrible, except when your data is small enough that all the copies fit in L1d cache; then you get an easy win from aligned loads.
Loading misaligned data is worth the cost when your arrays are large enough not to fit in L1d cache, as your graph shows. Or you could do aligned loads and then extract the unaligned window of data you want using valignd: with one FMA per clock, port 5 is free to run a different 512-bit uop every cycle, such as a shuffle. Except valignd only works with an immediate shift count, making it painful; you’d need 16 versions of the loop. Instead, you might use vpermt2ps (_mm512_permutex2var_ps) with a shuffle-control vector to select the data you want. (The control vector itself could be loaded with one unaligned load from an array holding 0..31, perhaps stored as bytes and loaded with vpmovzxbd so it’s only a 16-byte load and won’t cross a cache-line boundary.) vpermt2ps zmm is single-uop on Intel and on Zen 4 (https://uops.info/).
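A hedged sketch of that shuffle-based approach (load_window and kWindowIdx are names invented here; base is assumed to be 64-byte aligned, and offset is the misalignment in elements, 0..15):

#include <immintrin.h>
#include <cstdint>

alignas(64) static const uint8_t kWindowIdx[ 32 ] =
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
      16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31 };

static inline __m512 load_window( const float* base, int offset )
{
    __m512 lo = _mm512_load_ps( base );        // aligned load of elements 0..15
    __m512 hi = _mm512_load_ps( base + 16 );   // aligned load of elements 16..31
    // 16-byte unaligned load of the indices offset..offset+15, zero-extended to dwords (vpmovzxbd).
    __m512i idx = _mm512_cvtepu8_epi32(
        _mm_loadu_si128( reinterpret_cast<const __m128i*>( kWindowIdx + offset ) ) );
    return _mm512_permutex2var_ps( lo, idx, hi );   // vpermt2ps: pick the 16-float window across lo:hi
}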
float* af = (float*) &... is strict-aliasing UB, unless you use a GNU C extension like typedef float aliasing_float __attribute__((may_alias)). MSVC always works like gcc/clang -fno-strict-aliasing, so your code is actually safe there, but not portable. See "GCC AVX __m256i cast to int array leads to wrong values" for a real-life case of breakage. It might actually be safe for float onto __m512, because GNU C defines __m512 as a vector of float elements, but I wouldn’t count on it being fully safe. I’d still go with an aligned vector or array of float, even though it’s somewhat of a pain to get C++ to do over-aligned allocations for a std::vector: you need a custom allocator as the 2nd template parameter, and the standard library doesn’t come with one.
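For reference, a minimal sketch of such an allocator (the name AlignedAlloc is invented here, not a standard or MSVC type; it assumes C++17 aligned operator new, which VS 2022 provides):

#include <cstddef>
#include <new>
#include <vector>

template <typename T, std::size_t Alignment>
struct AlignedAlloc
{
    using value_type = T;
    template <typename U> struct rebind { using other = AlignedAlloc<U, Alignment>; };

    AlignedAlloc() = default;
    template <typename U> AlignedAlloc( const AlignedAlloc<U, Alignment>& ) noexcept {}

    T* allocate( std::size_t n )
    {   // over-aligned allocation via C++17 aligned operator new
        return static_cast<T*>( ::operator new( n * sizeof(T), std::align_val_t( Alignment ) ) );
    }
    void deallocate( T* p, std::size_t ) noexcept
    {
        ::operator delete( p, std::align_val_t( Alignment ) );
    }
};
template <typename T, typename U, std::size_t A>
bool operator==( const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>& ) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=( const AlignedAlloc<T, A>&, const AlignedAlloc<U, A>& ) { return false; }

// 64-byte-aligned float storage that _mm512_load_ps can read directly:
using AlignedFloatVec = std::vector<float, AlignedAlloc<float, 64>>;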
Hard to say without seeing the asm, or at least the C++ for it, or the compilation options. If you used multiple sum variables (accumulators), perhaps it even got auto-vectorized by the compiler using 256-bit vectors?

If not, 1 scalar FMA (or mul/add) per clock (2 loads per clock, half of max FMA throughput) is half the speed of 1 vector of 8 FMAs per 4 clocks (FMA latency), or 1/4 the speed of 1 vector of 16 FMAs per 4 clocks. But perhaps the compiler does a better job with cleanup than you did for a horizontal sum? You didn’t show that part.
Of course, to run fast for small-to-medium arrays (many vectors, but fitting in L1d), you want 8 to 12 __m256 or __m512 vector accumulators. But that means the size threshold for entering the vectorized loop is higher, and it leaves more possible cleanup work. If small sizes matter, having a 1-vector cleanup loop as well as a scalar cleanup loop is good.
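A hedged sketch of that cleanup structure (dot_tail is a name invented here; it shows only the tail handling, not the unrolled multi-accumulator main loop):

#include <immintrin.h>
#include <cstddef>

float dot_tail( const float* a, const float* b, std::size_t n )
{
    std::size_t i = 0;
    __m512 sum = _mm512_setzero_ps();
    for ( ; i + 16 <= n; i += 16 )   // 1-vector cleanup loop: whole 16-float chunks
        sum = _mm512_fmadd_ps( _mm512_loadu_ps( a + i ), _mm512_loadu_ps( b + i ), sum );
    float total = _mm512_reduce_add_ps( sum );
    for ( ; i < n; i++ )             // scalar cleanup loop: the last < 16 elements
        total += a[ i ] * b[ i ];
    return total;
}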