Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives.
For example:
In [20]: A = np.random.rand(1000)
In [21]: B = np.random.rand(1000)
In [22]: %timeit np.sum(A)
3.21 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [23]: %timeit A.sum()
1.7 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [24]: %timeit np.add.reduce(A)
1.61 µs ± 19.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
But all of them are slower than:
In [25]: %timeit np.dot(A,B)
1.18 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Given that np.dot both multiplies two arrays elementwise and then sums the products, how can it be faster than just summing one array? If B were set to an all-ones array, then np.dot(A, B) would simply be summing A.
So it seems the fastest option to sum A is:
In [26]: O = np.ones(1000)
In [27]: %timeit np.dot(A,O)
1.16 µs ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
This can’t be right, can it?
This is on Ubuntu with numpy 1.24.2 using openblas64 on Python 3.10.6.
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
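If I'm not mistaken, this SIMD report (together with the BLAS build info) can be reproduced with NumPy's own configuration printout:

import numpy as np
np.show_config()  # prints the build/BLAS details and the supported SIMD extensions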
Update
The order of the timings reverses if the array is much longer. That is:
In [28]: A = np.random.rand(1000000)
In [29]: O = np.ones(1000000)
In [30]: %timeit np.dot(A,O)
545 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [31]: %timeit np.sum(A)
429 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [32]: %timeit A.sum()
404 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [33]: %timeit np.add.reduce(A)
401 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
This suggests to me that there is some fixed per-call overhead when calling np.sum(A), A.sum(), or np.add.reduce(A) that doesn't exist when calling np.dot(), but that the part of the code that actually does the summation is in fact faster.
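One way to check the fixed-overhead part of this (a quick sketch of mine, not timings from above) is to repeat the measurements on length-1 slices, where the summation work itself is negligible and nearly all of the time is per-call overhead:

%timeit np.sum(A[:1])         # mostly the np.sum call/dispatch overhead
%timeit A[:1].sum()           # mostly the method-call overhead
%timeit np.dot(A[:1], O[:1])  # mostly the np.dot call overhead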
Any speed-ups using Cython, Numba, plain Python, etc. would be great to see.
2 Answers
numpy.dot delegates to a BLAS vector-vector multiply here, while numpy.sum uses a pairwise summation routine, switching over to an 8x unrolled summation loop at a block size of 128 elements. I don't know what BLAS library your NumPy is using, but a good BLAS will generally take advantage of SIMD operations, while numpy.sum doesn't do that, as far as I can see. Any SIMD usage in the numpy.sum code would have to be compiler autovectorization, which is likely less efficient than what the BLAS does.

When you raise the array sizes to 1 million elements, you're probably hitting a cache threshold. The dot code is working with about 16 MB of data, and the sum code is working with about 8 MB. The dot code might be getting data bumped to a slower cache level or to RAM, or perhaps both dot and sum are working with a slower cache level and dot is performing worse because it needs to read more data. If I try raising the array sizes gradually, the timings are more consistent with some sort of threshold effect than with sum having higher per-element performance.

This answer completes the good answer of @user2357112 by providing additional details. Both functions are optimized. That being said, the pairwise summation used by numpy.sum is generally a bit slower, while generally giving a more accurate result; it is sub-optimal, but still relatively good. OpenBLAS, which is used by default on Windows, does not perform a pairwise summation.
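For reference, pairwise summation conceptually works as in the simplified sketch below (an illustration only, not NumPy's actual C routine, which also unrolls the small-block loop 8x):

import numpy as np

def pairwise_sum(a, block=128):
    # Small blocks are accumulated with a plain loop; larger inputs are
    # split in half and the two partial sums are added. This keeps the
    # rounding-error growth roughly logarithmic in the array length
    # instead of linear, at the cost of a little extra bookkeeping.
    n = a.size
    if n <= block:
        s = 0.0
        for x in a:
            s += x
        return s
    half = n // 2
    return pairwise_sum(a[:half], block) + pairwise_sum(a[half:], block)

A = np.random.rand(1000)
assert np.isclose(pairwise_sum(A), np.sum(A))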
Here is the assembly code for the Numpy code:
Here is the assembly code for the OpenBLAS code:
The main issue with the Numpy code is that it does not use AVX (the 256-bit SIMD instruction set) but SSE (the 128-bit SIMD instruction set), as opposed to OpenBLAS, at least in version 1.22.4 (the one I use) and before. Even worse: the instructions are scalar ones in the Numpy code! We recently worked on this, and the latest version of Numpy should now use AVX. That being said, it may still not be as fast as OpenBLAS because of the pairwise summation (especially for big arrays).
Note that both functions spend a non-negligible amount of time in overheads because the arrays are so small. Such overheads can be removed by using a hand-written implementation in Numba.
This is expected. Indeed, functions are rather compute-bound when they operate in cache, but they become memory-bound when the array is big and has to be fetched from the L3 cache or even the RAM. As a result, np.dot starts to be slower for bigger arrays, since it needs to read twice as much data from memory. More specifically, it needs to read 8*1000000*2/1024**2 ~= 15.3 MiB from memory, so you likely need to read data from your RAM, which has a pretty limited throughput. In fact, a good dual-channel 3200 MHz DDR4 RAM like mine can reach a practical throughput close to 40 GiB/s, and 15.3/(40*1024) s ~= 374 µs. That being said, sequential code can hardly saturate this throughput completely, so reaching 30 GiB/s sequentially is already great, not to mention that many mainstream PC RAM modules operate at a lower frequency. A 30 GiB/s throughput results in ~500 µs, which is close to your timings. Meanwhile, np.sum and np.add.reduce are more compute-bound because of their inefficient implementation, but the amount of data to be read is half as large and may actually fit in the L3 cache, which has a significantly bigger throughput.

To prove this effect, you can simply try to run:
On my machine, the L3 cache has a size of only 9 MiB, so the second call not only needs to read twice as much data, but also has to read more of it from the slower RAM rather than from the L3 cache.
For small arrays, the L1 cache is so fast that reading data should not be a bottleneck. On my i5-9600KF machine, the throughput of the L1 cache is huge: ~268 GiB/s. This means the optimal time to read two arrays of size 1000 is 8*1000*2/(268*1024**3) s ~= 0.056 µs. In practice, the overhead of calling a Numpy function is much bigger than that.

Fast implementation
Here is a fast Numba implementation:
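A minimal sketch of such a kernel, a plain multiply-accumulate loop compiled with fastmath so that LLVM can vectorize the reduction (the name nb_dot and the explicit signature are illustrative, not necessarily the exact original listing):

import numba as nb
import numpy as np

@nb.njit('float64(float64[::1], float64[::1])', fastmath=True)
def nb_dot(a, b):
    # fastmath=True allows the additions to be reassociated, so LLVM can
    # turn this scalar loop into a SIMD reduction.
    assert a.size == b.size
    s = 0.0
    for i in range(a.size):
        s += a[i] * b[i]
    return s

A = np.random.rand(1000)
B = np.random.rand(1000)
assert np.isclose(nb_dot(A, B), np.dot(A, B))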
Here are performance results:
If you use the flag parallel=True and nb.prange instead of range, Numba will use multiple threads. This is good for large arrays, but it may not be for small ones on some machines (due to the overhead of creating the threads and sharing the work among them).

As expected, Numba can be faster for small arrays (because the Numpy call overhead is mostly removed) and competitive with OpenBLAS for large arrays. The code generated by Numba is pretty efficient:
That being said, it is not optimal: the LLVM-Lite JIT compiler uses a 4x unrolling, while an 8x unrolling should be optimal on my Intel Coffee Lake processor. Indeed, the latency of the vaddpd instruction is 4 cycles while 2 instructions can be executed per cycle, so 8 registers are needed to avoid a stall and to keep the resulting code from being latency-bound. Besides, this assembly code is optimal on Intel Alder Lake and Sapphire Rapids processors, since they have a twice lower vaddpd latency. Saturating the FMA SIMD processing units is far from easy. I think the only way to write a faster function is to write (C/C++) native code using SIMD intrinsics, though it is less portable.

Note that the Numba code does not support special numbers like NaN or Inf because of fastmath (AFAIK the OpenBLAS code does). In practice, it should still work on x86-64 machines, but this is not guaranteed. Besides, the Numba code is not numerically stable for very large arrays. The Numpy code should be the most numerically stable of the three variants (followed by the OpenBLAS code). You can compute the sum by chunks to improve the numerical stability, though it makes the code more complex. There is no free lunch.
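A chunked accumulation could look like the sketch below (my own illustration of the idea, not code from the answer); each running sum stays short, which limits the error growth of a single long fastmath accumulation:

import numba as nb
import numpy as np

@nb.njit(fastmath=True)
def nb_dot_chunked(a, b, chunk=8192):
    # Accumulate fixed-size blocks into separate partial sums, then add the
    # partials together; the chunk size is an arbitrary illustrative choice.
    total = 0.0
    for start in range(0, a.size, chunk):
        end = min(start + chunk, a.size)
        partial = 0.0
        for i in range(start, end):
            partial += a[i] * b[i]
        total += partial
    return total

A = np.random.rand(1_000_000)
B = np.random.rand(1_000_000)
assert np.isclose(nb_dot_chunked(A, B), np.dot(A, B))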