
Why is np.dot so much faster than np.sum? Following this answer we know that np.sum is slow and has faster alternatives.

For example:

In [20]: A = np.random.rand(1000)

In [21]: B = np.random.rand(1000)

In [22]: %timeit np.sum(A)
3.21 µs ± 270 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [23]: %timeit A.sum()
1.7 µs ± 11.5 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [24]: %timeit np.add.reduce(A)
1.61 µs ± 19.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

But all of them are slower than:

In [25]: %timeit np.dot(A,B)
1.18 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Given that np.dot multiplies two arrays elementwise and then sums them, how can it be faster than just summing one array? If B were the all-ones array, np.dot would simply be summing A.

So it seems the fastest option to sum A is:

In [26]: O = np.ones(1000)
In [27]: %timeit np.dot(A,O)
1.16 µs ± 6.37 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

This can’t be right, can it?

This is on Ubuntu with numpy 1.24.2 using openblas64 on Python 3.10.6.

Supported SIMD extensions in this NumPy install:

baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2

Update

The order of the timings reverses if the array is much longer. That is:

In [28]: A = np.random.rand(1000000)
In [29]: O = np.ones(1000000)
In [30]: %timeit np.dot(A,O)
545 µs ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [31]: %timeit np.sum(A)
429 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)    
In [32]: %timeit A.sum()
404 µs ± 2.95 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [33]: %timeit np.add.reduce(A)
401 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

This suggests to me that there is some fixed per-call overhead when calling np.sum(A), A.sum(), or np.add.reduce(A) that doesn’t exist when calling np.dot(), but that the code doing the actual summation is in fact faster.
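
One way to check this hypothesis would be to repeat the timings on a tiny array, where the summation work itself is negligible and mostly the per-call cost remains. A sketch (results not shown here):

tiny = np.random.rand(8)
tiny_ones = np.ones(8)
%timeit np.dot(tiny, tiny_ones)   # should be dominated by the BLAS call overhead
%timeit np.sum(tiny)              # should be dominated by dispatch/reduction setup
%timeit tiny.sum()
%timeit np.add.reduce(tiny)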

——————————-

Any speed-ups using Cython, Numba, plain Python, etc. would be great to see.

2 Answers


  1. numpy.dot delegates to a BLAS vector-vector multiply here, while numpy.sum uses a pairwise summation routine, switching over to an 8x unrolled summation loop at a block size of 128 elements. (A simplified Python sketch of this pairwise scheme is included at the end of this answer.)

    I don’t know what BLAS library your NumPy is using, but a good BLAS will generally take advantage of SIMD operations, while numpy.sum doesn’t do that, as far as I can see. Any SIMD usage in the numpy.sum code would have to be compiler autovectorization, which is likely less efficient than what the BLAS does.

    When you raise the array sizes to 1 million elements, you’re probably hitting a cache threshold. The dot code is working with about 16 MB of data, and the sum code is working with about 8 MB. The dot code might be getting data bumped to a slower cache level or to RAM, or perhaps both dot and sum are working with a slower cache level and dot is performing worse because it needs to read more data. If I try raising the array sizes gradually, the timings are more consistent with some sort of threshold effect than with sum having higher per-element performance.

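    For reference, here is a simplified Python sketch of that pairwise scheme (illustrative only; the actual implementation is C code, and the base case is unrolled 8x rather than being a plain loop):

    import numpy as np

    def pairwise_sum(a, block=128):
        # Recurse until blocks are at most `block` elements, then sum the
        # block directly. NumPy's C code unrolls this base case 8x; a plain
        # loop is shown here only for clarity.
        n = a.size
        if n <= block:
            s = 0.0
            for x in a:
                s += x
            return s
        half = n // 2
        return pairwise_sum(a[:half], block) + pairwise_sum(a[half:], block)
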
  2. This answer complements the good answer of @user2357112 with additional details. Both functions are optimized. That being said, the pairwise summation is generally a bit slower, while generally providing a more accurate result. It is also sub-optimal, though still relatively good. OpenBLAS, which is used by default on Windows, does not perform a pairwise summation.

    Here is the assembly code for the Numpy code:

    [screenshot: assembly of the NumPy summation loop]

    Here is the assembly code for the OpenBLAS code:

    [screenshot: assembly of the OpenBLAS dot kernel]

    The main issue with the Numpy code is that it does not use AVX (the 256-bit SIMD instruction set) but SSE (the 128-bit SIMD instruction set), as opposed to OpenBLAS, at least in version 1.22.4 (the one I use) and earlier. Even worse: the instructions are scalar ones in the Numpy code! We recently worked on this, and the latest version of Numpy should now use AVX. That being said, it may still not be as fast as OpenBLAS because of the pairwise summation (especially for big arrays).

    Note that both functions spend a non-negligible time in overheads because the arrays are so small. Such overheads can be removed using a hand-written implementation in Numba.

    The order of the timings reverses if the array is much longer.

    This is expected. Indeed, the functions are rather compute-bound when they operate in cache, but they become memory-bound when the array is big and only fits in the L3 cache or even in RAM. As a result, np.dot starts to be slower for bigger arrays since it needs to read twice as much data from memory. More specifically, it needs to read 8*1000000*2/1024**2 ~= 15.3 MiB from memory, so you likely need to read data from your RAM, which has a rather limited throughput. In fact, a good dual-channel 3200 MHz DDR4 RAM like mine can reach a practical throughput close to 40 GiB/s, and 15.3/(40*1024) s ~= 374 µs. That being said, sequential code can hardly saturate this throughput completely, so reaching 30 GiB/s in sequential code is already great, not to mention that many mainstream PC RAMs operate at a lower frequency. A 30 GiB/s throughput results in ~500 µs, which is close to your timings. Meanwhile, np.sum and np.add.reduce are more compute-bound because of their inefficient implementation, but they only need to read half as much data, which may actually fit better in the L3 cache, which has a significantly bigger throughput.

    To prove this effect, you can simply try to run:

    # L3 cache of 9 MiB
    
    # 2 x 22.9 = 45.8 MiB
    a = np.ones(3_000_000)
    b = np.ones(3_000_000)
    %timeit -n 100 np.dot(a, a)   #  494 µs => read from RAM
    %timeit -n 100 np.dot(a, b)   # 1007 µs => read from RAM
    
    # 2 x 7.6 = 15.2 MiB
    a = np.ones(1_000_000)
    b = np.ones(1_000_000)
    %timeit -n 100 np.dot(a, a)   #  90 µs => read from the L3 cache
    %timeit -n 100 np.dot(a, b)   # 283 µs => read from RAM
    
    # 2 x 1.9 = 3.8 MiB
    a = np.ones(250_000)
    b = np.ones(250_000)
    %timeit -n 100 np.dot(a, a)   # 40 µs => read from the L3 cache (quite compute-bound)
    %timeit -n 100 np.dot(a, b)   # 46 µs => read from the L3 cache too (quite memory-bound)
    

    On my machine, the L3 cache is only 9 MiB, so the second call not only needs to read twice as much data, but also needs to read more of it from the slower RAM rather than from the L3 cache.

    For small arrays, the L1 cache is so fast that reading data should not be a bottleneck. On my i5-9600KF machine, the throughput of the L1 cache is huge: ~268 GiB/s. This means the optimal time to read two arrays of size 1000 is 8*1000*2/(268*1024**3) s ~= 0.056 µs. In practice, the overhead of calling a Numpy function is much bigger than that.
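
    As a quick sanity check of that figure:

    bytes_to_read = 8 * 1000 * 2                # two float64 arrays of 1000 items
    l1_throughput = 268 * 1024**3               # ~268 GiB/s (figure quoted above)
    print(bytes_to_read / l1_throughput * 1e6)  # ~0.056 µs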


    Fast implementation

    Here is a fast Numba implementation:

    import numba as nb
    
    # Function eagerly compiled only for 64-bit contiguous arrays
    @nb.njit('float64(float64[::1],)', fastmath=True)
    def fast_sum(arr):
        s = 0.0
        for i in range(arr.size):
            s += arr[i]
        return s
    

    Here are performance results:

     array items |    time    |  speedup (dot/numba_seq)
    -------------|------------|--------------------------
     3_000_000   |   870 µs   |   x0.57
     1_000_000   |   183 µs   |   x0.49
       250_000   |    29 µs   |   x1.38
    

    If you use the flag parallel=True and nb.prange instead of range, Numba will use multiple threads. This is good for large arrays, but it may not be for small ones on some machines (due to the overhead of creating threads and sharing the work):

     array items |    time    |  speedup (dot/numba_par)
    -------------|------------|--------------------------
     3_000_000   |   465 µs   |   x1.06
     1_000_000   |    66 µs   |   x1.36
       250_000   |    10 µs   |   x4.00
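
    For reference, a minimal sketch of such a parallel variant (same assumptions as the sequential version above) could look like this:

    import numba as nb

    # Parallel variant: prange lets Numba split the loop across threads and
    # combine the per-thread partial sums into `s` (a parallel reduction).
    @nb.njit('float64(float64[::1],)', fastmath=True, parallel=True)
    def fast_sum_par(arr):
        s = 0.0
        for i in nb.prange(arr.size):
            s += arr[i]
        return s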
    

    As expected, Numba can be faster for small arrays (because the Numpy call overhead is mostly removed) and is competitive with OpenBLAS for large arrays. The code generated by Numba is pretty efficient:

    .LBB0_7:
            vaddpd  (%r9,%rdx,8), %ymm0, %ymm0
            vaddpd  32(%r9,%rdx,8), %ymm1, %ymm1
            vaddpd  64(%r9,%rdx,8), %ymm2, %ymm2
            vaddpd  96(%r9,%rdx,8), %ymm3, %ymm3
            vaddpd  128(%r9,%rdx,8), %ymm0, %ymm0
            vaddpd  160(%r9,%rdx,8), %ymm1, %ymm1
            vaddpd  192(%r9,%rdx,8), %ymm2, %ymm2
            vaddpd  224(%r9,%rdx,8), %ymm3, %ymm3
            vaddpd  256(%r9,%rdx,8), %ymm0, %ymm0
            vaddpd  288(%r9,%rdx,8), %ymm1, %ymm1
            vaddpd  320(%r9,%rdx,8), %ymm2, %ymm2
            vaddpd  352(%r9,%rdx,8), %ymm3, %ymm3
            vaddpd  384(%r9,%rdx,8), %ymm0, %ymm0
            vaddpd  416(%r9,%rdx,8), %ymm1, %ymm1
            vaddpd  448(%r9,%rdx,8), %ymm2, %ymm2
            vaddpd  480(%r9,%rdx,8), %ymm3, %ymm3
            addq    $64, %rdx
            addq    $-4, %r11
            jne     .LBB0_7
    

    That being said, it is not optimal: the LLVM-Lite JIT compiler uses a 4x unrolling, while an 8x unrolling should be optimal on my Intel Coffee Lake processor. Indeed, the latency of the vaddpd instruction is 4 cycles while 2 instructions can be executed per cycle, so 8 registers are needed to avoid stalls and a latency-bound loop. Note, however, that this assembly code is optimal on Intel Alder Lake and Sapphire Rapids processors, since their vaddpd latency is twice as low. Saturating the FMA SIMD processing units is far from easy. I think the only way to write a faster function is to write native (C/C++) code using SIMD intrinsics, though it is less portable.

    Note that the Numba code does not support special values like NaN or Inf because of fastmath (AFAIK OpenBLAS does). In practice, it should still work on x86-64 machines, but this is not guaranteed. Besides, the Numba code is not numerically stable for very large arrays. The Numpy code should be the most numerically stable of the three variants (followed by the OpenBLAS code). You can compute the sum by chunks to improve the numerical stability, though it makes the code more complex; a minimal sketch is shown below. There is no free lunch.
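
    Here is what such a chunked summation could look like (an illustrative sketch only, not benchmarked; the chunk size of 4096 is an arbitrary trade-off):

    import numba as nb

    @nb.njit('float64(float64[::1],)', fastmath=True)
    def chunked_sum(arr):
        # Sum fixed-size chunks with the fast fastmath loop, then accumulate
        # the per-chunk partial sums: this limits how much rounding error can
        # pile up in a single accumulator.
        chunk = 4096
        total = 0.0
        for start in range(0, arr.size, chunk):
            end = min(start + chunk, arr.size)
            s = 0.0
            for i in range(start, end):
                s += arr[i]
            total += s
        return total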
