
The following two functions logically do the same thing, but func2, which stores the result in a temporary variable, is slower.

def func1(a):
    return a.astype(np.float64) / 255.0


def func2(a):
    t = a.astype(np.float64)
    return t / 255.0

Benchmark:

import sys
import timeit

import numpy as np


def func1(a):
    return a.astype(np.float64) / 255.0


def func2(a):
    t = a.astype(np.float64)
    return t / 255.0


def main():
    print(f"{np.__version__=}")
    print(f"{sys.version=}")

    size = 1_000_000
    n_repeats = 100
    n_epochs = 5
    a = np.random.randint(0, 256, size=size, dtype=np.uint8)

    for f in [func1, func2] * n_epochs:
        times = timeit.repeat(lambda: f(a), number=1, repeat=n_repeats)
        print(f"{f.__name__}: {min(times)}")


main()

Result:

np.__version__='1.26.4'
sys.version='3.12.4 (main, Jul 16 2024, 19:42:31) [GCC 11.4.0]'
func1: 0.0013509448617696762
func2: 0.0035865511745214462
func1: 0.0013513723388314247
func2: 0.0034992704167962074
func1: 0.0013509979471564293
func2: 0.003565799444913864
func1: 0.0013509783893823624
func2: 0.003563949838280678
func1: 0.0013510659337043762
func2: 0.003569650463759899

For a size of 1_000_000_000:

func1: 2.503432061523199
func2: 3.3956982269883156
func1: 2.503927574492991
func2: 3.393561664968729
func1: 2.5052043283358216
func2: 3.3980945963412523
func1: 2.503149318508804
func2: 3.39398608263582
func1: 2.5073573794215918
func2: 3.396817682310939

Although the relative difference has decreased, the absolute difference is now almost 1 second.
So I assume the cause is something that affects the entire array, but I couldn't figure out what it is.

Here is the bytecode for the two functions, as printed by the dis module.

  8           0 RESUME                   0

  9           2 LOAD_FAST                0 (a)
              4 LOAD_ATTR                1 (NULL|self + astype)
             24 LOAD_GLOBAL              2 (np)
             34 LOAD_ATTR                4 (float64)
             54 CALL                     1
             62 LOAD_CONST               1 (255.0)
             64 BINARY_OP               11 (/)
             68 RETURN_VALUE

 12           0 RESUME                   0

 13           2 LOAD_FAST                0 (a)
              4 LOAD_ATTR                1 (NULL|self + astype)
             24 LOAD_GLOBAL              2 (np)
             34 LOAD_ATTR                4 (float64)
             54 CALL                     1
             62 STORE_FAST               1 (t)

 14          64 LOAD_FAST                1 (t)
             66 LOAD_CONST               1 (255.0)
             68 BINARY_OP               11 (/)
             72 RETURN_VALUE

As you can see, the only difference is STORE_FAST and LOAD_FAST of t, which I thought would not affect the entire array. What am I misunderstanding?

Please note that I was only able to reproduce this on Ubuntu (probably GCC), and not on Windows.
I can’t tell whether it’s OS-dependent or hardware-dependent.


UPDATE:

As @no-comment pointed out in the comments, the first timing (i.e. the one taken immediately after switching functions) is noticeably different from the later timings. So I think we can assume it is affected by the previous call.
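
The loop is the same as above, with only the print changed to also show the first few samples, roughly like this:

for f in [func1, func2] * n_epochs:
    times = timeit.repeat(lambda: f(a), number=1, repeat=n_repeats)
    print(f"{f.__name__}: {min(times)} {times[:5]=}")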

func1: 0.0014957496896386147 times[:5]=[0.003002448007464409, 0.002440572716295719, 0.0015078019350767136, 0.001591850072145462, 0.0014987429603934288]
func2: 0.0031856298446655273 times[:5]=[0.0031856298446655273, 0.0035560475662350655, 0.003505030646920204, 0.0035979915410280228, 0.0035021230578422546]
func1: 0.0014953771606087685 times[:5]=[0.002430872991681099, 0.0015014195814728737, 0.0014976952224969864, 0.0015024356544017792, 0.001497727818787098]
func2: 0.0031922338530421257 times[:5]=[0.0031922338530421257, 0.0035140514373779297, 0.003572382964193821, 0.003530893474817276, 0.003493628464639187]
func1: 0.0014957757666707039 times[:5]=[0.0024183401837944984, 0.001501905731856823, 0.0015066321939229965, 0.0015012985095381737, 0.0015163430944085121]
func2: 0.003169679082930088 times[:5]=[0.003169679082930088, 0.0035164309665560722, 0.0034868214279413223, 0.0034810323268175125, 0.003540727309882641]
func1: 0.0014958055689930916 times[:5]=[0.0024264072999358177, 0.0015287799760699272, 0.0014993362128734589, 0.001505572348833084, 0.001497725024819374]
func2: 0.003466635011136532 times[:5]=[0.0038485946133732796, 0.0054071033373475075, 0.0054869744926691055, 0.0034872666001319885, 0.003563432954251766]
func1: 0.0014960560947656631 times[:5]=[0.0024131694808602333, 0.001533709466457367, 0.0015180828049778938, 0.0015001073479652405, 0.0014980453997850418]
func2: 0.003209024667739868 times[:5]=[0.003209024667739868, 0.0037086624652147293, 0.0035465294495224953, 0.0034985439851880074, 0.00348656065762043]

UPDATE2:

I was able to confirm that func1 reuses the memory of the intermediate array for the division, using the following hooks.

def show_address(a):
    print(np.byte_bounds(a))
    return a


def func1_with_hook(a):
    return show_address(show_address(a.astype(np.float64)) / 255.0)


def func2_with_hook(a):
    t = a.astype(np.float64)
    return show_address(show_address(t) / 255.0)
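
Calling the hooked functions makes the reuse visible (reusing the array a from the benchmark): for func1_with_hook the two printed byte ranges are expected to be identical, because the division writes into the buffer created by astype, while for func2_with_hook two different buffers are printed.

func1_with_hook(a)  # expected: the same (start, end) bounds printed twice
func2_with_hook(a)  # expected: two different byte ranges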

However, I discovered something else in the process.

What I tried to do was play around with memory outside the functions and investigate how performance changes depending on whether memory (addresses) is reused or not. While doing that, I realized that simply storing the result of the function in a variable makes func2 run just as fast as func1.

import time  # needed for time.perf_counter() below


def main():
    print(f"{np.__version__=}")
    print(f"{sys.version=}")
    print("-" * 50)

    size, n_repeats, n_epochs = 1_000_000, 100, 5
    a = np.random.randint(0, 256, size=size, dtype=np.uint8)

    for f in [func1, func2] * n_epochs:
        times = []
        epoch_started = time.perf_counter()
        for _ in range(n_repeats):
            started = time.perf_counter()
            # f(a)
            r = f(a)  # This makes func2 faster.
            times.append(time.perf_counter() - started)
        epoch_time = time.perf_counter() - epoch_started
        print(f"{f.__name__}: min={min(times):.6f} total={epoch_time:.4f} times={[round(t * 1000, 2) for t in times[:20]]}")


main()

Result:

func1: min=0.001484 total=0.1595 times=[4.68, 3.6, 2.95, 2.93, 1.62, 1.5, 1.52, 1.53, 1.7, 1.76, 1.57, 1.51, 1.5, 1.49, 1.57, 1.49, 1.51, 1.49, 1.5, 1.6]
func2: min=0.001506 total=0.1554 times=[2.89, 1.54, 1.53, 1.54, 1.51, 1.53, 1.54, 1.57, 1.6, 1.56, 1.54, 1.52, 1.52, 1.54, 1.54, 1.56, 1.55, 1.54, 1.56, 1.54]
func1: min=0.001485 total=0.1634 times=[2.01, 2.91, 1.51, 1.5, 1.5, 1.49, 1.5, 1.49, 1.5, 1.49, 1.5, 1.56, 1.56, 1.53, 1.52, 1.61, 1.64, 1.53, 1.5, 1.49]
func2: min=0.001516 total=0.1609 times=[2.94, 1.6, 1.53, 1.58, 1.53, 1.57, 1.59, 1.53, 1.54, 1.56, 1.57, 1.62, 1.55, 1.54, 1.55, 1.56, 1.58, 1.57, 1.58, 1.56]
func1: min=0.001485 total=0.1655 times=[1.93, 2.86, 1.52, 1.49, 1.49, 1.49, 1.5, 1.49, 1.52, 1.55, 1.56, 1.5, 1.5, 1.5, 1.5, 1.49, 1.51, 1.52, 1.51, 1.49]
func2: min=0.001507 total=0.1623 times=[2.87, 1.59, 1.56, 1.58, 1.58, 1.63, 1.66, 1.64, 1.64, 1.61, 1.59, 1.61, 1.58, 1.57, 1.59, 1.57, 1.55, 1.56, 1.56, 1.61]
func1: min=0.001485 total=0.1522 times=[1.93, 2.86, 1.51, 1.49, 1.5, 1.6, 1.52, 1.5, 1.5, 1.49, 1.5, 1.51, 1.5, 1.5, 1.5, 1.49, 1.51, 1.49, 1.51, 1.49]
func2: min=0.001506 total=0.1557 times=[2.88, 1.54, 1.53, 1.52, 1.51, 1.52, 1.51, 1.51, 1.52, 1.52, 1.59, 1.58, 1.59, 1.59, 1.54, 1.55, 1.52, 1.53, 1.53, 1.52]
func1: min=0.001486 total=0.1538 times=[1.89, 2.93, 1.62, 1.52, 1.51, 1.52, 1.52, 1.51, 1.51, 1.49, 1.5, 1.49, 1.5, 1.5, 1.5, 1.49, 1.5, 1.5, 1.51, 1.51]
func2: min=0.001506 total=0.1566 times=[2.89, 1.56, 1.53, 1.52, 1.55, 1.57, 1.58, 1.54, 1.53, 1.55, 1.54, 1.53, 1.53, 1.58, 1.62, 1.61, 1.6, 1.6, 1.58, 1.59]

Please note that I also measured the runtime of the entire loop (see total= above). Keeping the result in r changes when the previous array is deallocated, but that deallocation should still happen inside the measured section.

This result brought me back to square one. It is almost certainly a memory management issue, but which specific factor causes it is quite confusing.

2 Answers


  1. Look at the byte code generated for the first:

    >>> dis.dis("""def func1(a):
    ...     return a.astype(np.float64) / 255.0
    ... """)
    [...]
    Disassembly of <code object func1 at 0x105979790, file "<dis>", line 1>:
      1           0 RESUME                   0
    
      2           2 LOAD_FAST                0 (a)
                  4 LOAD_ATTR                1 (NULL|self + astype)
                 24 LOAD_GLOBAL              2 (np)
                 34 LOAD_ATTR                4 (float64)
                 54 CALL                     1
                 62 LOAD_CONST               1 (255.0)
                 64 BINARY_OP               11 (/)
                 68 RETURN_VALUE
    

    versus the second

    >>> dis.dis("""def func2(a):
    ...     t = a.astype(np.float64)
    ...     return t / 255.0
    ... """)
    [...]
    
    Disassembly of <code object func2 at 0x105979790, file "<dis>", line 1>:
      1           0 RESUME                   0
    
      2           2 LOAD_FAST                0 (a)
                  4 LOAD_ATTR                1 (NULL|self + astype)
                 24 LOAD_GLOBAL              2 (np)
                 34 LOAD_ATTR                4 (float64)
                 54 CALL                     1
                 62 STORE_FAST               1 (t)
    
      3          64 LOAD_FAST                1 (t)
                 66 LOAD_CONST               1 (255.0)
                 68 BINARY_OP               11 (/)
                 72 RETURN_VALUE
    

    In the second case, there is an unnecessary STORE_FAST followed immediately by LOAD_FAST, which the compiler does not optimize away. Instead of just using the result of a.astype, which is already on the stack, it first pops it off the stack and then immediately pushes it back for use by BINARY_OP 11 (/).

  2. TL;DR: the performance gap is due to the combination of two complex effects. On one hand, the default glibc allocator tends to give memory back to the OS and request it again, causing expensive page faults. On the other hand, Numpy seems to optimize the division so as to perform it in-place rather than out-of-place.


    First things first, I can reproduce this on a Linux machine with CPython 3.10.12 and Numpy 1.26.4, but not on a Windows machine with CPython 3.8.1 and Numpy 1.24.3. The two machines are different (an AMD Zen 2 laptop vs. an Intel Coffee Lake desktop). But there is a catch on Linux: the second time, after doing a lot of Numpy-related work in the interpreter beforehand, I got different results (no difference between the two functions). When I ran the benchmark again I got the same result, then I restarted the interpreter and got very different results again (a significant, reproducible difference). See the results:

    Windows:
    func1: 0.003882699995301664
    func2: 0.0038834000006318092
    func1: 0.00391739999759011
    func2: 0.00342839999939315
    func1: 0.003885100013576448
    func2: 0.003879600000800565
    func1: 0.003912400017725304
    func2: 0.004038300015963614
    func1: 0.0034078999888151884
    func2: 0.0038822999922558665
    
    Linux (usual results when the interpreter is newly started):
    func1: 0.00166860299941618
    func2: 0.0040213450010924134
    func1: 0.0016699560001143254
    func2: 0.004175404999841703
    func1: 0.0016712579999875743
    func2: 0.004019580999738537
    func1: 0.0016698750005161855
    func2: 0.004137433999858331
    func1: 0.001662620998104103
    func2: 0.004015283997432562
    
    Linux (special run):
    func1: 0.0015893029994913377
    func2: 0.0017426319973310456
    func1: 0.0016299500020977575
    func2: 0.001744294997479301
    func1: 0.0016132879973156378
    func2: 0.0017438350005249958
    func1: 0.0016315829998347908
    func2: 0.0017430830012017395
    func1: 0.0016161149978870526
    func2: 0.0017346970016660634
    

    In low-level profiling, we can see that the two functions do about the same computation overall. Thus, we can safely assume that the main source of the performance gap is memory management: more specifically, things like the allocation policy, array memory alignment, and OS paging.

    There is a simple way to check that: run the benchmark with allocators other than the glibc default on Linux. For example, tcmalloc tends to request very large chunks of memory from the OS and to be conservative (i.e. it does not give the memory back to the OS right away and recycles memory as much as possible). Large memory arrays are allocated with huge pages so as to reduce TLB issues (i.e. virtual-to-physical address translation cache misses) and page-fault overheads (i.e. the mapping of a virtual page to a physical device on the first touch of the target page). Here are (reproducible) results with tcmalloc:

    func1: 0.0016448189999209717
    func2: 0.0017599449965928216
    func1: 0.0016217849988606758
    func2: 0.0017576309983269311
    func1: 0.0016758569981902838
    func2: 0.0017460189992561936
    func1: 0.0016515909992449451
    func2: 0.0017496559994469862
    func1: 0.0016559599971515127
    func2: 0.0017453069995099213
    

    We can clearly see that tcmalloc massively reduces the performance gap.
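
    For reference, here is a minimal sketch of how such a run can be launched from Python; the tcmalloc library path and the bench.py file name are assumptions (they depend on the distribution and on how the benchmark above was saved):

    import os
    import subprocess

    # Preload tcmalloc so it replaces the default glibc allocator in the child process.
    env = dict(os.environ, LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4")
    subprocess.run(["python3", "bench.py"], env=env, check=True)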

    Low-level profiling information shows that the number of page faults goes from 673_144 without tcmalloc to 18_111 with it. The number of TLB misses also drops from 15% to 7% of the loads. All of this indicates that the requested memory is given back to the OS, so reallocating it causes expensive brk/mmap system calls (responsible for requesting more memory on Linux) and/or expensive page faults later.
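
    One way to observe the page faults from Python itself is the resource module (Linux-specific; a rough sketch, not the profiler used above):

    import resource

    def minor_faults(fn, *args):
        # ru_minflt is the cumulative number of minor (soft) page faults of this process
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        fn(*args)
        return resource.getrusage(resource.RUSAGE_SELF).ru_minflt - before

    # Under the default glibc allocator, one would expect minor_faults(func2, a)
    # to report noticeably more faults per call than minor_faults(func1, a).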

    With strace, we can see that there are about 1500 brk system calls without tcmalloc (during only the actual computation) while there are 6 with tcmalloc. Based on the number of madvise system calls (always ~1500), we can state that the number of allocations is about 1500. Also please note that the addresses of the allocated pages are about the same during the benchmark both with and without tcmalloc, which means the OS recycles virtual memory pages.

    This means the brk system calls of the (default) glibc allocator on Linux are the main source of overhead. brk grows or shrinks the heap so that additional memory space can be reserved for the Numpy arrays. The fact that the memory addresses are reused also shows that the heap memory goes back and forth. Releasing the virtual address space forces the OS to flush the associated entries from the TLBs (dTLB+sTLB). Once a TLB entry has been evicted, the virtual memory needs to be mapped to physical memory again, causing a page fault on the first touch of the Numpy array. Here is the overhead! With a more conservative allocator, like tcmalloc, this does not happen often (hence fewer TLB misses and far fewer page faults in the low-level profiling results).

    The second function is more expensive than the first because of a pathological threshold issue of the allocator: the second allocates a slightly different amount of memory than the first (just a bit more). We can observe that by simply reducing the size of the allocated array from 1_000_000 to 950_000 (using the default glibc allocator):

    func1: 0.0016297900001518428
    func2: 0.0017342270002700388
    func1: 0.0016084500020951964
    func2: 0.0017329740003333427
    func1: 0.001592951000930043
    func2: 0.0017325240005447995
    func1: 0.0016308329977618996
    func2: 0.0017360409983666614
    func1: 0.0016061759997683112
    func2: 0.001734637000481598
    

    Note that this test is fragile though (it still depends on the previous allocations). Another similar test consists of using variable sizes close to 1M items.
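
    The following is a rough sketch of that variable-size variant (my reconstruction, reusing func1/func2 and the numpy/timeit imports from the question; the exact sizes are arbitrary):

    for size in (900_000, 950_000, 1_000_000, 1_050_000, 1_100_000):
        a = np.random.randint(0, 256, size=size, dtype=np.uint8)
        for f in (func1, func2):
            times = timeit.repeat(lambda: f(a), number=1, repeat=100)
            print(f"{size=} {f.__name__}: min={min(times):.6f}")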

    A hysteresis (keeping freed memory around for a while) can help mitigate the kind of issue that allocating temporary memory can cause. This is what tcmalloc tends to do. The side effect is a higher memory footprint, possibly much higher, in order to be safe. The allocator does not know the allocation pattern of the application, nor does the OS, and neither learns it, so they can repeat the same inefficient action over and over, thousands of times in this benchmark. Note that this explains why I got different results on the same platform after playing with Numpy a bit in the interpreter beforehand: some memory had already been allocated, preventing this hysteresis-related issue.
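
    A related knob (my addition, untested here) is glibc's own trim threshold: setting it very high keeps the freed top of the heap instead of returning it with brk, which mimics the hysteresis described above. It has to be set in the environment before Python starts, for example by launching the benchmark the same way as in the tcmalloc sketch:

    import os
    import subprocess

    # MALLOC_TRIM_THRESHOLD_ (note the trailing underscore) is read by glibc at startup.
    env = dict(os.environ, MALLOC_TRIM_THRESHOLD_=str(1 << 30))
    subprocess.run(["python3", "bench.py"], env=env, check=True)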

    Still, the second function seems a bit slower than the first (5-6%). This is simply because func1 performs an in-place division while func2 performs an out-of-place division. Out-of-place operations are slower on x86-64 CPUs (typically because of the write-allocate policy of CPU caches, but also because more data is needed in the cache, putting more pressure on it and resulting in more misses, which is confirmed with a low-level profiler). This is surprising since we do not expect Numpy to recycle the array based on whether the CPython object is a temporary one (an object referencing a view which itself references an otherwise orphan memory buffer) or not. However, it is a great optimization as far as (only) performance is concerned. I confirmed this using a (conditional) breakpoint on malloc with GDB: func1 allocates 1 memory buffer of 8_000_000 bytes (for an input array of 1_000_000 items) while func2 allocates 2 buffers of this size. We can manually reuse the buffer ourselves if needed thanks to the out parameter:

    def func3(a):
        t = a.astype(np.float64)
        return np.divide(t, 255.0, out=t)
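
    As a quick sanity check (my addition, using the same byte_bounds idea as UPDATE2 in the question, and reusing the array a from the benchmark), func3 should not allocate a second float64 buffer:

    t = a.astype(np.float64)
    r = np.divide(t, 255.0, out=t)
    print(r is t)             # True: np.divide returns the array passed as out
    print(np.byte_bounds(r))  # same buffer as t; no extra 8 MB allocation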
    