The following two functions logically do the same thing, but func2, which stores the result in a temporary variable, is slower.
def func1(a):
    return a.astype(np.float64) / 255.0

def func2(a):
    t = a.astype(np.float64)
    return t / 255.0
Benchmark:
import sys
import timeit

import numpy as np

def func1(a):
    return a.astype(np.float64) / 255.0

def func2(a):
    t = a.astype(np.float64)
    return t / 255.0

def main():
    print(f"{np.__version__=}")
    print(f"{sys.version=}")
    size = 1_000_000
    n_repeats = 100
    n_epochs = 5
    a = np.random.randint(0, 256, size=size, dtype=np.uint8)
    for f in [func1, func2] * n_epochs:
        times = timeit.repeat(lambda: f(a), number=1, repeat=n_repeats)
        print(f"{f.__name__}: {min(times)}")

main()
Result:
np.__version__='1.26.4'
sys.version='3.12.4 (main, Jul 16 2024, 19:42:31) [GCC 11.4.0]'
func1: 0.0013509448617696762
func2: 0.0035865511745214462
func1: 0.0013513723388314247
func2: 0.0034992704167962074
func1: 0.0013509979471564293
func2: 0.003565799444913864
func1: 0.0013509783893823624
func2: 0.003563949838280678
func1: 0.0013510659337043762
func2: 0.003569650463759899
For a size of 1_000_000_000:
func1: 2.503432061523199
func2: 3.3956982269883156
func1: 2.503927574492991
func2: 3.393561664968729
func1: 2.5052043283358216
func2: 3.3980945963412523
func1: 2.503149318508804
func2: 3.39398608263582
func1: 2.5073573794215918
func2: 3.396817682310939
Although the relative difference has decreased, the absolute difference is now almost 1 second.
So, I assume there is some difference that affects the entire array, but I couldn’t figure out what it was.
Here is the bytecode for the two functions, as printed by the dis module.
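(For reference, dumps like the ones below can be regenerated with the standard dis module; a minimal sketch:)

import dis

dis.dis(func1)
dis.dis(func2)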
func1:

  8           0 RESUME                   0
  9           2 LOAD_FAST                0 (a)
              4 LOAD_ATTR                1 (NULL|self + astype)
             24 LOAD_GLOBAL              2 (np)
             34 LOAD_ATTR                4 (float64)
             54 CALL                     1
             62 LOAD_CONST               1 (255.0)
             64 BINARY_OP               11 (/)
             68 RETURN_VALUE

func2:

 12           0 RESUME                   0
 13           2 LOAD_FAST                0 (a)
              4 LOAD_ATTR                1 (NULL|self + astype)
             24 LOAD_GLOBAL              2 (np)
             34 LOAD_ATTR                4 (float64)
             54 CALL                     1
             62 STORE_FAST               1 (t)
 14          64 LOAD_FAST                1 (t)
             66 LOAD_CONST               1 (255.0)
             68 BINARY_OP               11 (/)
             72 RETURN_VALUE
As you can see, the only difference is the STORE_FAST and LOAD_FAST of t, which I thought would not affect the entire array. What am I misunderstanding?
Please note that I was only able to reproduce this on Ubuntu (probably GCC), and not on Windows.
I can’t tell whether it’s OS-dependent or hardware-dependent.
UPDATE:
As @no-comment pointed out in the comments, the first timing (i.e. the one taken immediately after the function switches) is noticeably different from the later timings. So I think we can assume that it is affected by the previous call.
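These numbers came from a small variation of the loop above that also prints the first few raw timings (a sketch, assuming the same setup as the earlier benchmark):

for f in [func1, func2] * n_epochs:
    times = timeit.repeat(lambda: f(a), number=1, repeat=n_repeats)
    print(f"{f.__name__}: {min(times)} {times[:5]=}")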
func1: 0.0014957496896386147 times[:5]=[0.003002448007464409, 0.002440572716295719, 0.0015078019350767136, 0.001591850072145462, 0.0014987429603934288]
func2: 0.0031856298446655273 times[:5]=[0.0031856298446655273, 0.0035560475662350655, 0.003505030646920204, 0.0035979915410280228, 0.0035021230578422546]
func1: 0.0014953771606087685 times[:5]=[0.002430872991681099, 0.0015014195814728737, 0.0014976952224969864, 0.0015024356544017792, 0.001497727818787098]
func2: 0.0031922338530421257 times[:5]=[0.0031922338530421257, 0.0035140514373779297, 0.003572382964193821, 0.003530893474817276, 0.003493628464639187]
func1: 0.0014957757666707039 times[:5]=[0.0024183401837944984, 0.001501905731856823, 0.0015066321939229965, 0.0015012985095381737, 0.0015163430944085121]
func2: 0.003169679082930088 times[:5]=[0.003169679082930088, 0.0035164309665560722, 0.0034868214279413223, 0.0034810323268175125, 0.003540727309882641]
func1: 0.0014958055689930916 times[:5]=[0.0024264072999358177, 0.0015287799760699272, 0.0014993362128734589, 0.001505572348833084, 0.001497725024819374]
func2: 0.003466635011136532 times[:5]=[0.0038485946133732796, 0.0054071033373475075, 0.0054869744926691055, 0.0034872666001319885, 0.003563432954251766]
func1: 0.0014960560947656631 times[:5]=[0.0024131694808602333, 0.001533709466457367, 0.0015180828049778938, 0.0015001073479652405, 0.0014980453997850418]
func2: 0.003209024667739868 times[:5]=[0.003209024667739868, 0.0037086624652147293, 0.0035465294495224953, 0.0034985439851880074, 0.00348656065762043]
UPDATE2:
I was able to confirm that func1 reuses memory for the division using the following method.
def show_address(a):
    print(np.byte_bounds(a))
    return a

def func1_with_hook(a):
    return show_address(show_address(a.astype(np.float64)) / 255.0)

def func2_with_hook(a):
    t = a.astype(np.float64)
    return show_address(show_address(t) / 255.0)
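Calling the hooked versions makes the reuse visible; a small usage sketch (the actual addresses vary from run to run):

a = np.random.randint(0, 256, size=1_000_000, dtype=np.uint8)
func1_with_hook(a)  # prints the same (low, high) byte bounds twice: buffer reused
func2_with_hook(a)  # prints two different buffers: a new allocation for the result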
However, I discovered something else in the process. What I tried to do was play around with memory outside of the functions and investigate how performance changes depending on whether memory (addresses) is reused or not. In the process, I realized that simply storing the result of the function in a variable makes func2 run just as fast as func1.
import time

def main():
    print(f"{np.__version__=}")
    print(f"{sys.version=}")
    print("-" * 50)
    size, n_repeats, n_epochs = 1_000_000, 100, 5
    a = np.random.randint(0, 256, size=size, dtype=np.uint8)
    for f in [func1, func2] * n_epochs:
        times = []
        epoch_started = time.perf_counter()
        for _ in range(n_repeats):
            started = time.perf_counter()
            # f(a)
            r = f(a)  # This makes func2 faster.
            times.append(time.perf_counter() - started)
        epoch_time = time.perf_counter() - epoch_started
        print(f"{f.__name__}: min={min(times):.6f} total={epoch_time:.4f} times={[round(t * 1000, 2) for t in times[:20]]}")

main()
func1: min=0.001484 total=0.1595 times=[4.68, 3.6, 2.95, 2.93, 1.62, 1.5, 1.52, 1.53, 1.7, 1.76, 1.57, 1.51, 1.5, 1.49, 1.57, 1.49, 1.51, 1.49, 1.5, 1.6]
func2: min=0.001506 total=0.1554 times=[2.89, 1.54, 1.53, 1.54, 1.51, 1.53, 1.54, 1.57, 1.6, 1.56, 1.54, 1.52, 1.52, 1.54, 1.54, 1.56, 1.55, 1.54, 1.56, 1.54]
func1: min=0.001485 total=0.1634 times=[2.01, 2.91, 1.51, 1.5, 1.5, 1.49, 1.5, 1.49, 1.5, 1.49, 1.5, 1.56, 1.56, 1.53, 1.52, 1.61, 1.64, 1.53, 1.5, 1.49]
func2: min=0.001516 total=0.1609 times=[2.94, 1.6, 1.53, 1.58, 1.53, 1.57, 1.59, 1.53, 1.54, 1.56, 1.57, 1.62, 1.55, 1.54, 1.55, 1.56, 1.58, 1.57, 1.58, 1.56]
func1: min=0.001485 total=0.1655 times=[1.93, 2.86, 1.52, 1.49, 1.49, 1.49, 1.5, 1.49, 1.52, 1.55, 1.56, 1.5, 1.5, 1.5, 1.5, 1.49, 1.51, 1.52, 1.51, 1.49]
func2: min=0.001507 total=0.1623 times=[2.87, 1.59, 1.56, 1.58, 1.58, 1.63, 1.66, 1.64, 1.64, 1.61, 1.59, 1.61, 1.58, 1.57, 1.59, 1.57, 1.55, 1.56, 1.56, 1.61]
func1: min=0.001485 total=0.1522 times=[1.93, 2.86, 1.51, 1.49, 1.5, 1.6, 1.52, 1.5, 1.5, 1.49, 1.5, 1.51, 1.5, 1.5, 1.5, 1.49, 1.51, 1.49, 1.51, 1.49]
func2: min=0.001506 total=0.1557 times=[2.88, 1.54, 1.53, 1.52, 1.51, 1.52, 1.51, 1.51, 1.52, 1.52, 1.59, 1.58, 1.59, 1.59, 1.54, 1.55, 1.52, 1.53, 1.53, 1.52]
func1: min=0.001486 total=0.1538 times=[1.89, 2.93, 1.62, 1.52, 1.51, 1.52, 1.52, 1.51, 1.51, 1.49, 1.5, 1.49, 1.5, 1.5, 1.5, 1.49, 1.5, 1.5, 1.51, 1.51]
func2: min=0.001506 total=0.1566 times=[2.89, 1.56, 1.53, 1.52, 1.55, 1.57, 1.58, 1.54, 1.53, 1.55, 1.54, 1.53, 1.53, 1.58, 1.62, 1.61, 1.6, 1.6, 1.58, 1.59]
Please note that I also measured the runtime of the entire loop (see total above). Storing the result in r changes when the previous result is deallocated, but that deallocation should still fall within the measured section.
This result brought me back to square one. It is almost certainly a memory-management issue, but which specific factor causes it is quite confusing.
2 Answers
Look at the bytecode generated for the first function versus the second (the dumps are shown in the question above). In the second case, there is an unnecessary STORE_FAST followed immediately by LOAD_FAST, which the compiler does not optimize away. Instead of just using the result of a.astype, which is already on the stack, it first removes it from the stack, then immediately puts it back on the stack for use by BINARY_OP 11.
TL;DR: the performance gap is due to the combination of two complex effects. On one hand, the default glibc allocator tends to give memory back to the OS and request it again, causing expensive page faults. On the other hand, Numpy appears to optimize the division so that it is performed in-place rather than out-of-place.
First things first, I can reproduce this on a Linux machine with CPython 3.10.12 and Numpy 1.26.4, but not on a Windows machine with CPython 3.8.1 and Numpy 1.24.3. The two machines are different (an AMD Zen 2 laptop vs. an Intel Coffee Lake desktop). But there is a catch on Linux: the second time, I got different results because I had done a lot of Numpy-related work in the interpreter beforehand (no difference between the two functions). When I ran the benchmark again, I got the same result; then I restarted the interpreter and got very different results again (a significant, reproducible difference). See the results:
With low-level profiling, we can see that the two functions do about the same computation overall. Thus, we can safely assume that the main source of the performance gap is memory management: more specifically, things like the allocation policy, array memory alignment, and OS paging.
There is a simple way to check that: run the benchmark with allocators other than glibc's default one on Linux. For example, tcmalloc tends to request very large chunks of memory from the OS and to be conservative (i.e. it does not give the memory back to the OS immediately and recycles memory as much as possible). Large memory arrays are allocated with huge pages so as to reduce TLB issues (i.e. misses in the virtual-to-physical address-translation cache) and page-fault overheads (i.e. mapping a virtual page to a physical device when the target page is first touched). Here are (reproducible) results with tcmalloc:
We can clearly see that tcmalloc massively reduces the performance gap.
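If you want to try this yourself, one common way to swap the allocator is to preload libtcmalloc before the interpreter starts. A minimal sketch (the library path and the bench.py filename are assumptions; adjust them to your system):

import os
import subprocess
import sys

# Hypothetical path: adjust to wherever libtcmalloc is installed.
env = dict(os.environ, LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4")
subprocess.run([sys.executable, "bench.py"], env=env, check=True)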
Low-level profiling information shows that the number of page faults goes from 673_144 without tcmalloc to 18_111 with it. The number of TLB misses also goes from 15% to 7% of the loads. All of this tends to indicate that the requested memory is given back to the OS, so reallocating it causes expensive brk/mmap system calls (responsible for requesting more memory on Linux) and/or expensive page faults later.
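As a side note, you can watch the process-wide page-fault counters from Python itself on Unix via the standard resource module; a small sketch (assuming func2 and a from the question are in scope):

import resource

def page_faults():
    # (minor, major) page faults incurred by this process so far
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_minflt, ru.ru_majflt

minor0, major0 = page_faults()
for _ in range(100):
    func2(a)
minor1, major1 = page_faults()
print(f"minor={minor1 - minor0} major={major1 - major0}")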
With strace, we can see that there are 1500 brk system calls without tcmalloc (during only the actual computation) while there are 6 of them with tcmalloc. Based on the number of madvise system calls (always ~1500), we can state that the number of allocations is 1500. Also, please note that the addresses of the allocated pages are about the same during the benchmark both with and without tcmalloc, which means the OS recycles virtual memory pages.
This means the brk system calls of the (default) glibc allocator on Linux are the main source of overhead. brk increases the size of the heap so that this additional memory space can be reserved for the Numpy arrays. The fact that the memory addresses are reused also shows that the heap memory goes back and forth. Releasing the virtual address space forces the OS to flush the associated entries from the TLBs (dTLB+sTLB). Once a TLB entry has been evicted, the virtual memory needs to be mapped to physical memory again, causing a page fault when the Numpy array is first touched. Here are the overheads! With a more conservative allocator like tcmalloc, this does not happen often (hence fewer TLB misses and far fewer page faults in the low-level profiling results).
The second function is more expensive than the first because of a pathological threshold issue in the allocator: the second allocates a slightly different amount of memory than the first (just a bit more). We can observe this by simply reducing the size of the allocated array from 1_000_000 to 950_000 (using the default glibc allocator):
Note that this test is fragile, though (it is still dependent on the previous allocations). Another similar test consists of using variable sizes close to 1M items.
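Such a sweep could look like the following (a sketch reusing the question's func1/func2 and setup; the exact sizes are arbitrary):

import timeit
import numpy as np

for size in (900_000, 950_000, 1_000_000, 1_050_000):
    a = np.random.randint(0, 256, size=size, dtype=np.uint8)
    for f in (func1, func2):
        times = timeit.repeat(lambda: f(a), number=1, repeat=100)
        print(f"{size=} {f.__name__}: {min(times):.6f}")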
A hysteresis can help to mitigate the kind of issue that allocating temporary memory might cause; this is what tcmalloc tends to do. The side effect is a higher memory footprint, possibly much higher so as to be safe. Neither the allocator nor the OS knows the allocation pattern of the application, and they do not learn it, so they can repeat the same inefficient action over and over, thousands of times in this benchmark. Note that this explains why I got different results on the same platform when I had played with Numpy a bit before in the interpreter: some memory was already allocated, preventing this hysteresis-related issue.
Still, the second function seems a bit slower than the first (5-6%). This is simply because func1 performs an in-place division while func2 performs an out-of-place division. Out-of-place operations are slower on x86-64 CPUs (typically because of the write-allocate policy of CPU caches, but also because more data is needed in the cache, putting more pressure on it and resulting in more misses; this is confirmed by a low-level profiler). This is surprising, since we do not expect Numpy to recycle the array based on whether the CPython object is a temporary one (referencing a view that itself references an orphan memory buffer) or not. However, this is a great optimization as far as (only) performance is concerned. I confirmed it using a (conditional) breakpoint on malloc with GDB: func1 allocates 1 memory buffer of 8_000_000 bytes (for input arrays having 1_000_000 items) while func2 allocates 2 buffers of this size. We can manually reuse the buffer ourselves if needed, thanks to the out parameter:
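The original snippet is not shown here, but a minimal sketch of that idea looks like this (func2_reuse is a hypothetical name; np.divide writes into the array passed as out instead of allocating a new one):

import numpy as np

def func2_reuse(a):
    t = a.astype(np.float64)
    # Write the result back into t's buffer: only one 8 MB
    # allocation for a 1_000_000-item input, like func1.
    np.divide(t, 255.0, out=t)
    return t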