
Multi-threading in CPython cannot use more than one CPU in parallel because of the existence of the GIL. To break this limitation, we can use multiprocessing. I'm writing Python code to demonstrate that. Here is my code:

from math import sqrt
from time import time
from threading import Thread
from multiprocessing import Process


def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume}")
        return wrapper
    return deco


def calc_sqrt():
    """Consume the CPU"""
    i = 2147483647
    for j in range(20 * 1000 * 1000):
        i -= 1
        sqrt(i)


@time_recorder("one by one")
def one_by_one():
    for _ in range(8):
        calc_sqrt()


@time_recorder("multi-threading")
def multi_thread():
    t_list = list()
    for i in range(8):
        t = Thread(name=f'worker-{i}', target=calc_sqrt)
        t.start()
        t_list.append(t)
    for t in t_list:
        t.join()


@time_recorder("multi-processing")
def multi_process():
    p_list = list()
    for i in range(8):
        p = Process(name=f"worker-{i}", target=calc_sqrt)
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()


def main():
    one_by_one()

    print('-' * 40)
    multi_thread()

    print('-' * 40)
    multi_process()


if __name__ == '__main__':
    main()

The function "calc_sqrt()" is the CPU-consuming job, which calculates a square root 20 million times. The decorator "time_recorder" measures the running time of the decorated functions. And there are 3 functions which run the CPU-consuming job one by one, in multiple threads, and in multiple processes respectively.

By running the above code on my laptop, I got the following output:

Run one by one
Time consumption of one by one: 39.31295585632324
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 39.36112403869629
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 23.380358457565308

The time consumption of "one_by_one()" and "multi_thread()" is almost the same, which is as expected. But the time consumption of "multi_process()" is a little bit confusing. My laptop has an Intel Core i5-7300U CPU, which has 2 cores and 4 threads. Task manager simply shows that there are 4 (logical) CPUs in my computer. Task manager also shows that the usage of all 4 CPUs is 100% during the execution. But the processing time didn't reduce to 1/4 but rather only to 1/2. Why? The operating system of my laptop is Windows 10 64-bit.
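For reference, Python itself reports the same count as Task Manager. `os.cpu_count()` returns the number of logical CPUs (hardware threads), not physical cores:

```python
# os.cpu_count() counts logical CPUs (hardware threads), matching the
# 4 "CPUs" Task Manager shows, not the 2 physical cores of the i5-7300U.
import os

print(os.cpu_count())
```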

Later, I tried this program on a Linux virtual machine, and got the following output, which is more reasonable:

Run one by one
Time consumption of one by one: 33.78603768348694
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 34.396817684173584
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 8.470374584197998

This time, the processing time with multi-processing reduced to 1/4 of that with multi-threading. The host of this Linux VM is equipped with an Intel Xeon E5-2670, which has 8 cores and 16 threads. The host OS is CentOS 7. The VM is assigned 4 vCPUs and its OS is Debian 10.

The questions are:

  • Why didn't the processing time of the multi-processing job reduce to 1/4, but rather to just 1/2, on my laptop?
  • Is it a CPU issue? That is, are the 4 threads of the Intel Core i5-7300U not "really parallel" and may impact each other, while the Intel Xeon E5-2670 doesn't have that issue?
  • Or is it an OS issue? That is, does Windows 10 not support multi-processing well, so that processes may impact each other when running in parallel?


Answers


  1. As said by @Pingu in the comments, the speed gain very much depends on the number of cores of your machine. Your machine only has two physical cores (4 hardware threads, also called logical cores), which are probably partly occupied by your OS's threads. Not only will a machine with more cores likely be more performant at multiprocessing, but its OS bookkeeping will also occupy a smaller share of total CPU and have a less significant impact on performance.

    Here is an implementation of your test code that lets you change the number of threads/processes on which to execute N_TASKS concurrent calls to calc_sqrt:

    from math import sqrt
    from time import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
    
    
    N_WORKERS = 8
    N_TASKS = 32
    
    def time_recorder(job_name):
        """Record time consumption of running a function"""
        def deco(func):
            def wrapper(*args, **kwargs):
                print(f"Run {job_name}")
                start_epoch = time()
                out = func(*args, **kwargs)
                end_epoch = time()
                time_consume = end_epoch - start_epoch
                print(f"Time consumption of {job_name}: {time_consume:.6}s")
                return out
            return wrapper
        return deco
    
    
    def calc_sqrt(_):
        i = 2147483647
        for _ in range(5 * 1000 * 1000):
            i -= 1
            sqrt(i)
    
    
    @time_recorder("one by one")
    def one_by_one():
        _ = [calc_sqrt(_) for _ in range(N_TASKS)]
    
    
    @time_recorder("multi-threading")
    def multi_thread():
        with ThreadPoolExecutor(max_workers=N_WORKERS) as e:
            _ = e.map(calc_sqrt, range(N_TASKS))
    
    
    @time_recorder("multi-processing")
    def multi_process():
        with ProcessPoolExecutor(max_workers=N_WORKERS) as e:
            _ = e.map(calc_sqrt, range(N_TASKS), chunksize=1)
    
    
    def main():
        one_by_one()
    
        print('-' * 40)
        multi_thread()
    
        print('-' * 40)
        multi_process()
    
    
    if __name__ == '__main__':
        main()
    

    On my machine (M1 Pro MacBook Pro 14"), here are the approximate timings for different numbers of threads/processes:

    # threads/processes | Sequential | Multithreading | Multiprocessing
    1                   | 10s        | 10s            | 10s
    2                   | 10s        | 10s            | 5.5s
    4                   | 10s        | 10s            | 2.8s
    6                   | 10s        | 10s            | 2.2s
    8                   | 10s        | 10s            | 1.8s
    10                  | 10s        | 10s            | 1.8s
    12                  | 10s        | 10s            | 1.8s

    As you can see, in the multiprocessing variant the performance is quite proportional to the number of cores. This is roughly the behavior you observed on your machines: nearly a 2x performance gain on your 2-core machine and almost 4x on the 4-core one.

    You can observe a saturation at 8 cores (there is no improvement with 10 concurrent processes), which indicates that my machine likely has 8 physical cores.

    Note that there is a difference between a CPU's physical cores and its hardware threads (the latter enabled by hyper-threading). The Core i5-7300U has 4 hardware threads, but this is not equivalent to a 4-(physical-)core machine. Hyper-threading can improve a CPU's multiprocessing capability, but the gain is generally lower than that of adding more physical cores. For instance, Intel claims a 15% to 30% performance gain from hyper-threading, which is far from the 2x gain you might imagine when reading "2 cores / 4 threads" on the CPU specs.
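    You can query this distinction from Python directly. Here is a small sketch; the physical-core count assumes the third-party psutil package is installed:

    ```python
    # Logical CPUs (hardware threads) vs. physical cores.
    import os

    print("logical CPUs:", os.cpu_count())  # counts hardware threads

    try:
        import psutil  # third-party: pip install psutil
        print("physical cores:", psutil.cpu_count(logical=False))
    except ImportError:
        print("physical cores: psutil not installed")
    ```

    On an i5-7300U this reports 4 logical CPUs but only 2 physical cores.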

  2. The parallelism between two logical cores sharing a physical core is complicated; see https://en.wikipedia.org/wiki/Simultaneous_multithreading (Intel brands its implementation of SMT as "hyper-threading"). There's only one set of execution units, and the front-end alternates cycles between threads when both aren't stalled. Out-of-order exec in the back-end does happen on uops from both logical cores (confusingly called "hardware threads") at the same time.

    If you wrote this in C, you'd get no benefit from hyper-threading for square roots, since the FP div/sqrt unit is slow compared to everything else and very easy for one thread to max out. (This assumes it compiles to a loop doing cvtsi2sd and sqrtsd, i.e. double-precision square root, which has plenty of precision here.) Unlike most instructions, division (and square root) isn't fully pipelined on modern CPUs: the execution unit can't start working on a new one every clock cycle. And there's only one such execution unit on your Kaby Lake CPU.

    But this is Python; it's taking about 40 seconds for 8 x 20M square roots on a single core, i.e. roughly 250 nanoseconds per sqrt. At 3.5GHz, that's on the order of 860 clock cycles per square root, vs. an average throughput of one per 4.5 cycles for the sqrtsd asm instruction on Kaby Lake (https://uops.info/, check the SSE or AVX instruction set).

    So either interpreter overhead is even larger than usual, or it's doing some fancy integer square-root thing instead of taking advantage of double for small-enough integers (Python integers are arbitrary-precision). So it's just a coincidence that hardware FPU sqrt throughput would be the bottleneck for a C program; Python is obviously not doing that, or is doing so much else around it that the HW div/sqrt unit is busy for only a trivial fraction of the time. Unless a lot of the time Python spends is on integer division, which is also not fully pipelined.
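    A rough way to check that interpreter overhead dominates (a sketch, not part of the original benchmark): time the same loop with and without the sqrt call. If removing sqrt barely changes the total, the per-iteration bookkeeping, not the FPU, is where the time goes:

    ```python
    from math import sqrt
    from timeit import timeit

    N = 1_000_000

    # Same loop shape as calc_sqrt, with and without the actual sqrt call.
    with_sqrt = timeit("for j in range(N): sqrt(j + 1)",
                       globals=globals(), number=1)
    loop_only = timeit("for j in range(N): j + 1",
                       globals=globals(), number=1)

    print(f"with sqrt: {with_sqrt:.3f}s")
    print(f"loop only: {loop_only:.3f}s")
    ```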


    Hyperthreading is useful when a single thread can’t max out the execution resources of a single core, especially because of branch mispredictions, cache misses, or low instructions-per-cycle due to data dependencies (e.g. one long chain of FP adds or multiplies, so there’s no instruction-level parallelism for the CPU to find).

    Apparently the workload you picked isn't like that. Many workloads do get some speedup from hyper-threading. (Often not linear algebra stuff, though; good BLAS libraries can max out the FMA execution units with one thread for things like matmul, and having 2 threads per core competing for the same cache tends to make things worse.)


