
The code example below runs as I expected on two Linux machines: with Python 3.6.8 on a large CentOS-based server (Red Hat 4.8.5-39 kernel), and with Python 3.7.3 on my MX-based box (Debian 8.3.0-6 kernel).

$ python3 testshared.py filename.dat
filename.dat
270623586670000.0

However, on my Mac running Mojave 10.14.6 with Python 3.8.3, I get an error because foo=[] inside processBigFatRow(). Note that foo is assigned in getBigFatData() before the process pool is started. It’s as if, on Linux, the version of foo assigned in getBigFatData() is passed to the processes, while on Mac the processes just use the initialization at the top of the code (which I have to put there so that they are global variables).

I understand that processes are "independent copies" of the main process and that you can’t assign global variables in one process and expect them to change in another. But what about variables that are already set before the parallel processes are started, and that are only read afterwards? It’s as if process copies are not made the same way across OSs. Which one is working "as designed"?

Code example:

import pylab as pl
from concurrent import futures
import sys

foo = []
bar = []

def getBigFatData(filename):
    
    global foo, bar
    # get the big fat data
    print(filename)
    foo = pl.arange(1000000).reshape(1000,1000)
    # compute something as a result
    bar = pl.sum(foo, axis=1)

def processBigFatRow(row):
    total = pl.sum(foo[row,:]**2) if row % 5 else bar[row] 
    return total
    
def main():
    
    getBigFatData(sys.argv[1])
    
    grandTotal = 0.
    rows = pl.arange(100)
    with futures.ProcessPoolExecutor() as pool:
        for tot in pool.map(processBigFatRow, rows):
            grandTotal+=tot
    
    print(grandTotal)

if __name__ == '__main__':
    main()

EDIT:

As suggested, I tested Python 3.8.6 on my MX-Linux box, and it works.

So it works on Linux using Python 3.6.8, 3.7.3 and 3.8.6.
But it doesn’t on Mac using Python 3.8.3.

EDIT 2:

From multiprocessing doc:

On Unix a child process can make use of a shared resource created in a parent process using a global resource.

So it won’t work on Windows (and it’s not the best practice), but shouldn’t it work on Mac?

2 Answers


  1. You are comparing the output of the same code across different Python versions. The built-in modules could be the same, or they could have changed significantly between 3.6 and 3.8. You should run the code on the same Python version in both places before going any further.

  2. That is because, on macOS, the default multiprocessing start method changed in Python 3.8. It went from fork (py37) to spawn (py38), causing quite its share of gnashing of teeth.

    Changed in version 3.8: On macOS, the spawn start method is now the
    default. The fork start method should be considered unsafe as it can
    lead to crashes of the subprocess. See
    bpo-33725.

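    With spawn, globals are not shared with the worker processes. Each worker starts a fresh Python interpreter and re-imports the main module instead of inheriting the parent's memory. Module-level statements such as foo = [] run again in the worker, but the assignment made by getBigFatData() inside main() does not, so the workers only ever see the empty lists.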
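To confirm which start method a given interpreter uses, something like the following sketch can be run on each machine. The helper name describe_start_methods is made up for this illustration; the two multiprocessing calls are standard.

```python
import multiprocessing as mp


def describe_start_methods():
    """Return (default, available) start methods for this interpreter.

    describe_start_methods is a hypothetical helper name.
    Note: calling mp.get_start_method() fixes the default context, so a
    later mp.set_start_method() call would need force=True.
    """
    available = mp.get_all_start_methods()
    default = mp.get_start_method()
    return default, available


if __name__ == '__main__':
    default, available = describe_start_methods()
    print('default:', default)
    print('available:', available)
```

On the setups described in the question, this should report fork as the default on the Linux boxes and spawn on the Mac with Python 3.8.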

    So, practically, as a quick fix, specify a 'fork' context in all of your invocations of ProcessPoolExecutor, by using mp.get_context('fork'). But be aware of the warning above; a longer-term solution would be to share variables by using one of the techniques listed in the multiprocessing docs.

    For example, in your code above, replace:

    with futures.ProcessPoolExecutor() as pool:
        ...
    

    with:

    import multiprocessing as mp
    
    with futures.ProcessPoolExecutor(mp_context=mp.get_context('fork')) as pool:
        ...
    

    Alternative:

    When you are just writing a small script or two, and are sure that no one using a different main somewhere is going to call your code, then you can set the default start method once and for all in your main codeblock with mp.set_start_method:

    if __name__ == '__main__':
        mp.set_start_method('fork')
        ...
    

    But generally, I prefer the first approach, as you don’t have to assume that the caller has set the start method beforehand. And, as per the docs:

    Note that this should be called at most once, and it should be
    protected inside the if __name__ == '__main__' clause of the main
    module.
