The code example below runs as I expect on two Linux machines: with Python 3.6.8 on a large CentOS-based server (a build reporting Red Hat 4.8.5-39), and with Python 3.7.3 on my MX Linux box (a build reporting Debian 8.3.0-6).
$ python3 testshared.py filename.dat
filename.dat
270623586670000.0
However, on my Mac running Mojave 10.14.6 with Python 3.8.3, I get an error because foo is still [] inside processBigFatRow(). Note that foo is assigned in getBigFatData() before the process pool is started. It's as if on Linux the version of foo assigned in getBigFatData() is passed to the worker processes, while on the Mac the workers only see the initialization at the top of the file (which I have to put there so that foo and bar are global variables).
I understand that worker processes are "independent copies" of the main process, and that you can't assign a global variable in one process and expect it to change in another. But what about variables that are already set before the parallel processes are started, and that are only read, never written? It's as if the process copies are not the same across OSs. Which one works "as designed"?
Code example:
import pylab as pl
from concurrent import futures
import sys

foo = []
bar = []

def getBigFatData(filename):
    global foo, bar
    # get the big fat data
    print(filename)
    foo = pl.arange(1000000).reshape(1000, 1000)
    # compute something as a result
    bar = pl.sum(foo, axis=1)

def processBigFatRow(row):
    total = pl.sum(foo[row, :]**2) if row % 5 else bar[row]
    return total

def main():
    getBigFatData(sys.argv[1])
    grandTotal = 0.
    rows = pl.arange(100)
    with futures.ProcessPoolExecutor() as pool:
        for tot in pool.map(processBigFatRow, rows):
            grandTotal += tot
    print(grandTotal)

if __name__ == '__main__':
    main()
EDIT:
As suggested, I tested Python 3.8.6 on my MX-Linux box, and it works.
So it works on Linux using Python 3.6.8, 3.7.3 and 3.8.6.
But it doesn’t on Mac using Python 3.8.3.
EDIT 2:
From the multiprocessing docs:
On Unix a child process can make use of a shared resource created in a parent process using a global resource.
So it won’t work on Windows (and it’s not the best practice), but shouldn’t it work on Mac?
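The difference can be made observable with a small, self-contained sketch (the names here are mine, not from the code above; 'fork' is only available on Unix-like systems):

```python
import multiprocessing as mp

data = []  # module-level global, like foo in the question

def peek(_):
    # Length of the global as seen inside the worker process.
    return len(data)

def child_view(method):
    # Start one worker with the given start method and report
    # what it sees in the module-level global.
    ctx = mp.get_context(method)
    with ctx.Pool(1) as pool:
        return pool.map(peek, [0])[0]

if __name__ == '__main__':
    data.extend(range(5))       # assigned in the parent, after import
    print(child_view('fork'))   # 5: the child inherits the parent's memory
    print(child_view('spawn'))  # 0: the child re-imports the module fresh
```

Under fork the child is a copy of the parent at fork time, so it sees the mutated global; under spawn the child re-imports the module, so it only sees the top-of-file initialization.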
Answers
You are comparing the output of the same code across two different Python versions. The built-in modules could be the same, or they could have changed significantly between 3.6 and 3.8. You should run the code on the same Python version in both places before going any further.
That is because, on macOS, the default multiprocessing start method changed in Python 3.8: it went from fork (3.7) to spawn (3.8), causing quite its share of gnashing of teeth. With spawn, globals are not shared with the worker processes.
So, practically, as a quick fix, specify a 'fork' context in all of your invocations of ProcessPoolExecutor, by using mp.get_context('fork'). But be aware of the warning above; a longer-term solution would be to share variables using one of the techniques listed in the multiprocessing docs.
For example, in your code above, replace the plain ProcessPoolExecutor() with one created from a fork context.
Alternative:
When you are just writing a small script or two, and are sure that no one calling your code from a different main will ever need another start method, you can instead set the default start method once and for all in your main block with mp.set_start_method.
But generally I prefer the first approach, as you don't have to assume that the caller has set the start method beforehand. And, as per the docs, set_start_method should not be used more than once in the program.