I’m using emr-5.33.1 with Python 3.7.16.
The goal is to install petastorm==0.12.1 on EMR. These are the steps I use to install it (they worked until now):
- Upload petastorm and all of its required dependencies to an S3 folder.
- Copy all the libraries from S3 into a temporary folder, e.g.:
aws s3 cp s3_whl_files_path ./tmpfolder/ --recursive --region=<region-name>
- Run the pip install command:
sudo python3 -m pip install --no-index --find-links=./tmpfolder petastorm==0.12.1
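The steps above can be sketched as one small Python script. This is only a sketch: build_commands and run_all are hypothetical helper names, and the S3 path and region shown in the comment are placeholders rather than real values.

```python
import subprocess


def build_commands(s3_path, region, tmp_dir="./tmpfolder",
                   requirement="petastorm==0.12.1"):
    """Return the copy and install commands from the steps as argument lists."""
    copy_cmd = ["aws", "s3", "cp", s3_path, tmp_dir,
                "--recursive", "--region=" + region]
    install_cmd = ["sudo", "python3", "-m", "pip", "install", "--no-index",
                   "--find-links=" + tmp_dir, requirement]
    return [copy_cmd, install_cmd]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        print("+ " + " ".join(cmd), flush=True)
        # check=True raises CalledProcessError on a non-zero exit code
        subprocess.run(cmd, check=True)


# Example call (placeholder bucket and region, not run here):
# run_all(build_commands("s3://my-bucket/whl-files/", "us-east-1"))
```

Keeping the commands as argument lists avoids shell quoting issues and makes it easy to print exactly what was executed.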
These are the logs from bootstrap-actions:
- From node/stdout.gz: it did not output ‘successfully installed petastorm’; it stopped at
Processing ./tmpfolder/pyspark-2.4.7.tar.gz
which is a dependency of petastorm.
- From node/stderr.gz: it did not output any errors.
And the log from the application:
- From containers/stdout.gz:
ModuleNotFoundError: No module named 'petastorm'
What I’ve tried so far:
- I noticed that some of petastorm’s dependencies were not being installed successfully, so I added them to my bootstrap shell script, and those installs succeeded. Still, the module is not found on import, and bootstrap-actions/node/stdout.gz shows that pyspark==2.4.7, which is a dependency of petastorm, is not installed successfully. I’m assuming it is not installed because every other library has a "successfully installed <library name>" line in the bootstrap-actions/node/stdout.gz log and pyspark does not.
- I added pyspark to bootstrap.sh and still get the same error.
- I added the dependency py4j to bootstrap.sh; py4j installs successfully, but pyspark==2.4.7 still does not.
The weird thing is that I’ve been running pyspark code on EMR and it worked fine. Why can’t petastorm simply skip installing pyspark, since it is already installed on the EMR instance?
2 Answers
Did you successfully test the package installation first in an EMR node? If not, doing that could help to diagnose any potential issue with the pip installation.
I didn’t understand which log location you are referring to when you say node/stdout.gz. Is it the bootstrap-action log? If it didn’t log successful completion of the script execution, something likely failed in between. You may want to set verbose/debug for your commands in bootstrap script for effective troubleshooting.
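As a rough illustration of the verbose/debug idea, here is a minimal Python sketch of a wrapper that echoes each command before running it and fails loudly on a non-zero exit code. run_logged is a hypothetical helper name, not part of any EMR tooling. The child process writes directly to the script's stdout/stderr, so a stalled command is still visible up to the point where it hangs.

```python
import subprocess
import sys


def run_logged(cmd):
    """Echo cmd, run it with inherited stdout/stderr, and raise on failure."""
    print("+ " + " ".join(cmd), flush=True)
    completed = subprocess.run(cmd)  # output is not captured, it streams live
    print("exit code: {}".format(completed.returncode), flush=True)
    completed.check_returncode()  # raises CalledProcessError if non-zero
    return completed.returncode


# Example call (not run here): pip's -v flag makes the install more verbose.
# run_logged([sys.executable, "-m", "pip", "install", "-v", "--no-index",
#             "--find-links=./tmpfolder", "petastorm==0.12.1"])
```

If the log ends after the echoed command with no "exit code" line, the command was still running, or was killed, when the script stopped.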
You mentioned ModuleNotFoundError in containers/stdout.gz. Did you check whether the module can be imported normally from the Python interpreter before submitting jobs?

In my team, we face an analogous problem to yours: we have standard and custom Python libraries that we want to be available on all EMR nodes. Moreover, we want the versions of these libraries to be exactly the same as those used in local development (e.g. when running unit tests for Python code or PySpark code via Spark local), so that any problems specific to a particular version of a package are discovered during dev testing rather than only in Beta after pushing code.
We solve it by building our own Docker image and then using the set-up described in this guide:
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html
We’ve been using this for the past 9 months (with the set-up encoded via CDK) without problems. I know this may sound like overkill for your case, but I am sharing it in case you can’t make progress with the bootstrap script.
One thing you might want to try before that is making it a "Step" rather than a bootstrap script. Bootstrap scripts are run before EMR installs all the software (e.g. Spark), so if the petastorm library has PySpark as a dependency, that might be what triggers the attempt to install PySpark, even though EMR itself handles that.
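To check whether that is what happens, a small diagnostic along these lines could be run on a node, once from a bootstrap script and once from a step. It is a sketch only: describe, installed_version and report are hypothetical helper names. It rests on my assumption that the pyspark shipped with EMR's Spark may be importable without being registered as a pip-installed distribution, in which case pip would still try to install it as a dependency.

```python
import importlib.util
import sys


def installed_version(dist_name):
    """Return the version found in installed package metadata, or None."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # 3.8+
    except ImportError:
        import pkg_resources  # fallback for Python 3.7, ships with setuptools
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def describe(module_name, dist_name=None):
    """Report whether a top-level module is importable and known as a package."""
    spec = importlib.util.find_spec(module_name)
    return {
        "module": module_name,
        "importable": spec is not None,
        "location": getattr(spec, "origin", None),
        "installed_version": installed_version(dist_name or module_name),
    }


def report(names=("pyspark", "py4j", "petastorm")):
    """Print the interpreter path and the status of each module."""
    print("interpreter:", sys.executable)
    for name in names:
        print(describe(name))
```

Calling report() with the same interpreter the job uses also shows whether the bootstrap script installed petastorm into a different Python than the one running the application.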