I have a PySpark job that is split across multiple code files in this structure:
flexible_calendar
- Cache
  - redis_main.py
- Helpers
  - helpers.py
- Spark
  - spark_main.py
- main.py
In ‘main.py’ I use functions from ‘helpers.py’, ‘redis_main.py’, etc.
The ‘flexible_calendar’ folder is uploaded to an S3 bucket so that EMR can run the code from it.
I’ve created an EMR cluster that is bootstrapped with all the needed packages, and it works when I run a simple single-file script (from S3) that contains all the functions.
The problem is that when I try to use the multi-file structure, the code fails because it doesn’t recognize the modules ‘helpers.py’, ‘spark_main’, etc.
I’ve tried multiple configurations in the ‘Step Arguments’ field, none of which worked, such as:
Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr
Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Cache/redis_main.py s3://flexible-calendar/flexible-calendar-emr/Helpers/helpers.py s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py s3://flexible-calendar/flexible-calendar-emr/main.py
Arguments: spark-submit --deploy-mode cluster --class s3://flexible-calendar/flexible-calendar-emr s3://flexible-calendar/flexible-calendar-emr/main.py
Arguments: spark-submit --deploy-mode cluster --class s3://flexible-calendar/main_one.py
Also:
Arguments: spark-submit --py-files s3://flexible-calendar/flexible-calendar-emr.zip
Arguments: spark-submit --deploy-mode --py-files s3://flexible-calendar/flexible-calendar-emr.zip
Arguments: spark-submit --py-files s3://flexible-calendar/flexible-calendar-emr.zip --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py
Arguments: spark-submit --deploy-mode cluster s3://flexible-calendar/flexible-calendar-emr/Spark/spark_main.py --py-files s3://flexible-calendar/flexible-calendar-emr.zip
and more…
Hope someone can help,
Thanks.
2 Answers
Quoting from Spark Documentation:
So the zip file you have created needs to be added to the sys path inside the main.py.
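To illustrate the idea, here is a minimal local sketch (not an EMR run) of how putting a zip archive on `sys.path` makes the packages inside it importable. The archive name `flexible_calendar_demo.zip`, the `greet` function, and the `__init__.py` file are hypothetical stand-ins I made up for the demo; I'm assuming your real zip has `Helpers/`, `Cache/` and `Spark/` at its root.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(zip_path):
    """Create a tiny archive that mimics the Helpers/ package layout.

    The contents are hypothetical; a real archive would hold the actual
    Cache/, Helpers/ and Spark/ folders at the zip root.
    """
    with zipfile.ZipFile(zip_path, "w") as zf:
        # An __init__.py makes the folder an ordinary importable package.
        zf.writestr("Helpers/__init__.py", "")
        zf.writestr(
            "Helpers/helpers.py",
            "def greet(name):\n    return 'hello ' + name\n",
        )


def add_zip_to_path(zip_path):
    """Put the archive on sys.path so Python's zip importer can see it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    # Make sure the import system notices the newly created archive.
    importlib.invalidate_caches()


demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "flexible_calendar_demo.zip")
build_demo_zip(demo_zip)
add_zip_to_path(demo_zip)

# This import is resolved from inside the zip archive.
from Helpers.helpers import greet

result = greet("emr")
print(result)
```

In the real job the same `sys.path` insert would go at the top of main.py, before the imports from Helpers, Cache and Spark. Also note that spark-submit options such as `--py-files` have to come before the application file, because anything after it is passed to your script as its own arguments. So the step would look roughly like `spark-submit --deploy-mode cluster --py-files s3://flexible-calendar/flexible-calendar-emr.zip s3://flexible-calendar/flexible-calendar-emr/main.py`, assuming the zip is built with the package folders at its root.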
Let me know if this helps!