Environment:
Python: 3.6.8
OS: CentOS 7
Spark: 2.4.5
Hadoop: 2.7.7
Hardware: 3 computers (8 VCores available per computer on the Hadoop cluster)
I constructed a simple Python application. My code is:
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('test_use_numpy') \
    .getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(np.arange(100))
rdd.saveAsTextFile('/result/numpy_test')
spark.stop()
I packed the virtual environment as venv.zip and put it on HDFS. I submitted the application using the command below:
/allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster --num-executors 10 \
    --conf spark.yarn.dist.archives=hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
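For reference, the PYSPARK_PYTHON value above implies a particular layout once YARN unpacks venv.zip under the pyenv link name (a sketch inferred from the path pyenv/venv/bin/python, not something stated explicitly anywhere):

pyenv/                      <- link name given after '#' in spark.yarn.dist.archives
    venv/
        bin/python          <- the interpreter PYSPARK_PYTHON points at
        lib/python3.6/site-packages/   <- packages such as numpy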
And I got this error:
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
20/06/23 15:09:08 ERROR yarn.ApplicationMaster: User application exited with status 127
20/06/23 15:09:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 127)
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
I didn't find libpython3.6m.so.1.0 in venv.zip, but I did find it on the CentOS machine. I tried putting it in the venv/bin/ and venv/lib/ directories, but neither worked; I still got the same error.
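(As an aside, simply dropping the file into venv/bin/ or venv/lib/ does not help on its own, because the dynamic loader only searches the standard system library directories and LD_LIBRARY_PATH. A hedged alternative, which I did not try, would be to point the containers' LD_LIBRARY_PATH at the unpacked archive, assuming the library is actually packed under venv/lib/ in the zip:

--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=pyenv/venv/lib \
--conf spark.executorEnv.LD_LIBRARY_PATH=pyenv/venv/lib

Both spark.yarn.appMasterEnv.[Name] and spark.executorEnv.[Name] are the standard ways to set environment variables in the YARN containers.)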
Then I tried to submit the application with the following command:
/allBigData/spark/bin/spark-submit \
    --master spark://master:7077 --num-executors 10 \
    --conf spark.yarn.dist.archives=/home/spark/workspace_python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
And I got a different error:
ModuleNotFoundError: No module named 'numpy'
Could anyone help me solve this problem?
2 Answers
Additional description of the cluster:
There are three workers/nodes/computers in the cluster. I build the application/code on worker A, which also acts as the master machine. Python was installed by others on worker A; I installed Python manually on workers B and C.
I found a clumsy solution to the problem.
I couldn't find libpython3.6m.so.1.0 in venv.zip or in the Python installation directories on workers B and C, but I could find it on worker A. I had previously installed Python manually on B and C using the command:
./configure --with-ssl --prefix=/usr/local/python3
So I reinstalled Python on those two computers using the command:
./configure --prefix=/usr/local/python3 --enable-shared CFLAGS=-fPIC
After installation, I copied libpython3.6m.so.1.0 to /usr/lib64/ so that it can be found on the two workers.
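A consolidated sketch of that copy step (the exact source path under the --prefix and the ldconfig refresh are assumptions about a typical CPython --enable-shared build, not commands quoted from my shell history):

# after 'make && make install' under --prefix=/usr/local/python3
cp /usr/local/python3/lib/libpython3.6m.so.1.0 /usr/lib64/
ldconfig    # assumption: refresh the shared-library cache (run as root)

Then I submitted the Python application and got a different error: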
pyenv/venv/bin/python: symbol lookup error: pyenv/venv/bin/python: undefined symbol: _Py_LegacyLocaleDetected
I used the ldd command to inspect the dependencies of pyenv/venv/bin/python, suspecting that the dependencies being installed in different directories on worker A than on the other two workers could be the reason. So I reinstalled Python on worker A following the same steps as on workers B and C. After that, the application was submitted and finished successfully using the first (YARN cluster mode) command above.
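The check mentioned above might look like this on each worker (illustrative only):

# compare how the bundled interpreter resolves its shared libraries on worker A vs. B and C
ldd pyenv/venv/bin/python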
However, I still cannot submit the application successfully in standalone mode; I get an error when using the second (standalone) command above.
I suppose I set the wrong properties (spark.yarn.appMasterEnv.PYSPARK_PYTHON / spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON) for the Python path, but I don't know how to change them. Any suggestions would be greatly appreciated.
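(One possibility I have not verified: the spark.yarn.* properties only take effect when the master is YARN, and in Spark 2.4 --archives / spark.yarn.dist.archives is a YARN-only distribution mechanism, so with a standalone master the virtual environment would have to already exist at the same path on every worker. A minimal sketch under that assumption, where /home/spark/venv is a hypothetical path and spark.pyspark.python / spark.pyspark.driver.python are the configuration equivalents of the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables:

# assumes the venv has been unpacked to /home/spark/venv on every worker beforehand
/allBigData/spark/bin/spark-submit \
    --master spark://master:7077 \
    --conf spark.pyspark.python=/home/spark/venv/bin/python \
    --conf spark.pyspark.driver.python=/home/spark/venv/bin/python \
    /home/spark/workspace_python/test.py)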
You need to pass the Python .zip (venv.zip here) using the spark-submit --archives option. It is used when the client distributes additional resources to the cluster, and it corresponds to the spark.yarn.dist.archives property. Also set PYSPARK_DRIVER_PYTHON in addition to PYSPARK_PYTHON so the driver uses the same interpreter.
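A minimal sketch of such a submission in YARN cluster mode, reusing the paths from the question (the spark.executorEnv.PYSPARK_PYTHON line is an extra assumption, added to point the executors at the shipped interpreter as well):

/allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster \
    --archives hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py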