Environment:
Python: 3.6.8
OS: CentOS 7
Spark: 2.4.5
Hadoop: 2.7.7
Hardware: 3 computers (8 VCores available per computer on the Hadoop cluster)
I constructed a simple Python application. My code is:
import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName('test_use_numpy') \
    .getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(np.arange(100))
rdd.saveAsTextFile('/result/numpy_test')
spark.stop()
I packed the virtual environment as venv.zip and put it on HDFS. I submitted the application using the command below:
/allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster --num-executors 10 \
    --conf spark.yarn.dist.archives=hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
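For reference, the PYSPARK_PYTHON value above implies a particular layout once YARN unpacks venv.zip under the pyenv link name (a sketch inferred from the path pyenv/venv/bin/python, not something stated explicitly anywhere):

pyenv/                      <- link name given after '#' in spark.yarn.dist.archives
    venv/
        bin/python          <- the interpreter PYSPARK_PYTHON points at
        lib/python3.6/site-packages/   <- packages such as numpy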
And I got this error:
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
20/06/23 15:09:08 ERROR yarn.ApplicationMaster: User application exited with status 127
20/06/23 15:09:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 127)
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory
I didn't find libpython3.6m.so.1.0 in venv.zip, but I did find it on the CentOS machine. I tried putting it in the venv/bin/ and venv/lib/ directories, but neither worked; I still got the same error.
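(As an aside, simply dropping the file into venv/bin/ or venv/lib/ does not help on its own, because the dynamic loader only searches the standard system library directories and LD_LIBRARY_PATH. A hedged alternative, which I did not try, would be to point the containers' LD_LIBRARY_PATH at the unpacked archive, assuming the library is actually packed under venv/lib/ in the zip:

--conf spark.yarn.appMasterEnv.LD_LIBRARY_PATH=pyenv/venv/lib \
--conf spark.executorEnv.LD_LIBRARY_PATH=pyenv/venv/lib

Both spark.yarn.appMasterEnv.[Name] and spark.executorEnv.[Name] are the standard ways to set environment variables in the YARN containers.)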
Then I tried to submit the application with the following command:
/allBigData/spark/bin/spark-submit \
    --master spark://master:7077 --num-executors 10 \
    --conf spark.yarn.dist.archives=/home/spark/workspace_python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
And I got a different error:
ModuleNotFoundError: No module named 'numpy'
Could anyone help me solve this problem?
2 Answers
Additional description of the cluster:
There are three workers/nodes/computers in the cluster. I build the application/code on worker A, which also acts as the master machine. Python was installed by others on worker A; I installed Python manually on workers B and C.
I found a clumsy solution to the problem.
I couldn't find libpython3.6m.so.1.0 in venv.zip or in the Python installation directories on workers B and C, but I could find it on worker A. I had previously installed Python manually on B and C using the command:
./configure --with-ssl --prefix=/usr/local/python3
So I reinstalled Python on those two computers using the command:
./configure --prefix=/usr/local/python3 --enable-shared CFLAGS=-fPIC
After installation, I copied libpython3.6m.so.1.0 to /usr/lib64/ so that it can be found on the two workers.
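A consolidated sketch of that copy step (the exact source path under the --prefix and the ldconfig refresh are assumptions about a typical CPython --enable-shared build, not commands quoted from my shell history):

# after 'make && make install' under --prefix=/usr/local/python3
cp /usr/local/python3/lib/libpython3.6m.so.1.0 /usr/lib64/
ldconfig    # assumption: refresh the shared-library cache (run as root)

Then I submitted the Python application and got a different error: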
pyenv/venv/bin/python: symbol lookup error: pyenv/venv/bin/python: undefined symbol: _Py_LegacyLocaleDetected
I used the ldd command to inspect the dependencies of pyenv/venv/bin/python, suspecting that the dependencies being installed in different directories on worker A than on the other two workers could be the reason. So I reinstalled Python on worker A following the same steps as on workers B and C. After that, the application was submitted and finished successfully using the first (YARN cluster mode) command above.
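The check mentioned above might look like this on each worker (illustrative only):

# compare how the bundled interpreter resolves its shared libraries on worker A vs. B and C
ldd pyenv/venv/bin/python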
However, I still cannot submit the application successfully in standalone mode; I get an error when using the second (standalone) command above.
I suppose I set the wrong properties (spark.yarn.appMasterEnv.PYSPARK_PYTHON / spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON) for the Python path, but I don't know how to change them. Any suggestions would be greatly appreciated.
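(One possibility I have not verified: the spark.yarn.* properties only take effect when the master is YARN, and in Spark 2.4 --archives / spark.yarn.dist.archives is a YARN-only distribution mechanism, so with a standalone master the virtual environment would have to already exist at the same path on every worker. A minimal sketch under that assumption, where /home/spark/venv is a hypothetical path and spark.pyspark.python / spark.pyspark.driver.python are the configuration equivalents of the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment variables:

# assumes the venv has been unpacked to /home/spark/venv on every worker beforehand
/allBigData/spark/bin/spark-submit \
    --master spark://master:7077 \
    --conf spark.pyspark.python=/home/spark/venv/bin/python \
    --conf spark.pyspark.driver.python=/home/spark/venv/bin/python \
    /home/spark/workspace_python/test.py)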
You need to pass the Python .zip (venv.zip here) using the spark-submit --archives option. It is used when the client distributes additional resources to the cluster, and it corresponds to the spark.yarn.dist.archives property. Also set PYSPARK_DRIVER_PYTHON in addition to PYSPARK_PYTHON so the driver uses the same interpreter.
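A minimal sketch of such a submission in YARN cluster mode, reusing the paths from the question (the spark.executorEnv.PYSPARK_PYTHON line is an extra assumption, added to point the executors at the shipped interpreter as well):

/allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster \
    --archives hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py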