
Environment:

Python: 3.6.8
OS: CentOS 7
Spark: 2.4.5
Hadoop: 2.7.7
Hardware: 3 computers (8 VCores available per computer on the Hadoop cluster)

I constructed a simple Python application. My code is:

import numpy as np
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .appName('test_use_numpy') \
        .getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(np.arange(100))
rdd.saveAsTextFile('/result/numpy_test')
spark.stop()

I packed the virtual environment as venv.zip and put it on HDFS (a sketch of how the archive might have been built follows the command below). I submitted the application using the command below:

/allBigData/spark/bin/spark-submit \
--master yarn --deploy-mode cluster --num-executors 10 \
--conf spark.yarn.dist.archives=hdfs:///spark/python/venv.zip#pyenv \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
/home/spark/workspace_python/test.py
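
For reference, a minimal sketch of how the venv.zip archive might be produced and uploaded (the use of the venv module and of zip is an assumption; the HDFS path matches the one in the command above):

# Build a virtual environment containing numpy and pack it as venv.zip (sketch)
python3 -m venv venv
venv/bin/pip install numpy
zip -r venv.zip venv

# Put the archive on HDFS where spark.yarn.dist.archives expects it
hdfs dfs -put venv.zip /spark/python/venv.zip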

Submitting this way produced the following error:
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

20/06/23 15:09:08 ERROR yarn.ApplicationMaster: User application exited with status 127
20/06/23 15:09:08 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: User application exited with status 127)
pyenv/venv/bin/python: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

I didn't find libpython3.6m.so.1.0 in the venv.zip, but I did find it on the CentOS machine. I tried putting it in the venv/bin/ and venv/lib/ directories, but neither worked; I still got the same error.
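
A quick way to see which shared libraries the packed interpreter actually needs, and why copying the file into the archive does not help, is ldd (a diagnostic sketch; the path assumes the archive has been unpacked locally on a worker):

# List the shared libraries the venv's interpreter is linked against
ldd venv/bin/python

# The dynamic linker only searches the standard locations (/lib64, /usr/lib64,
# the directories listed in /etc/ld.so.conf*) plus LD_LIBRARY_PATH, so a copy of
# libpython3.6m.so.1.0 placed in venv/bin/ or venv/lib/ is never looked at.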
Then I tried to submit the application with the following command:

/allBigData/spark/bin/spark-submit \
--master spark://master:7077 --num-executors 10 \
--conf spark.yarn.dist.archives=/home/spark/workspace_python/venv.zip#pyenv \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
/home/spark/workspace_python/test.py

And I got a different error:
ModuleNotFoundError: No module named 'numpy'

Could anyone help me solve this problem?

2 Answers

  1. Chosen as BEST ANSWER

    Additional description of the cluster:
    There are three workers/nodes/computers in the cluster. I constructed the application/code on worker A, which also acts as the master machine. Python was installed by others on worker A; I installed Python manually on workers B and C.

    I found a clumsy solution to the problem.
    I couldn't find libpython3.6m.so.1.0 in venv.zip or in the Python installation directories on workers B and C, but I could find it on worker A. I had originally installed Python manually on B and C with:
    ./configure --with-ssl --prefix=/usr/local/python3
    So I reinstalled Python on those two machines with:
    ./configure --prefix=/usr/local/python3 --enable-shared CFLAGS=-fPIC
    After installation, I copied libpython3.6m.so.1.0 to /usr/lib64/ so that it can be found on the two workers (the full rebuild steps are sketched after the working command further below). Then I submitted the Python application and got a different error:
    pyenv/venv/bin/python: symbol lookup error: pyenv/venv/bin/python: undefined symbol: _Py_LegacyLocaleDetected
    I used the ldd command to inspect the dependencies of pyenv/venv/bin/python, suspecting that the different installation layouts of those dependencies on worker A versus the other two workers were the cause. So I reinstalled Python on worker A following the same steps as on workers B and C. After that, the application was submitted and finished successfully with the command:

    /allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster --num-executors 10 \
    --conf spark.yarn.dist.archives=hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
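
    For reference, the rebuild on each worker looked roughly like this (a sketch: the configure line and the copy to /usr/lib64/ are as described above, while the source directory name and the make/install steps are assumed):

    # Rebuild CPython with a shared libpython on each worker
    cd Python-3.6.8
    ./configure --prefix=/usr/local/python3 --enable-shared CFLAGS=-fPIC
    make && make install

    # Copy the shared library to a directory the dynamic linker searches by default
    cp /usr/local/python3/lib/libpython3.6m.so.1.0 /usr/lib64/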
    

    However, I still cannot submit the application successfully in standalone mode. I got an error with this command:

    /allBigData/spark/bin/spark-submit \
    --master spark://master:7077 --num-executors 10 \
    --archives hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
    
    ModuleNotFoundError: No module named 'numpy'
    

    I suppose I have set the wrong properties (spark.yarn.appMasterEnv.PYSPARK_PYTHON / spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON) for the Python path, but I don't know how to modify them. Any suggestions would be greatly appreciated.
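
    One untested idea (a sketch only): the spark.yarn.* properties take effect only on YARN, so in standalone mode the interpreter would presumably have to be set through the generic spark.pyspark.python / spark.pyspark.driver.python configs instead, and the environment made available at the same path on every worker by other means, since --archives is a YARN option in Spark 2.4. For example, with the venv unpacked to a hypothetical /home/spark/venv on each machine:

    /allBigData/spark/bin/spark-submit \
    --master spark://master:7077 \
    --conf spark.pyspark.python=/home/spark/venv/bin/python \
    --conf spark.pyspark.driver.python=/home/spark/venv/bin/python \
    /home/spark/workspace_python/test.py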


  2. You need to pass the zipped Python environment using the spark-submit --archives option.
    With --archives, the archive is distributed to the cluster as an additional resource and extracted into the working directory of each executor.

    Also add PYSPARK_DRIVER_PYTHON:

    /allBigData/spark/bin/spark-submit \
    --master yarn --deploy-mode cluster --num-executors 10 \
    --archives hdfs:///spark/python/venv.zip#pyenv \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/venv/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=pyenv/venv/bin/python \
    /home/spark/workspace_python/test.py
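
    The fragment after the # in venv.zip#pyenv is the name under which YARN links the unpacked archive in each container's working directory, which is why the interpreter paths above start with pyenv/. Roughly (illustrative layout):

    # Inside each YARN container after the archive is localized
    ./pyenv/                    # link named after the '#pyenv' fragment
    ./pyenv/venv/bin/python     # the interpreter PYSPARK_PYTHON points at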
    