I’m trying to initialize a PySpark cluster with a Jupyter Notebook on my local machine running Linux Mint. I am following this tutorial. When I try to create a SparkSession, I get an error that spark-submit does not exist. Strangely, this is the same error I get when I try to get the version of spark-shell without sudo.
spark1 = SparkSession.builder.appName('Test').getOrCreate()
FileNotFoundError: [Errno 2] No such file or directory: '~/Spark/spark-3.1.2-bin-hadoop3.2/./bin/spark-submit'
The correct path for spark-submit is '~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit' (without the extra ./, but the former path should still resolve to the same file, right?).
I don’t know where Spark is getting this directory from, so I don’t know where to correct it.
As mentioned, I cannot even get the version of spark-shell without sudo:
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ./spark-shell --version
./spark-shell: line 60: ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit: No such file or directory
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ls | grep spark-submit
spark-submit
spark-submit2.cmd
spark-submit.cmd
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ sudo ./spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.
I tried granting read, write, and execute permissions on all of the files in ~/Spark, with no effect.
Could this be related to Java permissions?
My .bashrc looks like this:
export SPARK_HOME='~/Spark/spark-3.1.2-bin-hadoop3.2'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
I’m using Python 3.8 and Apache Spark 3.1.2 pre-built for Hadoop 3.2. My Java version is OpenJDK 11.
Edit: After reinstalling (without modifying the permissions), the files in ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/ are:
$ ls -al ~/Spark/spark-3.1.2-bin-hadoop3.2/bin
total 124
drwxr-xr-x 2 squid squid 4096 May 23 21:45 .
drwxr-xr-x 13 squid squid 4096 May 23 21:45 ..
-rwxr-xr-x 1 squid squid 1089 May 23 21:45 beeline
-rw-r--r-- 1 squid squid 1064 May 23 21:45 beeline.cmd
-rwxr-xr-x 1 squid squid 10965 May 23 21:45 docker-image-tool.sh
-rwxr-xr-x 1 squid squid 1935 May 23 21:45 find-spark-home
-rw-r--r-- 1 squid squid 2685 May 23 21:45 find-spark-home.cmd
-rw-r--r-- 1 squid squid 2337 May 23 21:45 load-spark-env.cmd
-rw-r--r-- 1 squid squid 2435 May 23 21:45 load-spark-env.sh
-rwxr-xr-x 1 squid squid 2634 May 23 21:45 pyspark
-rw-r--r-- 1 squid squid 1540 May 23 21:45 pyspark2.cmd
-rw-r--r-- 1 squid squid 1170 May 23 21:45 pyspark.cmd
-rwxr-xr-x 1 squid squid 1030 May 23 21:45 run-example
-rw-r--r-- 1 squid squid 1223 May 23 21:45 run-example.cmd
-rwxr-xr-x 1 squid squid 3539 May 23 21:45 spark-class
-rwxr-xr-x 1 squid squid 2812 May 23 21:45 spark-class2.cmd
-rw-r--r-- 1 squid squid 1180 May 23 21:45 spark-class.cmd
-rwxr-xr-x 1 squid squid 1039 May 23 21:45 sparkR
-rw-r--r-- 1 squid squid 1097 May 23 21:45 sparkR2.cmd
-rw-r--r-- 1 squid squid 1168 May 23 21:45 sparkR.cmd
-rwxr-xr-x 1 squid squid 3122 May 23 21:45 spark-shell
-rw-r--r-- 1 squid squid 1818 May 23 21:45 spark-shell2.cmd
-rw-r--r-- 1 squid squid 1178 May 23 21:45 spark-shell.cmd
-rwxr-xr-x 1 squid squid 1065 May 23 21:45 spark-sql
-rw-r--r-- 1 squid squid 1118 May 23 21:45 spark-sql2.cmd
-rw-r--r-- 1 squid squid 1173 May 23 21:45 spark-sql.cmd
-rwxr-xr-x 1 squid squid 1040 May 23 21:45 spark-submit
-rw-r--r-- 1 squid squid 1155 May 23 21:45 spark-submit2.cmd
-rw-r--r-- 1 squid squid 1180 May 23 21:45 spark-submit.cmd
2 Answers
Why does ‘squid’ have ownership of all these files? Can you set the user/group ownership to the user that actually runs these submits and therefore needs all the environment variables defined in .bashrc?
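For example, assuming Spark was unpacked under your home directory and the account you launch Jupyter from is your regular login user, something along these lines would hand the whole tree over to that user (adjust the path to wherever you actually extracted Spark):

$ sudo chown -R "$USER":"$USER" ~/Spark
$ chmod -R u+rwX ~/Spark

The capital X in chmod only adds the execute bit on directories and on files that already carry one, so the launcher scripts in bin/ stay executable without every .cmd file being marked executable too.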
This might sound really stupid, but I was having exactly the same problem using a custom PySpark kernel for Jupyter Notebook. What solved it was changing the "~" in the Spark path to "/home/{user}".
Here’s roughly what my kernel spec (kernel.json) looks like:
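(A minimal sketch along those lines; the display name, Python interpreter path, and py4j zip name are placeholders to adjust for your own install, and {user} stands for your Linux user name. {connection_file} is Jupyter's own placeholder and stays as-is.)

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "/usr/bin/python3",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2",
    "PYTHONPATH": "/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2/python:/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip",
    "PYSPARK_PYTHON": "python3"
  }
}

The important part is that every path is written out from /home/{user} rather than with "~".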