I’m trying to initialize a PySpark cluster with a Jupyter Notebook on my local machine running Linux Mint. I am following this tutorial. When I try to create a SparkSession, I get an error that spark-submit does not exist. Strangely, this is the same error I get when I try to get the version of spark-shell without sudo.
spark1 = SparkSession.builder.appName('Test').getOrCreate()
FileNotFoundError: [Errno 2] No such file or directory: '~/Spark/spark-3.1.2-bin-hadoop3.2/./bin/spark-submit'
The correct path for spark-submit is '~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit' (without the extra ./, but the former path should still resolve to the same file, right?).
I don’t know where Spark is getting this directory from, so I don’t know where to correct it.
As mentioned, I cannot even get the version of spark-shell without sudo:
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ./spark-shell --version
./spark-shell: line 60: ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit: No such file or directory
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ ls | grep spark-submit
spark-submit
spark-submit2.cmd
spark-submit.cmd
~/Spark/spark-3.1.2-bin-hadoop3.2/bin$ sudo ./spark-shell --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 11.0.11
Branch HEAD
Compiled by user centos on 2021-05-24T04:27:48Z
Revision de351e30a90dd988b133b3d00fa6218bfcaba8b8
Url https://github.com/apache/spark
Type --help for more information.
I tried granting read, write, and execute permissions on all of the files in ~/Spark, with no effect.
Could this be related to Java permissions?
My .bashrc looks like this:
export SPARK_HOME='~/Spark/spark-3.1.2-bin-hadoop3.2'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
I’m using Python 3.8 and Apache Spark 3.1.2 pre-built for Hadoop 3.2. My Java version is OpenJDK 11.
Edit: After reinstalling (without modifying the permissions), the files in ~/Spark/spark-3.1.2-bin-hadoop3.2/bin/ are:
$ ls -al ~/Spark/spark-3.1.2-bin-hadoop3.2/bin
total 124
drwxr-xr-x 2 squid squid 4096 May 23 21:45 .
drwxr-xr-x 13 squid squid 4096 May 23 21:45 ..
-rwxr-xr-x 1 squid squid 1089 May 23 21:45 beeline
-rw-r--r-- 1 squid squid 1064 May 23 21:45 beeline.cmd
-rwxr-xr-x 1 squid squid 10965 May 23 21:45 docker-image-tool.sh
-rwxr-xr-x 1 squid squid 1935 May 23 21:45 find-spark-home
-rw-r--r-- 1 squid squid 2685 May 23 21:45 find-spark-home.cmd
-rw-r--r-- 1 squid squid 2337 May 23 21:45 load-spark-env.cmd
-rw-r--r-- 1 squid squid 2435 May 23 21:45 load-spark-env.sh
-rwxr-xr-x 1 squid squid 2634 May 23 21:45 pyspark
-rw-r--r-- 1 squid squid 1540 May 23 21:45 pyspark2.cmd
-rw-r--r-- 1 squid squid 1170 May 23 21:45 pyspark.cmd
-rwxr-xr-x 1 squid squid 1030 May 23 21:45 run-example
-rw-r--r-- 1 squid squid 1223 May 23 21:45 run-example.cmd
-rwxr-xr-x 1 squid squid 3539 May 23 21:45 spark-class
-rwxr-xr-x 1 squid squid 2812 May 23 21:45 spark-class2.cmd
-rw-r--r-- 1 squid squid 1180 May 23 21:45 spark-class.cmd
-rwxr-xr-x 1 squid squid 1039 May 23 21:45 sparkR
-rw-r--r-- 1 squid squid 1097 May 23 21:45 sparkR2.cmd
-rw-r--r-- 1 squid squid 1168 May 23 21:45 sparkR.cmd
-rwxr-xr-x 1 squid squid 3122 May 23 21:45 spark-shell
-rw-r--r-- 1 squid squid 1818 May 23 21:45 spark-shell2.cmd
-rw-r--r-- 1 squid squid 1178 May 23 21:45 spark-shell.cmd
-rwxr-xr-x 1 squid squid 1065 May 23 21:45 spark-sql
-rw-r--r-- 1 squid squid 1118 May 23 21:45 spark-sql2.cmd
-rw-r--r-- 1 squid squid 1173 May 23 21:45 spark-sql.cmd
-rwxr-xr-x 1 squid squid 1040 May 23 21:45 spark-submit
-rw-r--r-- 1 squid squid 1155 May 23 21:45 spark-submit2.cmd
-rw-r--r-- 1 squid squid 1180 May 23 21:45 spark-submit.cmd
2 Answers
Why does ‘squid’ have ownership of all these files? Can you set the user/group ownership to the user that actually runs these submits and therefore needs all the environment variables defined in .bashrc?
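For example, assuming Spark was unpacked under your home directory and the account you launch Jupyter from is your regular login user, something along these lines would hand the whole tree over to that user (adjust the path to wherever you actually extracted Spark):

$ sudo chown -R "$USER":"$USER" ~/Spark
$ chmod -R u+rwX ~/Spark

The capital X in chmod only adds the execute bit on directories and on files that already carry one, so the launcher scripts in bin/ stay executable without every .cmd file being marked executable too.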
This might sound really stupid, but I was having exactly the same problem using a custom PySpark kernel for Jupyter Notebook. What solved it was changing the "~" in the Spark path to "/home/{user}".
Here’s roughly what my kernel spec (kernel.json) looks like:
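(A minimal sketch along those lines; the display name, Python interpreter path, and py4j zip name are placeholders to adjust for your own install, and {user} stands for your Linux user name. {connection_file} is Jupyter's own placeholder and stays as-is.)

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "/usr/bin/python3",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2",
    "PYTHONPATH": "/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2/python:/home/{user}/Spark/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip",
    "PYSPARK_PYTHON": "python3"
  }
}

The important part is that every path is written out from /home/{user} rather than with "~".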