I’m using the jupyter/pyspark-notebook Docker image to develop a Spark script. My Dockerfile looks like this:
FROM jupyter/pyspark-notebook
USER root
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt
# this is a default user and the image is configured to use it
ARG NB_USER=jovyan
ARG NB_UID=1000
ARG NB_GID=100
ENV USER ${NB_USER}
ENV HOME /home/${NB_USER}
RUN groupadd -f ${USER} && \
    chown -R ${USER}:${USER} ${HOME}
USER ${NB_USER}
RUN export PACKAGES="io.delta:delta-core_2.12:1.0.0"
RUN export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
My requirements.txt looks like this:
delta-spark==2.1.0
deltalake==0.10.1
jupyterlab==4.0.6
pandas==2.1.0
pyspark==3.3.3
I build and run the image via docker compose, and then attempt to run this in a notebook:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("LocalDelta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
And get the following error:
AttributeError Traceback (most recent call last)
Cell In[2], line 2
1 import pyspark
----> 2 from delta import *
4 builder = pyspark.sql.SparkSession.builder.appName("LocalDelta")
5 .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
6 .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
8 spark = configure_spark_with_delta_pip(builder).getOrCreate()
File /opt/conda/lib/python3.11/site-packages/delta/__init__.py:17
1 #
2 # Copyright (2021) The Delta Lake Project Authors.
3 #
(...)
14 # limitations under the License.
15 #
---> 17 from delta.tables import DeltaTable
18 from delta.pip_utils import configure_spark_with_delta_pip
20 __all__ = ['DeltaTable', 'configure_spark_with_delta_pip']
File /opt/conda/lib/python3.11/site-packages/delta/tables.py:21
1 #
2 # Copyright (2021) The Delta Lake Project Authors.
3 #
(...)
14 # limitations under the License.
15 #
17 from typing import (
18 TYPE_CHECKING, cast, overload, Any, Iterable, Optional, Union, NoReturn, List, Tuple
19 )
---> 21 import delta.exceptions # noqa: F401; pylint: disable=unused-variable
22 from delta._typing import (
23 ColumnMapping, OptionalColumnMapping, ExpressionOrColumn, OptionalExpressionOrColumn
24 )
26 from pyspark import since
File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:166
162 utils.convert_exception = convert_delta_exception
165 if not _delta_exception_patched:
--> 166 _patch_convert_exception()
167 _delta_exception_patched = True
File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:154, in _patch_convert_exception()
149 def _patch_convert_exception() -> None:
150 """
151 Patch PySpark's exception convert method to convert Delta's Scala concurrent exceptions to the
152 corresponding Python exceptions.
153 """
--> 154 original_convert_sql_exception = utils.convert_exception
156 def convert_delta_exception(e: "JavaObject") -> CapturedException:
157 delta_exception = _convert_delta_exception(e)
AttributeError: module 'pyspark.sql.utils' has no attribute 'convert_exception'
It seems there is an incompatibility between the pyspark and delta-spark versions, but I haven’t been able to find anything on Stack Overflow or anywhere else to point me in the right direction. I based the code on this example: https://github.com/handreassa/delta-docker/tree/main.
Any help would be much appreciated.
2 Answers
Your delta-spark==2.1.0 version has to match the version of the Delta jar added via --packages. So set:
RUN export PACKAGES="io.delta:delta-core_2.12:2.1.0"
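To make the matching rule concrete, here is an illustrative pure-Python helper. The version table is a partial copy of the Delta Lake releases page (https://docs.delta.io/latest/releases.html); the helper name and structure are my own, not part of any library:

```python
# delta-spark (PyPI) major.minor -> compatible Spark major.minor,
# per the Delta Lake releases table. Partial, for illustration only.
DELTA_TO_SPARK = {
    "1.0": "3.1",
    "2.0": "3.2",
    "2.1": "3.3",
    "2.2": "3.3",
    "2.3": "3.3",
    "2.4": "3.4",
}

def delta_core_coordinate(delta_spark_version: str, scala: str = "2.12") -> str:
    """Return the --packages Maven coordinate matching a pip-installed delta-spark.

    Raises ValueError for versions not in the known compatibility table.
    """
    major_minor = ".".join(delta_spark_version.split(".")[:2])
    if major_minor not in DELTA_TO_SPARK:
        raise ValueError(f"Unknown delta-spark version: {delta_spark_version}")
    # The jar version must equal the delta-spark version exactly.
    return f"io.delta:delta-core_{scala}:{delta_spark_version}"

# With delta-spark==2.1.0 installed (and Spark 3.3.x), the Dockerfile should use:
print(delta_core_coordinate("2.1.0"))  # io.delta:delta-core_2.12:2.1.0
```

The point is that the pip package and the `--packages` jar are two halves of the same release: pinning one without the other triggers exactly the kind of `AttributeError` shown in the question.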
As mentioned by @boyangeor, the jar version in the exported packages needs to match the installed delta-spark version. Additionally, the jupyter/pyspark-notebook image was recently upgraded to Spark 3.5, which is not currently supported by Delta Lake (per https://docs.delta.io/latest/releases.html). Because of that change, I pinned the image tag to get a compatible Spark version (jupyter/pyspark-notebook:spark-3.4.1).
I am the owner of https://github.com/handreassa/delta-docker repo, so I already did all the changes needed to get it back working (https://github.com/handreassa/delta-docker/commit/f0ef9c387a20565ea75d6a846ca354b2052709f6).
Code tested in the image for reference:
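(The exact snippet from the commit isn’t reproduced here; a minimal smoke test along these lines, with a placeholder path, confirms the image works end to end once the versions match:)

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Build a local SparkSession with the Delta extensions enabled.
builder = (
    pyspark.sql.SparkSession.builder.appName("LocalDelta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Round-trip a tiny DataFrame through the Delta format.
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/tmp/delta-smoke-test")
assert spark.read.format("delta").load("/tmp/delta-smoke-test").count() == 5
```

If the pip package and the jar disagree, this fails at the `from delta import ...` line with the `AttributeError` from the question; once they match, the write/read round trip succeeds.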