
I’m using the jupyter/pyspark-notebook Docker image to develop a Spark script. My Dockerfile looks like this:

FROM jupyter/pyspark-notebook

USER root
COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt && rm requirements.txt

# this is a default user and the image is configured to use it
ARG NB_USER=jovyan
ARG NB_UID=1000
ARG NB_GID=100

ENV USER ${NB_USER}
ENV HOME /home/${NB_USER}
RUN groupadd -f ${USER} && \
    chown -R ${USER}:${USER} ${HOME}

USER ${NB_USER}

RUN export PACKAGES="io.delta:delta-core_2.12:1.0.0"
RUN export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"

My requirements.txt looks like this:

delta-spark==2.1.0
deltalake==0.10.1
jupyterlab==4.0.6
pandas==2.1.0
pyspark==3.3.3

I build and run the image via docker compose, and then attempt to run this in a notebook:

import pyspark
from delta import *

builder = pyspark.sql.SparkSession.builder.appName("LocalDelta") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

And get the following error:

AttributeError                            Traceback (most recent call last)
Cell In[2], line 2
      1 import pyspark
----> 2 from delta import *
      4 builder = pyspark.sql.SparkSession.builder.appName("LocalDelta") \
      5     .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
      6     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      8 spark = configure_spark_with_delta_pip(builder).getOrCreate()

File /opt/conda/lib/python3.11/site-packages/delta/__init__.py:17
      1 #
      2 # Copyright (2021) The Delta Lake Project Authors.
      3 #
   (...)
     14 # limitations under the License.
     15 #
---> 17 from delta.tables import DeltaTable
     18 from delta.pip_utils import configure_spark_with_delta_pip
     20 __all__ = ['DeltaTable', 'configure_spark_with_delta_pip']

File /opt/conda/lib/python3.11/site-packages/delta/tables.py:21
      1 #
      2 # Copyright (2021) The Delta Lake Project Authors.
      3 #
   (...)
     14 # limitations under the License.
     15 #
     17 from typing import (
     18     TYPE_CHECKING, cast, overload, Any, Iterable, Optional, Union, NoReturn, List, Tuple
     19 )
---> 21 import delta.exceptions  # noqa: F401; pylint: disable=unused-variable
     22 from delta._typing import (
     23     ColumnMapping, OptionalColumnMapping, ExpressionOrColumn, OptionalExpressionOrColumn
     24 )
     26 from pyspark import since

File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:166
    162     utils.convert_exception = convert_delta_exception
    165 if not _delta_exception_patched:
--> 166     _patch_convert_exception()
    167     _delta_exception_patched = True

File /opt/conda/lib/python3.11/site-packages/delta/exceptions.py:154, in _patch_convert_exception()
    149 def _patch_convert_exception() -> None:
    150     """
    151     Patch PySpark's exception convert method to convert Delta's Scala concurrent exceptions to the
    152     corresponding Python exceptions.
    153     """
--> 154     original_convert_sql_exception = utils.convert_exception
    156     def convert_delta_exception(e: "JavaObject") -> CapturedException:
    157         delta_exception = _convert_delta_exception(e)

AttributeError: module 'pyspark.sql.utils' has no attribute 'convert_exception'

It seems there is an incompatibility between the pyspark and delta-spark versions, but I haven’t been able to find anything on Stack Overflow or anywhere else to point me in the right direction. I based the code on this example: https://github.com/handreassa/delta-docker/tree/main.

Any help would be much appreciated.


Answers


  1. Your delta-spark==2.1.0 version has to match the version of the jar added via --packages, so set:

         RUN export PACKAGES="io.delta:delta-core_2.12:2.1.0"
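
    Note that RUN export only sets the variable for that single build step; it does not survive into later layers or into the running container. To have it visible at runtime you would typically persist it with ENV instead. A minimal sketch, reusing the coordinate above:

        # ENV persists into the image, unlike RUN export
        ENV PYSPARK_SUBMIT_ARGS="--packages io.delta:delta-core_2.12:2.1.0 pyspark-shell"

    That said, configure_spark_with_delta_pip(builder) already sets spark.jars.packages to the delta-core artifact matching the installed delta-spark wheel, so the environment variable should be redundant when the session is built that way.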

  2. As mentioned by @boyangeor, the exported packages need to match the Spark version. Additionally, the Docker image was recently upgraded to Spark 3.5, which Delta does not currently support (as per https://docs.delta.io/latest/releases.html). Because of that change, I used the jupyter/pyspark-notebook:spark-3.4.1 tag to get a compatible version.

    I am the owner of https://github.com/handreassa/delta-docker repo, so I already did all the changes needed to get it back working (https://github.com/handreassa/delta-docker/commit/f0ef9c387a20565ea75d6a846ca354b2052709f6).
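
    For reference, the pinning boils down to something like this sketch (Delta Lake 2.4.x is the release line that targets Spark 3.4.x per the compatibility table in the Delta docs; the exact pip pin here is an assumption):

        # Pin the image to a Spark version that Delta Lake supports
        FROM jupyter/pyspark-notebook:spark-3.4.1

        # Match delta-spark to the image's Spark 3.4.x
        RUN pip install --no-cache-dir delta-spark==2.4.0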

    Code tested in the image for reference (screenshot of the working notebook not reproduced here).
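
    Since the screenshot does not carry over, a typical round-trip check along those lines would look something like this (a sketch; the /tmp/delta-test path is arbitrary):

        import pyspark
        from delta import configure_spark_with_delta_pip

        builder = (
            pyspark.sql.SparkSession.builder.appName("LocalDelta")
            .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
            .config("spark.sql.catalog.spark_catalog",
                    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        )
        spark = configure_spark_with_delta_pip(builder).getOrCreate()

        # Write a small Delta table and read it back to confirm the jars loaded
        spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-test")
        spark.read.format("delta").load("/tmp/delta-test").show()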
