
I’ve been going round in circles trying to write to a Blob Storage account in Azure. Currently I’m creating a Spark session with the following setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Azure Blob Storage Access")
    .config(
        "spark.jars.packages",
        ",".join([
            "org.apache.hadoop:hadoop-azure:3.3.1",
            "com.microsoft.azure:azure-storage-blob:11.0.1",
            "org.apache.hadoop:hadoop-azure:3.4.0",
            "org.eclipse.jetty:jetty-util:11.0.7",
            "org.apache.hadoop.thirdparty:hadoop-shaded-guava:1.1.1",
            "org.apache.httpcomponents:httpclient:4.5.13",
            "com.fasterxml.jackson.core:jackson-databind:2.13.1",
            "com.fasterxml.jackson.core:jackson-core:2.13.1",
            "org.eclipse.jetty:jetty-util-ajax:11.0.7",
            "org.apache.hadoop:hadoop-common:3.3.1",
            "com.microsoft.azure:azure-keyvault-core:1.2.6",
        ]),
    )
    .getOrCreate()
)

Which results in the following error:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)

I’ve also tried installing the above jars at the following path:

/opt/homebrew/Cellar/apache-spark/3.5.1/libexec/jars

But I end up with the same error being kicked back. I’m at a bit of a loss as to where to go with this. I’ve seen a couple of success stories running this code by either setting the jars parameter within the session config or by having them installed locally. How the hell do I get this working?
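One way to narrow this down is to check whether any of the locally installed jars actually bundles the missing class. A diagnostic sketch (the helper name is mine, and the Homebrew jar path is the one from above):

```python
import pathlib
import zipfile

def jars_containing(class_name: str, jar_dir: str) -> list[str]:
    """Return the jars under jar_dir that bundle the given .class entry."""
    # Jars are zip files whose entries use '/' separators; the inner-class
    # marker '$' stays verbatim in the file name.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(pathlib.Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits

print(jars_containing(
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure",
    "/opt/homebrew/Cellar/apache-spark/3.5.1/libexec/jars",
))
```

If this prints an empty list, the class really isn’t on the classpath; if it prints more than one jar, you likely have conflicting hadoop-azure versions installed side by side.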

UPDATE:
I’ve confirmed the SPARK_HOME and JAVA_HOME paths in the .env. With the jar files installed locally, I’ve managed to get a different error now.

Py4JJavaError: An error occurred while calling o307.csv.
: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/ajax/JSON$Convertor

Not sure this is any better, but at least it’s different.

2 Answers


  1. Chosen as BEST ANSWER

    The root cause was finally identified as the Jetty-Util and Jetty-Util-Ajax versions being 11 instead of 9. Thanks to the answers above pushing me, I managed to confirm the Azure and Hadoop jars were not the issue. Eventually I came across this link about Jetty-Util having deprecated classes in version 10+. I dropped down to Jetty-Util and Jetty-Util-Ajax v9.4.45 from Maven, updated my JDK to OpenJDK 16, and successfully wrote to the Azure Storage Account.
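Translated back into the session setup from the question, the fix amounts to pinning both Jetty artifacts to the 9.x line. A sketch (note that published Jetty 9.4.x coordinates carry a date qualifier; `9.4.45.v20220203` is assumed here for v9.4.45):

```python
# Pin jetty-util and jetty-util-ajax to 9.x instead of 11.x.
packages = ",".join([
    "org.apache.hadoop:hadoop-azure:3.3.1",
    "com.microsoft.azure:azure-storage:8.6.6",
    "org.eclipse.jetty:jetty-util:9.4.45.v20220203",
    "org.eclipse.jetty:jetty-util-ajax:9.4.45.v20220203",
])

# With pyspark installed, the session would then be built as:
# spark = (
#     SparkSession.builder.appName("Azure Blob Storage Access")
#     .config("spark.jars.packages", packages)
#     .getOrCreate()
# )
print(packages)
```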


  2. Here’s how I did it:

    from pyspark.sql import SparkSession
    
    spark = (
        SparkSession.builder.appName("Azure Blob Storage Access")
        .config(
            "spark.jars.packages",
            ",".join(
                [
                    "org.apache.hadoop:hadoop-azure:3.3.1",
                    "com.microsoft.azure:azure-storage:8.6.6",
                ]
            ),
        )
        .getOrCreate()
    )
    
    SECRET_ACCESS_KEY = "**secret_account_key**"
    STORAGE_NAME = "**storage_account_name**"
    file_path = "abfss://**container**@**storage_account_name**.dfs.core.windows.net/"
    
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "fs.azure.account.key." + STORAGE_NAME + ".dfs.core.windows.net", SECRET_ACCESS_KEY
    )
    
    data = [
        ("X", "20-01-2023", "N"),
        ("X", "21-01-2023", "S"),
        ("X", "22-01-2023", "S"),
        ("X", "23-01-2023", "N"),
        ("X", "24-01-2023", "E"),
        ("X", "25-01-2023", "E"),
        ("Y", "20-01-2023", "S"),
        ("Y", "23-01-2023", "S"),
    ]
    # Create DataFrame
    df = spark.createDataFrame(data, ["id", "date", "state"])
    
    # Write the dataframe to the Azure ADLS location:
    df.write.format("csv").mode("overwrite").save(file_path + "magic.csv")
    
    # Read the data from the Azure ADLS location:
    spark.read.format("csv").load(file_path + "magic.csv").show()
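Worth noting: the ClassNotFoundException in the question refers to `NativeAzureFileSystem`, the legacy `wasbs://` connector, while this answer writes through `abfss://`. The two schemes key off different endpoints, so if the legacy scheme is needed the property names change. A sketch with placeholder names mirroring the masked values above:

```python
# Placeholders, mirroring the masked names in the answer above.
STORAGE_NAME = "storage_account_name"
CONTAINER = "container"

# abfss:// (the ABFS driver used by this answer) keys off the dfs endpoint:
abfss_key_property = f"fs.azure.account.key.{STORAGE_NAME}.dfs.core.windows.net"
abfss_path = f"abfss://{CONTAINER}@{STORAGE_NAME}.dfs.core.windows.net/"

# wasbs:// (the legacy NativeAzureFileSystem driver from the question's
# stack trace) keys off the blob endpoint instead:
wasbs_key_property = f"fs.azure.account.key.{STORAGE_NAME}.blob.core.windows.net"
wasbs_path = f"wasbs://{CONTAINER}@{STORAGE_NAME}.blob.core.windows.net/"

print(abfss_key_property)
print(wasbs_path)
```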
    