
I’ve been going round in circles trying to write to a Blob Storage account in Azure. Currently I’m creating a Spark session with the following setup:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Azure Blob Storage Access")
    .config(
        "spark.jars.packages",
        ",".join([
            "org.apache.hadoop:hadoop-azure:3.3.1",
            "com.microsoft.azure:azure-storage-blob:11.0.1",
            "org.apache.hadoop:hadoop-azure:3.4.0",
            "org.eclipse.jetty:jetty-util:11.0.7",
            "org.apache.hadoop.thirdparty:hadoop-shaded-guava:1.1.1",
            "org.apache.httpcomponents:httpclient:4.5.13",
            "com.fasterxml.jackson.core:jackson-databind:2.13.1",
            "com.fasterxml.jackson.core:jackson-core:2.13.1",
            "org.eclipse.jetty:jetty-util-ajax:11.0.7",
            "org.apache.hadoop:hadoop-common:3.3.1",
            "com.microsoft.azure:azure-keyvault-core:1.2.6",
        ]),
    )
    .getOrCreate()
)

Which results in the following error:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2592)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2686)

I’ve also tried installing the above jars at the following path:

/opt/homebrew/Cellar/apache-spark/3.5.1/libexec/jars

But I end up with the same error being kicked back. I’m at a bit of a loss as to where to go with this. I’ve seen a couple of success stories running this code by either setting the jars parameter within the session config or by having them installed locally. How the hell do I get this working?
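One way to narrow this down is to check whether any of the locally installed jars actually bundles the missing class. A diagnostic sketch (the helper name is mine, and the Homebrew jar path is the one from above):

```python
import pathlib
import zipfile

def jars_containing(class_name: str, jar_dir: str) -> list[str]:
    """Return the jars under jar_dir that bundle the given .class entry."""
    # Jars are zip files whose entries use '/' separators; the inner-class
    # marker '$' stays verbatim in the file name.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(pathlib.Path(jar_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits

print(jars_containing(
    "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure",
    "/opt/homebrew/Cellar/apache-spark/3.5.1/libexec/jars",
))
```

If this prints an empty list, the class really isn’t on the classpath; if it prints more than one jar, you likely have conflicting hadoop-azure versions installed side by side.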

UPDATE:
I’ve confirmed the SPARK_HOME and JAVA_HOME paths in the .env. With the jar files installed locally, I’ve managed to get a different error now.

Py4JJavaError: An error occurred while calling o307.csv.
: java.lang.NoClassDefFoundError: org/eclipse/jetty/util/ajax/JSON$Convertor

Not sure this is any better, but at least it’s different.

2 Answers


  1. Chosen as BEST ANSWER

    The root cause was finally identified as the Jetty-Util and Jetty-Util-Ajax versions being 11 instead of 9. Thanks to the answers above pushing me, I managed to confirm the Azure and Hadoop jars were not the issue. Eventually I came across this link about Jetty-Util having deprecated classes in version 10+. I dropped down to Jetty-Util and Jetty-Util-Ajax v9.4.45 from Maven, updated my JDK to OpenJDK 16, and successfully wrote to the Azure Storage Account.
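Translated back into the session setup from the question, the fix amounts to pinning both Jetty artifacts to the 9.x line. A sketch (note that published Jetty 9.4.x coordinates carry a date qualifier; `9.4.45.v20220203` is assumed here for v9.4.45):

```python
# Pin jetty-util and jetty-util-ajax to 9.x instead of 11.x.
packages = ",".join([
    "org.apache.hadoop:hadoop-azure:3.3.1",
    "com.microsoft.azure:azure-storage:8.6.6",
    "org.eclipse.jetty:jetty-util:9.4.45.v20220203",
    "org.eclipse.jetty:jetty-util-ajax:9.4.45.v20220203",
])

# With pyspark installed, the session would then be built as:
# spark = (
#     SparkSession.builder.appName("Azure Blob Storage Access")
#     .config("spark.jars.packages", packages)
#     .getOrCreate()
# )
print(packages)
```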


  2. Here’s how I did it:

    from pyspark.sql import SparkSession
    
    spark = (
        SparkSession.builder.appName("Azure Blob Storage Access")
        .config(
            "spark.jars.packages",
            ",".join(
                [
                    "org.apache.hadoop:hadoop-azure:3.3.1",
                    "com.microsoft.azure:azure-storage:8.6.6",
                ]
            ),
        )
        .getOrCreate()
    )
    
    SECRET_ACCESS_KEY = "**secret_account_key**"
    STORAGE_NAME = "**storage_account_name**"
    file_path = "abfss://**container**@**storage_account_name**.dfs.core.windows.net/"
    
    spark.sparkContext._jsc.hadoopConfiguration().set(
        "fs.azure.account.key." + STORAGE_NAME + ".dfs.core.windows.net", SECRET_ACCESS_KEY
    )
    
    data = [
        ("X", "20-01-2023", "N"),
        ("X", "21-01-2023", "S"),
        ("X", "22-01-2023", "S"),
        ("X", "23-01-2023", "N"),
        ("X", "24-01-2023", "E"),
        ("X", "25-01-2023", "E"),
        ("Y", "20-01-2023", "S"),
        ("Y", "23-01-2023", "S"),
    ]
    # Create DataFrame
    df = spark.createDataFrame(data, ["id", "date", "state"])
    
    # Write the dataframe to the Azure ADLS location:
    df.write.format("csv").mode("overwrite").save(file_path + "magic.csv")
    
    # Read the data from the Azure ADLS location:
    spark.read.format("csv").load(file_path + "magic.csv").show()
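Worth noting: the ClassNotFoundException in the question refers to `NativeAzureFileSystem`, the legacy `wasbs://` connector, while this answer writes through `abfss://`. The two schemes key off different endpoints, so if the legacy scheme is needed the property names change. A sketch with placeholder names mirroring the masked values above:

```python
# Placeholders, mirroring the masked names in the answer above.
STORAGE_NAME = "storage_account_name"
CONTAINER = "container"

# abfss:// (the ABFS driver used by this answer) keys off the dfs endpoint:
abfss_key_property = f"fs.azure.account.key.{STORAGE_NAME}.dfs.core.windows.net"
abfss_path = f"abfss://{CONTAINER}@{STORAGE_NAME}.dfs.core.windows.net/"

# wasbs:// (the legacy NativeAzureFileSystem driver from the question's
# stack trace) keys off the blob endpoint instead:
wasbs_key_property = f"fs.azure.account.key.{STORAGE_NAME}.blob.core.windows.net"
wasbs_path = f"wasbs://{CONTAINER}@{STORAGE_NAME}.blob.core.windows.net/"

print(abfss_key_property)
print(wasbs_path)
```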
    