I have a Spark DataFrame in which every row contains two items: a file name (with an extension, for instance .jpg) and the content of that file in bytes.
I would like to write a process that takes each row of the DataFrame, converts the bytes into a .jpg image and stores it in an ADLS container.
Everything needs to run inside a Databricks cluster, so I use PySpark to create the DataFrame, and I would like to use it to write those files to the destination as well.
However, I am having trouble when I use the azure-storage library to write those files from inside a map function, where consume_row uses the library to create each file and write its content:
results_rdd = rdd.map(lambda row: consume_row(row, ...))
It returns the following error:
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread._local' object
Has anyone tried to do anything similar to this?
2 Answers
The problem was inside the consume_row function. We were using a variable to store the token for the API which, underneath, was using a thread-local Python object to store the token itself, and that object cannot be serialized by pickle to be sent to the workers. So we just needed to pass the token itself to the function, and everything works perfectly.
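A minimal sketch of the fix, assuming azure-storage-blob v12, placeholder account/container values, and that the DataFrame columns are called file_name and content: the token travels as a plain string, and the client is created inside the function on the worker, so nothing holding a _thread._local object has to be pickled.

from azure.storage.blob import BlobServiceClient

ACCOUNT_URL = "https://<storage-account>.blob.core.windows.net"  # placeholder
CONTAINER = "images"                                             # placeholder

def consume_row(row, token):
    # Build the client on the executor; only the row and the token string
    # need to be pickled and shipped from the driver.
    service = BlobServiceClient(account_url=ACCOUNT_URL, credential=token)
    blob = service.get_blob_client(container=CONTAINER, blob=row["file_name"])
    blob.upload_blob(bytes(row["content"]), overwrite=True)
    return row["file_name"]

token = "<sas-token>"  # a plain string pickles without problems
results_rdd = df.rdd.map(lambda row: consume_row(row, token))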
PySpark uses Python's pickle module for serialization, and certain objects (like _thread._local objects) cannot be pickled.
To achieve your goal, you need to mount your ADLS Gen2 account and save the files there.
Follow the steps below.
Here, I am mounting using the account key. You can also mount using a SAS token or OAuth.
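A minimal sketch of the mount, assuming an ADLS Gen2 account, a container called "data", and placeholder values for the storage account name and key (reading the key from a secret scope is just one option; OAuth with a service principal is the other common choice):

storage_account = "<storage-account>"   # placeholder
container = "data"                      # placeholder
account_key = dbutils.secrets.get(scope="my-scope", key="account-key")  # or the key itself

dbutils.fs.mount(
    source=f"abfss://{container}@{storage_account}.dfs.core.windows.net/",
    mount_point="/mnt/adls",
    extra_configs={
        f"fs.azure.account.key.{storage_account}.dfs.core.windows.net": account_key
    },
)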
Next, use the code below to write each row out as a .jpg file.
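A minimal sketch of the write step, assuming the DataFrame is called df with the columns file_name and content, the /mnt/adls mount point created above, and a standard cluster where the /dbfs FUSE path is available on the workers (the images sub-folder is just an example):

import os

target_dir = "/dbfs/mnt/adls/images"   # /dbfs exposes the mount as a local path
os.makedirs(target_dir, exist_ok=True)

def write_row(row):
    # Plain file I/O is enough once the container is mounted.
    with open(os.path.join(target_dir, row["file_name"]), "wb") as f:
        f.write(row["content"])

# Each executor writes its own rows straight into the mounted container.
df.foreach(write_row)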
Output: the .jpg files appear in the mounted ADLS container.