I have the .jpg pictures in the data lake in my blob storage. I am trying to load the pictures and display them for testing purposes but it seems like they can’t be loaded properly. I tried a few solutions but none of them showed the pictures.
path = 'abfss://[email protected]/username/project_name/test.jpg'
spark.read.format("image").load(path) -- Method 1
display(spark.read.format("binaryFile").load(pic)) -- Method 2
Method 1 brought this error. It looks like a binary file (converted from jpg to binary) and that’s why I tried solution 2 but that did not load anything either.
Out[51]: DataFrame[image: struct<origin:string,height:int,width:int,nChannels:int,mode:int,data:binary>]
For method 2, I see this error when I ran it.
SecurityException: Your administrator has forbidden Scala UDFs from being run on this cluster
I cannot install the libraries very easily. It needs to be reviewed and approved by the administrators first so please suggest something with spark and/or python libraries if possible. Thanks
Edit:
I added these 2 lines and it looks like the image has been read but it cannot be loaded for some reason. I am not sure what’s going on. The goal is to read it properly and decode the pictures eventually but it cannot happen until the picture is loaded properly.
df = spark.read.format("image").load(path)
df.show()
df.printSchema()
2
Answers
Finally after spending hours on this, I found a solution which was pretty straight forward but drove me crazy. I was on a right path but needed to print the pictures using "binaryFile" format in Spark. Here is what worked for me.
It looks like binaryFile is a right format at least in this case and the upper code was able to decode it successfully.
I tried to reproduce the same in my environment, loading dataset into databricks and got below results:
Or
Now, using below code, I got these results.
Update:
Reference:
Mounting Azure Data Lake Storage Gen2 Account in Databricks By Ron L’Esteve.