I have an Azure storage account (ADLS Gen2) and need to copy files such as config.yaml, text files, and gz files so that I can reference them inside my code.
I have tried the steps listed in https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is mount a filesystem. If you reference it with, for example:
yaml_file_test = mssparkutils.fs.head("synfs:/79/myMount/Test2/config.yaml", 100)
it returns a Spark dataframe and not a file.
The YAML file contains a lot of local variables defined to be used throughout the project.
What I’m trying to achieve is something like the code below.
import yaml
from yaml.loader import SafeLoader

with open('synfs:/80/myMount/Test2/config.yaml') as f:
    data = yaml.load(f, Loader=SafeLoader)
    print(data)
The problem is that PySpark doesn’t recognise the path and gives an error: FileNotFoundError: [Errno 2] No such file or directory: 'synfs:/80/myMount/Test2/config.yaml'
I have to access other files in a similar manner and open them as file objects to traverse and do some operations. For example, some libraries like wordninja expect a "gz" file and not a dataframe. When I try that, I get the same error.
If my approach is not correct, can anyone help with how to actually create global variables inside the Azure Synapse environment, and how to create file objects from Azure storage?
Just to note, I have also tried other methods of reading from storage, like the code below, but the problem is that all of them only return a path to be read into a dataframe.
spark.conf.set("spark.storage.synapse.linkedServiceName", LinkService)
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
print("Connection Setup Successful!")
return
except Exception as e:
print("Connection Setup Failed!- "+str(e))
return -1
def spark_init(app_name: str = 'Mytest'):
spark = SparkSession.builder.appName(app_name).getOrCreate()
sc = spark.sparkContext
return (spark, sc)
def getStream(streamsetlocation) :
try:
spark, sc = spark_init()
setupConnection(spark,LinkService)
print(streamsetlocation)
dfStandardized = spark.read.format("csv").options(header=True).load(streamsetlocation)
Any help would be deeply appreciated.
3 Answers
I could not get the above mount point to read/write binary files, but I used fsspec to write a Python pickle file to Azure Blob Storage and read it back.
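A minimal sketch of that approach, assuming the adlfs backend for fsspec is installed; the account, container, and path names are placeholders:

import pickle
import fsspec

# Placeholder credentials and paths - replace with your own storage details
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",  # or a SAS token / other credential
}

data = {"env": "dev", "threshold": 0.8}

# Write a pickle file to the storage account through fsspec/adlfs
with fsspec.open("abfs://<container>/Test2/config.pkl", mode="wb", **storage_options) as f:
    pickle.dump(data, f)

# Read it back as an ordinary file object
with fsspec.open("abfs://<container>/Test2/config.pkl", mode="rb", **storage_options) as f:
    restored = pickle.load(f)

print(restored)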
I found this answer to be the solution for the same problem I faced:
ShaikMaheer-MSFT (Microsoft Employee) | answered Feb 15 2022 at 5:25 PM | Accepted Answer
Hi @gmfx-5106 ,
Got a response from the product group (PG). Below are the details.
Currently, the file mount API always mounts via the blob endpoint instead of the dfs endpoint, so please make sure to create an MPE (Managed Private Endpoint) to the blob endpoint instead of dfs.
Mounting via the dfs endpoint for Gen2 storage will be supported soon; there is no ETA at this moment. Thank you.
Hope this helps.
Most Python packages expect a local file system. The open command likely isn’t working because it is looking for the YAML’s path on the cluster’s local file system.
You can create a temp directory on the cluster and copy the file there. "/tmp" already exists on the cluster, so I typically create "/tmp/temp". The code to copy the file there would then be:
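A minimal sketch, assuming the synfs mount path from the question and that mssparkutils.fs.cp accepts the node-local file: scheme as a destination:

import os

# Create a scratch directory on the driver node
os.makedirs("/tmp/temp", exist_ok=True)

# Copy the file from the mount to the local file system (mount path taken from the question)
mssparkutils.fs.cp("synfs:/80/myMount/Test2/config.yaml", "file:/tmp/temp/config.yaml")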
After copying the file over, this code should work to open the file:
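For example, reusing the SafeLoader import from the question:

import yaml
from yaml.loader import SafeLoader

# The local copy can be opened like any ordinary file
with open("/tmp/temp/config.yaml") as f:
    data = yaml.load(f, Loader=SafeLoader)
print(data)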
That being said, you can also directly read the YAML from storage as a string by using:
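One way to do this (an assumption, not necessarily the answer's original code) is Spark's wholetext reader, which loads the file into a single row and column:

# Read the whole file as one record and pull out the string value
raw = spark.read.text("synfs:/80/myMount/Test2/config.yaml", wholetext=True).first()[0]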
From there it should be fairly straightforward to parse the values from the string. If you are using a Python package to manipulate the GZ files, you will most likely need to copy the GZ files to the cluster first.
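A sketch of both steps, assuming a hypothetical words.txt.gz on the same mount and using wordninja's LanguageModel, which loads a gzipped word list from a local path:

import yaml
from yaml.loader import SafeLoader
import wordninja

# Parse the YAML string read above into a dict of config values
config = yaml.load(raw, Loader=SafeLoader)

# Copy the .gz file to the local file system before handing it to wordninja
mssparkutils.fs.cp("synfs:/80/myMount/Test2/words.txt.gz", "file:/tmp/temp/words.txt.gz")
lm = wordninja.LanguageModel("/tmp/temp/words.txt.gz")
print(lm.split("thisisatest"))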