I have an Azure storage account (ADLS Gen2) and need to copy files such as config.yaml, text files, and gz files so that I can reference them inside my code.
I have tried the steps listed in https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is mount a filesystem. If you reference it with, for example:
yaml_file_test = mssparkutils.fs.head("synfs:/79/myMount/Test2/config.yaml", 100)
it returns a Spark dataframe and not a file.
The YAML file contains a lot of local variables defined to be used throughout the project.
What I’m trying to achieve is something like the code below.
import yaml
from yaml.loader import SafeLoader

with open('synfs:/80/myMount/Test2/config.yaml') as f:
    data = yaml.load(f, Loader=SafeLoader)
    print(data)
The problem is that PySpark doesn’t recognise the path and gives an error: FileNotFoundError: [Errno 2] No such file or directory: 'synfs:/80/myMount/Test2/config.yaml'
I have to access other files in a similar manner and open them as file objects to traverse and do some operations. For example, some libraries like wordninja expect a "gz" file and not a dataframe. When I try that, I get the same error.
If my approach is not correct, can anyone help with how to actually create global variables inside the Azure Synapse environment, and how to create file objects from Azure storage?
Just to note, I have also tried other methods of reading from storage, like the code below, but the problem is that all of them only return a path to be read into a dataframe.
spark.conf.set("spark.storage.synapse.linkedServiceName", LinkService)
spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
print("Connection Setup Successful!")
return
except Exception as e:
print("Connection Setup Failed!- "+str(e))
return -1
def spark_init(app_name: str = 'Mytest'):
spark = SparkSession.builder.appName(app_name).getOrCreate()
sc = spark.sparkContext
return (spark, sc)
def getStream(streamsetlocation) :
try:
spark, sc = spark_init()
setupConnection(spark,LinkService)
print(streamsetlocation)
dfStandardized = spark.read.format("csv").options(header=True).load(streamsetlocation)
Any help would be deeply appreciated.
3 Answers
I could not get the above mount point to read/write binary files, but I used fsspec to write a Python pickle file to Azure Blob Storage and read it back.
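A minimal sketch of that approach, assuming the adlfs backend for fsspec is installed; the account, container, and path names are placeholders:

import pickle
import fsspec

# Placeholder credentials and paths - replace with your own storage details
storage_options = {
    "account_name": "<storage-account>",
    "account_key": "<account-key>",  # or a SAS token / other credential
}

data = {"env": "dev", "threshold": 0.8}

# Write a pickle file to the storage account through fsspec/adlfs
with fsspec.open("abfs://<container>/Test2/config.pkl", mode="wb", **storage_options) as f:
    pickle.dump(data, f)

# Read it back as an ordinary file object
with fsspec.open("abfs://<container>/Test2/config.pkl", mode="rb", **storage_options) as f:
    restored = pickle.load(f)

print(restored)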
I found this answer to be the solution for the same problem I faced:
ShaikMaheer-MSFT (Microsoft Employee) | answered Feb 15 2022 at 5:25 PM | Accepted Answer
Hi @gmfx-5106 ,
Got a response from the product group (PG). Below are the details.
Currently, the file mount API always mounts via the blob endpoint instead of the dfs endpoint, so please make sure to create an MPE (Managed Private Endpoint) to the blob endpoint instead of dfs.
Mounting via the dfs endpoint for Gen2 storage will be supported soon; there is no ETA at this moment. Thank you.
Hope this helps.
Most Python packages expect a local file system. The open command likely isn’t working because it is looking for the YAML’s path on the cluster’s local file system.
You can create a temp directory on the cluster and copy the file there. "/tmp" already exists on the cluster, so I typically create "/tmp/temp". The code to copy the file there would then be:
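A minimal sketch, assuming the synfs mount path from the question and that mssparkutils.fs.cp accepts the node-local file: scheme as a destination:

import os

# Create a scratch directory on the driver node
os.makedirs("/tmp/temp", exist_ok=True)

# Copy the file from the mount to the local file system (mount path taken from the question)
mssparkutils.fs.cp("synfs:/80/myMount/Test2/config.yaml", "file:/tmp/temp/config.yaml")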
After copying the file over, this code should work to open the file:
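For example, reusing the SafeLoader import from the question:

import yaml
from yaml.loader import SafeLoader

# The local copy can be opened like any ordinary file
with open("/tmp/temp/config.yaml") as f:
    data = yaml.load(f, Loader=SafeLoader)
print(data)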
That being said, you can also directly read the YAML from storage as a string by using:
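One way to do this (an assumption, not necessarily the answer's original code) is Spark's wholetext reader, which loads the file into a single row and column:

# Read the whole file as one record and pull out the string value
raw = spark.read.text("synfs:/80/myMount/Test2/config.yaml", wholetext=True).first()[0]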
From there it should be fairly straightforward to parse the values from the string. If you are using a Python package to manipulate the GZ files, you will most likely need to copy the GZ files to the cluster first.
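A sketch of both steps, assuming a hypothetical words.txt.gz on the same mount and using wordninja's LanguageModel, which loads a gzipped word list from a local path:

import yaml
from yaml.loader import SafeLoader
import wordninja

# Parse the YAML string read above into a dict of config values
config = yaml.load(raw, Loader=SafeLoader)

# Copy the .gz file to the local file system before handing it to wordninja
mssparkutils.fs.cp("synfs:/80/myMount/Test2/words.txt.gz", "file:/tmp/temp/words.txt.gz")
lm = wordninja.LanguageModel("/tmp/temp/words.txt.gz")
print(lm.split("thisisatest"))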