
I have an Azure storage account (Storage Gen2) and need to copy files like config.yaml, text files and gz files so that I can reference them inside my code.
I have tried the steps listed in https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is mount a filesystem. If you reference it using, for example:
yaml_file_test = mssparkutils.fs.head("synfs:/79/myMount/Test2/config.yaml", 100), it returns a Spark DataFrame and not a file.

The YAML file contains a lot of local variables defined to be used throughout the project.

What I'm trying to achieve is something like the below:

import yaml
from yaml import SafeLoader

with open('synfs:/80/myMount/Test2/config.yaml') as f:
    data = yaml.load(f, Loader=SafeLoader)
    print(data)

The problem is PySpark doesn't recognise the path and gives an error: FileNotFoundError: [Errno 2] No such file or directory: 'synfs:/80/myMount/Test2/config.yaml'

I have to access other files too in a similar manner and mount them as file objects to traverse and do some operations. For example, some libraries like wordninja expect a "gz" file and not a DataFrame. When I try that, I get the above error.

If my approach is not correct, can anyone help with how we actually create global variables inside the Azure Synapse environment, and how to create file objects from Azure storage?

Just to note, I have also tried other methods of reading from storage, like below, but the problem is that all of them only accept a path to read into a DataFrame.

spark.conf.set("spark.storage.synapse.linkedServiceName", LinkService)
        spark.conf.set("fs.azure.account.oauth.provider.type", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
        print("Connection Setup Successful!")
        return
    except Exception as e:
        print("Connection Setup Failed!- "+str(e))
        return -1

def spark_init(app_name: str = 'Mytest'):
    spark = SparkSession.builder.appName(app_name).getOrCreate()
    sc = spark.sparkContext
    return (spark, sc)

def getStream(streamsetlocation):
    try:
        spark, sc = spark_init()
        setupConnection(spark, LinkService)
        print(streamsetlocation)
        dfStandardized = spark.read.format("csv").options(header=True).load(streamsetlocation)
        return dfStandardized
    except Exception as e:
        print("Read Failed!- " + str(e))
        return -1

Any help would be deeply appreciated.

3 Answers


  1. Chosen as BEST ANSWER

    I could not get the above mount point to read/write binary files, but I used fsspec to write a Python pickle file and read it back from Azure Blob Storage.

    import fsspec
    import pickle

    filename = 'final_model.sav'
    sas_key = TokenLibrary.getConnectionString('')
    storage_account_name = ''
    container = ''
    fsspec_handle = fsspec.open(f'abfs://{container}/{filename}', account_name=storage_account_name, sas_token=sas_key, mode='wb')
    with fsspec_handle.open() as o_file:
        pickle.dump(model, o_file)
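
    Reading the pickle back works the same way with mode='rb'. A minimal sketch reusing the variables from the snippet above:

    # Open the same blob for reading and deserialize the model
    fsspec_handle = fsspec.open(f'abfs://{container}/{filename}', account_name=storage_account_name, sas_token=sas_key, mode='rb')
    with fsspec_handle.open() as i_file:
        model = pickle.load(i_file)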
    

  2. I found this answer to be the solution for the same problem I faced:

    ShaikMaheer-MSFT (Microsoft Employee) answered Feb 15 2022 at 5:25 PM | commented Feb 17 2022 at 10:25 AM (Accepted Answer)
    Hi @gmfx-5106 ,
    Got a response from the PG (product group). Below are the details.

    Currently the file mount API will always mount via the blob endpoint instead of the dfs endpoint, so please make sure to create an MPE (Managed Private Endpoint) to the blob endpoint instead of the dfs endpoint.

    Mounting that always uses the dfs endpoint for Gen2 storage will be available soon; there is no ETA at this moment. Thank you.

    Hope this helps.

  3. Most Python packages expect a local file system. The open call likely isn't working because it is looking for the YAML at that path on the cluster node's local file system.

    You can create a temp directory on the cluster and copy the file there. "/tmp" already exists on the cluster, so I typically create "/tmp/temp". The code to copy the file there would then be:

    # NOTE: mssparkutils.fs.cp also creates a .crc file when copying to local storage
    mssparkutils.fs.cp('synfs:/80/myMount/Test2/config.yaml', 'file:/tmp/temp/config.yaml')
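
    If "/tmp/temp" does not exist yet, it can be created first. A minimal sketch using the standard library on the driver:

    import os

    # Create the local temp directory before copying files into it
    os.makedirs('/tmp/temp', exist_ok=True)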
    

    After copying the file over, this code should work to open the file:

    with open('/tmp/temp/config.yaml') as f:
        data = yaml.load(f, Loader=SafeLoader)
    

    That being said, you can also directly read the YAML from storage as a string by using:

    # Returns pyspark.rdd.RDD object
    file_rdd = spark.read.text('synfs:/80/myMount/Test2/config.yaml', wholetext=True).rdd
    # Returns string
    yaml_data = file_rdd.take(1)[0]['value']
    

    From there it should be fairly straightforward to parse the values from the string. If you are using a Python package to manipulate the GZ files, you will most likely need to copy the GZ files to the cluster first.
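
    For example, the string from the snippet above can be parsed with PyYAML, and a gz file can be copied to local storage the same way before handing it to a library such as wordninja. A minimal sketch, assuming the mount point from the question and a hypothetical 'my_lang.words.gz' file name:

    import yaml
    import wordninja

    # Parse the YAML string obtained from spark.read.text(...) above
    config = yaml.safe_load(yaml_data)

    # Copy the gz file from the mount to the cluster's local storage so the library can open it as a regular file
    mssparkutils.fs.cp('synfs:/80/myMount/Test2/my_lang.words.gz', 'file:/tmp/temp/my_lang.words.gz')

    # wordninja can load a custom gzipped word list from a local path
    lm = wordninja.LanguageModel('/tmp/temp/my_lang.words.gz')
    print(lm.split('thisisatest'))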
