
I’ve seen many iterations of this question but cannot seem to understand/fix this behavior.

I am on Azure Databricks working on DBR 10.4 LTS (Spark 3.2.1, Scala 2.12), trying to write a single CSV file to blob storage so that it can be dropped to an SFTP server. Unfortunately I could not use spark-sftp because I am on Scala 2.12 and could not get the library to work.

Given this is a small dataframe, I am converting it to pandas and then attempting to_csv.

to_export = df.toPandas()

pathToFile = '/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv'
to_export.to_csv(pathToFile, index=False)

I get the error: [Errno 2] No such file or directory: '/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv'

Based on the information in other threads, I create the directory:

    dbutils.fs.mkdirs("/dbfs/mnt/adls/Sandbox/user/project_name/")
    Out[40]: True

The response is true and the directory exists, yet I still get the same error. I’m convinced it is something obvious and I’ve been staring at it for too long to notice. Does anyone see what my error may be?

2 Answers


    • Python's pandas library recognizes the path only when it is in File API format, i.e. with the local /dbfs prefix (since you are using a mount). dbutils.fs.mkdirs, on the other hand, uses Spark API format (dbfs:/... or a bare /... path), which is different from File API format.

    • As you are creating the directory using dbutils.fs.mkdirs with the path /dbfs/mnt/adls/Sandbox/user/project_name/, this path is actually interpreted as dbfs:/dbfs/mnt/adls/Sandbox/user/project_name/. Hence, the directory is created in the DBFS root rather than in your mounted storage.

    dbutils.fs.mkdirs('/dbfs/mnt/repro/Sandbox/user/project_name/')
    

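    Listing the parent directory makes the problem visible (a small sketch; dbutils and display are available in Databricks notebooks, and the exact listing depends on your workspace):

    # The directory landed under dbfs:/dbfs/..., i.e. in the DBFS root,
    # not inside the mounted storage
    display(dbutils.fs.ls('dbfs:/dbfs/mnt/repro/Sandbox/user/'))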

    • So, you have to modify the code that creates the directory as follows:
    dbutils.fs.mkdirs('/mnt/repro/Sandbox/user/project_name/')
    #OR
    #dbutils.fs.mkdirs('dbfs:/mnt/repro/Sandbox/user/project_name/')
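
    As a sanity check (a minimal sketch using only the standard library os module), the directory created through the Spark API path is now visible to local-file libraries such as pandas under the /dbfs prefix:

    import os

    # File API view of the directory created above with dbutils.fs.mkdirs
    print(os.path.exists('/dbfs/mnt/repro/Sandbox/user/project_name/'))  # True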
    
    • Writing to the folder would now work without any issue (pdf here is the pandas DataFrame, i.e. to_export in the question):
    pdf.to_csv('/dbfs/mnt/repro/Sandbox/user/project_name/testfile.csv', index=False)
    

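    Applied back to the asker's original mount, the fixed sequence would look like this (a sketch; df is the asker's Spark DataFrame):

    # Spark API path (no /dbfs prefix) for dbutils
    dbutils.fs.mkdirs('dbfs:/mnt/adls/Sandbox/user/project_name/')

    # File API path (/dbfs prefix) for pandas
    to_export = df.toPandas()
    to_export.to_csv('/dbfs/mnt/adls/Sandbox/user/project_name/testfile.csv', index=False)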

    Are you working in a repo? If you are, .to_csv() will try to save to the working directory of your repo and will not be able to access DBFS.
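
    You can check where relative paths resolve from with a quick sketch (standard library only; the printed path is illustrative):

    import os

    # Inside a repo this prints something like /Workspace/Repos/<user>/<repo>,
    # which is not under /dbfs
    print(os.getcwd())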

    To export your Spark df as CSV to DBFS, try:

    (sparkdf.coalesce(1)
            .write.format("com.databricks.spark.csv")
            .option("header", "true")
            .save("dbfs:/path/to/file.csv"))
    

    Your CSV file will be at dbfs:/path/to/file.csv/part-00000-tid-XXXXXXXX.csv.
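
    If you need a single file with a fixed name (e.g. for the SFTP drop), one option is to move the part file afterwards; a minimal sketch, assuming the output directory from above and an illustrative target name single.csv:

    # Locate the generated part file inside the Spark output directory
    part_file = [f.path for f in dbutils.fs.ls('dbfs:/path/to/file.csv')
                 if f.name.startswith('part-')][0]

    # Move it to a fixed name, then remove the Spark output directory
    dbutils.fs.mv(part_file, 'dbfs:/path/to/single.csv')
    dbutils.fs.rm('dbfs:/path/to/file.csv', True)  # True = recurse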
