
I’m struggling with exporting a pyspark.pandas.DataFrame to an Excel file.

I’m working in an Azure Databricks notebook with PySpark.
My goal is to read a CSV file from an Azure Data Lake Storage (ADLS) container and store it as an Excel file in another ADLS container.

I’m running into a lot of difficulties related to performance and the available methods.
pyspark.pandas.DataFrame has a built-in to_excel method, but with files larger than 50 MB the command ends with a time-out error after one hour (this seems to be a well-known problem).

Below you can find an example of the code. It ends by saving the file to DBFS (there are still problems integrating the to_excel method with Azure), and then I move the file to ADLS.

import pyspark.pandas as ps

# authenticate against the ADLS Gen2 storage account
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net", storage_account_key)

reference_path = f'abfss://{source_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}'

# read the CSV from ADLS into a pyspark.pandas DataFrame
df = ps.read_csv(reference_path, index=None)

# write it out as an Excel file (this is the step that times out on large files)
df.to_excel(file_name, sheet_name='sheet')
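
For completeness, the follow-up move to ADLS looks roughly like this (a sketch, not the exact code: it assumes to_excel wrote the .xlsx into the driver's local working directory, and target_container_name is a placeholder for the destination container):

import os

# copy the locally written Excel file over to the destination ADLS container
dbutils.fs.cp(
    f"file:{os.getcwd()}/{file_name}",
    f"abfss://{target_container_name}@{storage_account_name}.dfs.core.windows.net/{file_name}"
)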

pyspark.pandas.DataFrame is the API Databricks recommends for working with DataFrames (it replaces Koalas), but I can’t find any solution to my problem other than converting the DataFrame to a plain pandas one.
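
The plain-pandas fallback I mean would look something like this (just a sketch: .to_pandas() collects the whole dataset onto the driver, which is exactly what I would like to avoid on large files):

import pyspark.pandas as ps

# collect everything onto the driver as a plain pandas DataFrame,
# then use the pandas Excel writer
pdf = ps.read_csv(reference_path).to_pandas()
pdf.to_excel(file_name, sheet_name='sheet', index=False)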

Can someone please help me?

Thanks in advance!

UPDATE

Some more information about the whole pipeline.

I have a Data Factory pipeline that reads data from Azure Synapse, processes it, and stores it as CSV files in ADLS. I need Databricks because Data Factory does not have a native Excel sink connector!
I know that I could use Azure Functions or Kubernetes instead, but I started with Databricks hoping that it would be possible…

3 Answers


  1. Chosen as BEST ANSWER

    I've found a solution to the problem using the pyexcelerate package:

    from pyexcelerate import Workbook

    df = ...  # read your dataframe (e.g. a pandas DataFrame)

    # first row is the header, followed by the data rows
    values = [df.columns.to_list()] + list(df.values)
    sheet_name = 'Sheet'

    wb = Workbook()
    wb.new_sheet(sheet_name, data=values)
    wb.save(file_name)  # e.g. an .xlsx path
    

    This way, Databricks manages to process a 160 MB dataset and export it to Excel in 3 minutes.
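
    Note that the values list is still built on the driver, so the data has to fit in driver memory; what pyexcelerate removes is the slow Excel-writing step itself, since it is a write-only xlsx library built for speed.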

    Let me know if you find a better solution!


  2. You should not convert a big Spark DataFrame to pandas, because you probably will not be able to allocate that much memory.
    You can write it as a CSV instead, and it can then be opened in Excel:

    df.to_csv(path=file_name, num_files=1)
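
    Note that to_csv here goes through Spark and writes a directory of part files rather than a single file; with num_files=1 there is exactly one part, which can be copied out to a single .csv if needed (a sketch, using the Databricks dbutils utility):

    # file_name now points to a directory containing exactly one part file
    part = [f.path for f in dbutils.fs.ls(file_name) if f.name.startswith("part-")][0]
    dbutils.fs.cp(part, file_name + ".csv")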
    
  3. Hmm... it looks like you are reading from and saving to the same file name.

    Can you change

    df.to_excel(file_name, sheet_name='sheet')
    

    to

    df.to_excel("anotherfilename.xlsx", sheet_name='sheet')
    