
Using PySpark 2.4.7

My goal is to write a PySpark DataFrame into a specific number of parquet files in AWS S3.

Say I want to write the DataFrame into 10 parquet files. This is my approach:

df.repartition(10).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)

This writes 10 parquet files for each partition bucket in S3. I want roughly 10 files in total across all partitions. How can I achieve this?

This could be achieved as Abdennacer Lachiheb mentioned in his answer. However, as noted in my comment, the partition values are imbalanced, so simply dividing the total number of files by the number of partitions isn't optimal: I would end up with 5 files of 10 MB each for the train partition and 5 files of 1 MB each for the valid partition. I want the output files to have similar sizes. I could achieve this by stratifying the repartitioning, but I'm wondering whether there is a simpler way; a rough sketch of that idea follows.
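For reference, a rough sketch of that stratified idea (the target of 10 files, the n_files/salt column names, and the db.table table name are placeholders; partition_cols is assumed to be a list of column names): count the rows per partition value, give each value a file budget proportional to its share of the data, and repartition on the partition columns plus a random salt so larger partitions are split across more, similarly sized files.

from pyspark.sql import functions as F

target_total_files = 10  # assumed overall target

# rows per partition value, and a proportional file budget (at least 1) for each value
counts = df.groupBy(*partition_cols).count()
total_rows = counts.agg(F.sum("count")).first()[0]
budgets = counts.withColumn(
    "n_files",
    F.greatest(F.lit(1), F.round(F.col("count") / total_rows * target_total_files).cast("int"))
).drop("count")

# salt each row with a bucket id in [0, n_files), then shuffle on (partition columns + salt)
# so a partition value with a bigger budget is spread across more output files
salted = (df.join(budgets, on=partition_cols)
            .withColumn("salt", (F.rand() * F.col("n_files")).cast("int")))

(salted.repartition(*(list(partition_cols) + ["salt"]))
       .drop("n_files", "salt")
       .write.mode("append")
       .partitionBy(partition_cols)
       .saveAsTable("db.table"))  # placeholder table name

The resulting counts are approximate (hash collisions between salt buckets can merge a few files), but each partition value gets a number of files roughly proportional to its size.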

2 Answers


  1. We know that the total number of files is the number of files per partition multiplied by the number of partitions:

    nb_all_files = nb_files_per_partitions * len(partition_cols)
    

    so then nb_files_per_partitions = nb_all_files / len(partition_cols)

    so in your case:

    nb_files_per_partitions = 10 / len(partition_cols)
    

    Final result:

    df.repartition(10 // len(partition_cols)).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
    
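    For example, a runnable version of the same idea (a sketch: the "split" partition column and the db.table table name are made-up placeholders; repartition needs an integer, hence the floor division and the max(1, ...) guard):

    partition_cols = ["split"]  # hypothetical partition column
    nb_all_files = 10
    nb_files_per_partitions = max(1, nb_all_files // len(partition_cols))

    (df.repartition(nb_files_per_partitions)
        .write.mode("append")
        .partitionBy(partition_cols)
        .saveAsTable("db.table"))  # placeholder table name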
  2. You can use coalesce when writing the DataFrame out:

    df.coalesce(partition_count).write.parquet(<storage account path>)
    
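    Note that coalesce(n) only merges existing partitions (no full shuffle), so it never produces more than the DataFrame's current partition count, and combined with partitionBy it can still write up to n files per partition value. A minimal sketch for the S3 case in the question (the bucket/prefix path and the target of 10 files are placeholders):

    partition_count = 10  # assumed target number of output files
    df.coalesce(partition_count).write.mode("append").parquet("s3://<bucket>/<prefix>/")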