Using PySpark 2.4.7
My goal is to write a PySpark DataFrame into a specific number of parquet files in AWS S3. Say I want to write the DataFrame into 10 parquet files.
This is my approach:
df.repartition(10).write.mode("append").partitionBy(partition_cols).saveAsTable(<db.table>)
This writes 10 parquet files for each partition bucket in S3, but I want roughly 10 files in total across all partitions. How can I achieve this?
It could be achieved as Abdennacer Lachiheb mentioned in his answer. However, as per my comment, the partitions are imbalanced, so simply dividing the total number of files by the number of partition columns would not be optimal: I would end up with 5 files of about 10 MB each for the train partition and 5 files of about 1 MB each for the valid partition. I want the files to have similar sizes. I could achieve this by stratifying, but I am wondering if there is a simpler way.
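To illustrate the stratified idea, this is roughly what I have in mind. It is only a sketch: it assumes a single partition column hypothetically named split holding the train/valid values, uses row counts as a proxy for data size, and writes to a placeholder table name db.table.

from pyspark.sql import functions as F

total_files = 10  # overall target number of files

# Count rows per partition value as a rough proxy for data size.
counts = {row["split"]: row["cnt"]
          for row in df.groupBy("split").agg(F.count("*").alias("cnt")).collect()}
total_rows = sum(counts.values())

# Give each partition value a share of the files proportional to its size,
# then write each subset with its own repartition.
for value, cnt in counts.items():
    n_files = max(1, int(round(total_files * cnt / float(total_rows))))
    (df.filter(F.col("split") == value)
        .repartition(n_files)
        .write
        .mode("append")
        .partitionBy("split")
        .saveAsTable("db.table"))  # placeholder table name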
2 Answers
We know that the total number of files is the sum of the files across all partitions:
nb_all_files = nb_files_per_partitions * len(partition_cols)
so then nb_files_per_partitions = nb_all_files / len(partition_cols)
so in your case: nb_files_per_partitions = 10 / 2 = 5
Final result:
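A minimal sketch of that result, reusing the write call from the question (df, partition_cols and the db.table placeholder all come from the question, and 10 is the requested total file count):

nb_all_files = 10
# integer division, e.g. 10 // 2 = 5 files per partition
nb_files_per_partitions = nb_all_files // len(partition_cols)

(df.repartition(nb_files_per_partitions)
    .write
    .mode("append")
    .partitionBy(partition_cols)
    .saveAsTable("db.table"))  # placeholder table name from the question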
You can use coalesce when writing the DataFrame out:
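For instance, a minimal sketch assuming the same df, partition_cols and db.table placeholder as in the question:

# coalesce(10) merges the existing partitions down to 10 without a full shuffle,
# so at most 10 tasks write output files.
(df.coalesce(10)
    .write
    .mode("append")
    .partitionBy(partition_cols)
    .saveAsTable("db.table"))  # placeholder table name from the question

Note that, combined with partitionBy, each of the 10 tasks still writes one file per partition value it contains, so the total can exceed 10 unless the data is clustered by the partition columns.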