Amazon Web Services - AWS Glue/Athena: Combine many small parquet files for performance

mfcss
September 22, 2023
155 views
0 votes
2 Answers

I have a use case as follows:

An IoT device is uploading many small files into an AWS S3 input bucket. Upon upload, every single file is processed by an AWS Lambda function to decode the data and write a parquet output file based on the uploaded IoT log file.

The parquet file is written to an output S3 bucket, which serves as my data lake. The output S3 bucket is crawled by AWS Glue and queried via AWS Athena.

The structure of my output S3 bucket is as follows:

s3://outputbucket/deviceid/tablename/fileid.parquet

Note that I deliberately do not have partitions as this is not possible within my use case.

My challenge is this:

Due to the file-by-file processing, I get a very large number of small parquet files (e.g. millions of 5-100 KB files) in my data lake, which reduces Athena’s performance.

I would like to deploy a serverless AWS service that will on-demand "concatenate/combine" my small parquet files into larger parquet files in a simple way. The resulting larger parquet files should be compressed etc. For example, a single table folder may contain 1 GB of data across 10K files – and the service/job should "transform" this folder to contain 1 GB of data across 10 files.

At the same time, I want these combined files to remain in my output S3 bucket (rather than being put into a separate S3 bucket), as I ideally want the ‘combined’ data to be queried together with the new incoming data from my existing Lambda functions.

Can this be done via e.g. AWS Glue, AWS Athena or some other solution?

Answers

- awesome_crab
- September 22, 2023 at 1:42 pm
- 0 votes
Sorry, I am no AWS expert. I usually fail to understand their docs…
However, maybe S3 Select could be of interest to you:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/selecting-content-from-objects.html

With this you can select columns from CSV, Parquet and JSON files. I would expect it to work across multiple files as well. Hope that helps, and good luck to you!

- RobertKossendey
- September 22, 2023 at 2:45 pm
- 0 votes
You can use AWS Glue ETL to repartition your table, thus compacting files.
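To illustrate the repartition idea, here is a minimal PySpark sketch of what such a Glue job could look like. It is only a sketch under assumptions: the function names, the 128 MiB target size and the staging prefix are hypothetical, and `spark` is assumed to be the Spark session available inside the Glue job (e.g. `glueContext.spark_session`). The total size of the folder has to be supplied by the caller, for instance by summing object sizes from an S3 listing.

```python
import math


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Return how many output files are needed so each is roughly target_file_bytes.

    The 128 MiB default is an assumption, not a value from the question.
    """
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def compact_prefix(spark, source_path: str, staging_path: str, total_bytes: int) -> int:
    """Rewrite the small parquet files under source_path as fewer, larger files.

    The output goes to a separate staging prefix, because overwriting a path
    while Spark is still reading from it is not safe. Swapping the compacted
    files in for the originals is a separate step that is not shown here.
    """
    n = target_file_count(total_bytes)
    df = spark.read.parquet(source_path)
    (
        df.repartition(n)                    # shuffle rows into n partitions -> n output files
        .write.mode("overwrite")
        .option("compression", "snappy")     # compressed parquet output
        .parquet(staging_path)
    )
    return n
```

Inside a job this might be called as `compact_prefix(spark, "s3://outputbucket/deviceid/tablename/", "s3://outputbucket/_compaction_staging/deviceid/tablename/", total_bytes)`, where the staging prefix is made up for illustration. With 1 GB of input and a 100 MB target, `target_file_count` gives the 10 files mentioned in the question.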

It might also be worth looking into Delta Lake, since it provides compaction out of the box.
