
AWS Glue is new to me.
I’m working with AWS Glue and I’m trying to read from an RDS database table and write the data to an S3 bucket as a single CSV file. I’ve set up an AWS Glue job using Visual ETL in the Glue console, selecting a relational database as the source and an S3 bucket as the target, with a table (I’ve created a crawler for the schema). The job succeeded, but I’m getting 10 separate files in the target S3 location. Does anyone have insight into how I can get only one consolidated file in the S3 bucket?

2 Answers


  1. Each Glue job runs on multiple workers in parallel.

    Your issue is most likely related to the default number of Glue job workers (it’s 10).

    You have to implement an output merge step in the Glue job code (see the sketch below).

    Check this:

    aws Glue job: how to merge multiple output .csv files in s3
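    As a rough illustration, here is a minimal sketch of what that merge step could look like inside the Glue script. The database, table, and bucket names are placeholders, not values from the question:

    # Minimal sketch (placeholder names): read the crawled catalog table,
    # collapse to one partition, and write a single CSV object to S3.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Read the RDS-backed table through the Glue Data Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",   # placeholder
        table_name="my_table"     # placeholder
    )

    # Convert to a Spark DataFrame and merge everything into one partition
    (dyf.toDF()
        .coalesce(1)
        .write
        .mode("overwrite")
        .option("header", "true")
        .csv("s3://my-bucket/output/"))   # placeholder path

    Even with a single partition, Spark still writes the result as a part-00000 file inside that prefix; if you need an exact file name, you would rename the object afterwards (for example with boto3).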

  2. When you’re using AWS Glue, it’s good to know that Glue runs Apache Spark in the background. Spark writes output in parallel, one file per partition, which speeds things up for big data sets, but it also means you can end up with lots of output files.

    To address this, you can use the coalesce function within the Glue job script to reduce the number of partitions and consolidate the data into a single file. Here’s a simple example:

    # put all the data into just one file (df is the job’s Spark DataFrame)
    df.coalesce(1).write.option("header", "true").csv("s3://your-bucket/output/")  # placeholder path
    

    But be careful: coalesce(1) means all of the data ends up in a single partition and is written by one task, which can slow things down for large data sets. So it’s a balance between getting one file and keeping the write parallel and fast, depending on how much data you have and what you need to do with it.

    Spark – repartition() vs coalesce()
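    For context, a small illustrative contrast (paths are placeholders): coalesce(1) merges the existing partitions without a full shuffle, while repartition(1) forces a full shuffle before the same single-task write:

    # coalesce(1): no full shuffle, but one task performs the entire write
    df.coalesce(1).write.option("header", "true").csv("s3://your-bucket/coalesced/")

    # repartition(1): full shuffle first, then the same single-task write;
    # usually only worth it if the data also needs rebalancing
    df.repartition(1).write.option("header", "true").csv("s3://your-bucket/repartitioned/")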
