
How can I prevent duplicate data transfer to an Amazon Redshift table using an AWS Glue job? Daily CSV files are added to an S3 bucket, and my Glue job, which loads data from those CSV files into a Redshift table, re-transfers every file each time it runs. Is there a way to avoid duplicating data in this process?

I tried modifying the Glue script, but it isn't working.

2 Answers


  1. Enable job bookmarks, so that each run processes only the files it has not seen before (a minimal sketch follows this answer).

    Alternatively, you can write a small Python script that moves every file that has already been processed into a designated prefix named ‘importedFiles’. That way, each new run has a precise record of which files have already been handled (see the second sketch below).
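    A minimal sketch of a bookmark-aware Glue script, assuming the job runs with the --job-bookmark-enable job parameter; the bucket path is a placeholder. The transformation_ctx value together with the job.init/job.commit calls is what lets the bookmark record which S3 files have already been read:

    ```python
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # required for bookmarks to start tracking

    # transformation_ctx is the key the bookmark uses to remember processed files
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/daily-csv/"]},  # placeholder path
        format="csv",
        format_options={"withHeader": True},
        transformation_ctx="source",
    )

    # ... write `source` to the Redshift table as the existing job already does ...

    job.commit()  # advances the bookmark so these files are skipped on the next run
    ```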

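    And a sketch of the file-tracking alternative, assuming the job can call boto3. S3 has no native move operation, so the script copies each processed object into the ‘importedFiles’ prefix and then deletes the original; the bucket name and the archive_processed helper are placeholders:

    ```python
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"  # placeholder bucket name

    def archive_processed(keys):
        """Move already-loaded CSV keys out of the input prefix."""
        for key in keys:
            dest = "importedFiles/" + key.split("/")[-1]
            # Copy into the archive prefix, then delete the original
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=dest,
            )
            s3.delete_object(Bucket=BUCKET, Key=key)
    ```

    Calling archive_processed at the end of a successful run means the next run's listing of the input prefix contains only new files.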
  2. Use Redshift Spectrum to query the data directly from S3 without loading it into Redshift; then no scheduled job is needed. Or use the new auto-copy capability (currently in preview), which ingests new S3 files into Redshift automatically. A Spectrum sketch follows.
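    A sketch of querying the CSVs in place with Redshift Spectrum, assuming an external schema and table (here called spectrum_schema.daily_csv) have already been created over the S3 location; the cluster, database, and user names are placeholders. This uses the Redshift Data API via boto3:

    ```python
    import boto3

    client = boto3.client("redshift-data")

    # Query the external (Spectrum) table directly; nothing is loaded into Redshift
    client.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="dev",                  # placeholder
        DbUser="awsuser",                # placeholder
        Sql="SELECT * FROM spectrum_schema.daily_csv LIMIT 10;",
    )
    ```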
