
How can I prevent duplicate data transfer to an Amazon Redshift table using an AWS Glue job? Daily CSV files are added to an S3 bucket, and my Glue job, which loads data from those CSV files into a Redshift table, re-transfers every file each time it runs. Is there a way to avoid duplicating data in this process?

I tried modifying the Glue script, but it isn't working.

2 Answers


  1. Enable job bookmarks, so that each run processes only the files it has not seen before (a minimal sketch follows this answer).

    Alternatively, you can write a small Python script that moves every file that has already been processed into a designated prefix named ‘importedFiles’. That way, each new run has a precise record of which files have already been handled (see the second sketch below).
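    A minimal sketch of a bookmark-aware Glue script, assuming the job runs with the --job-bookmark-enable job parameter; the bucket path is a placeholder. The transformation_ctx value together with the job.init/job.commit calls is what lets the bookmark record which S3 files have already been read:

    ```python
    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)  # required for bookmarks to start tracking

    # transformation_ctx is the key the bookmark uses to remember processed files
    source = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/daily-csv/"]},  # placeholder path
        format="csv",
        format_options={"withHeader": True},
        transformation_ctx="source",
    )

    # ... write `source` to the Redshift table as the existing job already does ...

    job.commit()  # advances the bookmark so these files are skipped on the next run
    ```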

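    And a sketch of the file-tracking alternative, assuming the job can call boto3. S3 has no native move operation, so the script copies each processed object into the ‘importedFiles’ prefix and then deletes the original; the bucket name and the archive_processed helper are placeholders:

    ```python
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"  # placeholder bucket name

    def archive_processed(keys):
        """Move already-loaded CSV keys out of the input prefix."""
        for key in keys:
            dest = "importedFiles/" + key.split("/")[-1]
            # Copy into the archive prefix, then delete the original
            s3.copy_object(
                Bucket=BUCKET,
                CopySource={"Bucket": BUCKET, "Key": key},
                Key=dest,
            )
            s3.delete_object(Bucket=BUCKET, Key=key)
    ```

    Calling archive_processed at the end of a successful run means the next run's listing of the input prefix contains only new files.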
  2. Use Redshift Spectrum to query the data directly from S3 without loading it into Redshift; then no scheduled job is needed. Or use the new auto-copy capability (currently in preview), which ingests new S3 files into Redshift automatically. A Spectrum sketch follows.
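    A sketch of querying the CSVs in place with Redshift Spectrum, assuming an external schema and table (here called spectrum_schema.daily_csv) have already been created over the S3 location; the cluster, database, and user names are placeholders. This uses the Redshift Data API via boto3:

    ```python
    import boto3

    client = boto3.client("redshift-data")

    # Query the external (Spectrum) table directly; nothing is loaded into Redshift
    client.execute_statement(
        ClusterIdentifier="my-cluster",  # placeholder
        Database="dev",                  # placeholder
        DbUser="awsuser",                # placeholder
        Sql="SELECT * FROM spectrum_schema.daily_csv LIMIT 10;",
    )
    ```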
