How can I prevent duplicate data transfer to an Amazon Redshift table using an AWS Glue job? I have a scenario where daily CSV files are added to an S3 bucket, and my Glue job, which transfers data from these CSV files to a Redshift table, repeats the transfer of all files each time it runs. Is there a way to avoid duplicating data in this process?
I tried modifying the Glue script, but it's not working.
2 Answers
Enable job bookmarks. With bookmarks enabled, AWS Glue keeps track of which S3 objects a job has already processed, so each run only picks up new files. Note that your script must call job.init() at the start and job.commit() at the end for the bookmark state to be saved.
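Bookmarks can be turned on from the console (Job details > Job bookmark) or as a job argument. A minimal sketch using the AWS CLI; the job name my-csv-to-redshift is hypothetical:

```shell
# Run a Glue job with bookmarks enabled (job name is hypothetical).
# The script itself must still call job.init(...) / job.commit()
# for the bookmark state to persist between runs.
aws glue start-job-run \
    --job-name my-csv-to-redshift \
    --arguments '{"--job-bookmark-option":"job-bookmark-enable"}'
```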
Alternatively, you can write a small Python script that moves each file into a designated prefix such as 'importedFiles/' once it has been processed. That way, every new run starts with a precise record of which files have already been handled and only loads the new ones.
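A minimal sketch of that move step using boto3; the bucket, prefixes, and helper names here are hypothetical, and the script assumes AWS credentials are already configured:

```python
def imported_key(source_key, done_prefix="importedFiles/"):
    """Map an incoming object key to its destination under the processed prefix."""
    return done_prefix + source_key.rsplit("/", 1)[-1]

def move_processed_files(bucket, keys, done_prefix="importedFiles/"):
    """Copy each processed object under done_prefix, then delete the original.

    S3 has no native rename, so a move is a copy followed by a delete.
    """
    import boto3  # imported here so the pure helper above has no AWS dependency
    s3 = boto3.client("s3")
    for key in keys:
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key=imported_key(key, done_prefix),
        )
        s3.delete_object(Bucket=bucket, Key=key)
```

At the start of the next run, list the incoming prefix and skip any file that already has a counterpart under importedFiles/.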
Use Redshift Spectrum to query the data directly in S3 without loading it into Redshift at all; then no scheduled job is needed. Or use the new auto-copy from Amazon S3 capability (in preview), which has Redshift load new files automatically as they arrive.
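For the Spectrum route, the external table is defined once over the S3 prefix and every query sees whatever CSV files are there. A minimal sketch of the DDL, built as a string so the schema, table, column names, and bucket (all hypothetical) are easy to swap; it assumes an external schema was already created with CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG, and the result is run through your usual Redshift client:

```python
def spectrum_external_table_ddl(schema, table, s3_location):
    """Build CREATE EXTERNAL TABLE DDL for CSV files under an S3 prefix.

    The column definitions below are hypothetical placeholders; adjust
    them to match the layout of your daily CSV files.
    """
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    id BIGINT,\n"
        "    event_date DATE,\n"
        "    amount DECIMAL(12,2)\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}';"
    )

print(spectrum_external_table_ddl("spectrum", "daily_events", "s3://my-bucket/incoming/"))
```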