
I started using AWS Data Wrangler and Athena to upload my data files to S3 and to query them, respectively. My question is about the procedure for "safely" updating the data in the table.

Here is what I did:

  1. I used AWS Data Wrangler,

    import awswrangler as wr

    wr.s3.to_parquet(
        df=my_df,
        path='s3://temp',
        dataset=True,
        partition_cols=['date'],
        concurrent_partitioning=True,
        mode='append',
    )

    to export my DataFrame to S3 as Parquet files.

  2. I used AWS Glue to crawl my data files in S3 and create Table A.

  3. Via Athena, I can query my data in Table A (a rough sketch of these two steps is below).
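
For reference, here is roughly what steps 2 and 3 look like in code; the crawler name my-crawler, the database my_db, and the table name table_a are placeholders rather than my actual resource names:

    import boto3
    import awswrangler as wr

    # Step 2: run the Glue crawler that catalogues the Parquet files in S3
    # ('my-crawler' is a placeholder crawler name).
    glue = boto3.client('glue')
    glue.start_crawler(Name='my-crawler')

    # Step 3: once the crawler has created Table A in the Glue catalogue,
    # query it via Athena ('my_db' and 'table_a' are placeholders).
    df = wr.athena.read_sql_query(
        sql='SELECT * FROM table_a LIMIT 10',
        database='my_db',
    )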

Now I would like to update my data in table A. My questions are:

  1. (insert new rows):
    If I follow what I did above and change the mode to "overwrite", then all my data in S3 is removed (expected, but not desirable!) and the new data files are written to S3. The Glue catalogue crawl then results in only the NEW rows in Table A, with all my past data in Table A removed.
    This is not what I want. Is there a safe way to do this?
  2. (update rows):
    What if I would like to UPDATE data in Table A? Using AWS Data Wrangler, how can I achieve this without adding redundant rows to Table A?

I have tested (2) using mode="append", and indeed I ended up with redundant rows in Table A.

I would appreciate it if you could kindly share your thoughts.

Thanks,

2 Answers


  1. I think you need to use mode="overwrite_partitions".
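
    A minimal sketch of what that could look like with Data Wrangler, reusing the path and partition column from your question (new_df stands for whatever DataFrame holds the new or corrected rows):

        import awswrangler as wr

        # 'overwrite_partitions' deletes and rewrites only the partitions that
        # appear in new_df (i.e. its 'date' values); every other partition
        # already stored under s3://temp stays untouched.
        wr.s3.to_parquet(
            df=new_df,
            path='s3://temp',
            dataset=True,
            partition_cols=['date'],
            mode='overwrite_partitions',
        )

    As far as I understand, that covers both cases: a brand-new date just adds a partition (your question 1), and re-writing an existing date replaces that partition's rows instead of appending duplicates (your question 2), as long as your updates operate on whole partitions.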
