I started using AWS Data Wrangler and Athena to upload my data files to S3 and to query them, respectively. My question is about the procedure to "safely" update the data in the table.
Here is what I did:

- I used AWS Data Wrangler, wr.s3.to_parquet(df=my_df, path='s3://temp', dataset=True, partition_cols=['date'], concurrent_partitioning=True, mode='append'), to export my DataFrame to S3 as Parquet files (a runnable sketch of this call follows this list).
- I used AWS Glue to crawl my data files in S3 and create Table A.
- Via Athena, I can query my data in Table A.
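For context, here is a minimal, runnable sketch of that export step. The bucket path and the DataFrame contents are placeholders I made up for illustration, not values from my real data:

```python
import pandas as pd
import awswrangler as wr

# Sample DataFrame with a 'date' column to partition on (placeholder data).
my_df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["a", "b", "c"],
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
})

# Write the DataFrame to S3 as partitioned Parquet files,
# appending to whatever already exists under the path.
wr.s3.to_parquet(
    df=my_df,
    path="s3://my-bucket/temp/",   # placeholder bucket/prefix
    dataset=True,
    partition_cols=["date"],
    concurrent_partitioning=True,
    mode="append",
)
```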
Now I would like to update my data in Table A. My questions are:

- (Insert new rows): If I follow what I did above and change the mode to "overwrite", then all my data in S3 is removed (expected, but not desirable!) and the new data files are written to S3. The Glue crawl therefore leaves only the NEW rows in Table A, removing all my past data from Table A. This is not what I want. Is there a safe way to do this?
- (Update rows): What if I would like to UPDATE data in Table A? Using AWS Data Wrangler, how can I achieve this without adding redundant rows to Table A?
I have tested (2) using mode='append', and indeed I ended up with redundant rows in Table A.

I would appreciate it if you could share your thoughts.
Thanks,
2 Answers
I think you need to use mode="overwrite_partitions".
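With overwrite_partitions, only the partitions that appear in the DataFrame being written are deleted and rewritten; all other partitions under the dataset path stay untouched. A minimal sketch of what that could look like for your date-partitioned dataset (bucket path and values are placeholders):

```python
import pandas as pd
import awswrangler as wr

# New/updated data for a single date partition (placeholder values).
new_df = pd.DataFrame({
    "id": [4, 5],
    "value": ["d", "e"],
    "date": ["2021-01-02", "2021-01-02"],
})

# Only the partitions present in new_df (here date=2021-01-02) are
# replaced; all other partitions under the path remain as they were.
wr.s3.to_parquet(
    df=new_df,
    path="s3://my-bucket/temp/",   # placeholder, same dataset path as before
    dataset=True,
    partition_cols=["date"],
    mode="overwrite_partitions",
)
```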
You can use the API below:
Ref: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.s3.merge_upsert_table.html
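For the update case, the linked merge_upsert_table function merges a "delta" DataFrame into an existing Glue/Athena table on a primary key, so matching rows are updated rather than duplicated. A hedged sketch follows; the database name, table name, primary-key column, and data are assumptions for illustration, and the exact signature and availability depend on your awswrangler / AWS SDK for pandas version, so check the linked docs:

```python
import pandas as pd
import awswrangler as wr

# Rows to upsert into Table A; 'id' is assumed to be the primary key
# (placeholder data and names, adjust to your schema).
delta_df = pd.DataFrame({
    "id": [2, 6],
    "value": ["b-updated", "f"],
    "date": ["2021-01-01", "2021-01-03"],
})

# Merge the delta into the existing Glue table: rows whose primary key
# already exists are updated, rows with new keys are inserted.
wr.s3.merge_upsert_table(
    delta_df=delta_df,
    database="my_database",   # placeholder Glue database name
    table="table_a",          # placeholder Glue table name
    primary_key=["id"],
)
```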