I know that Athena stores the results of every query in an S3 bucket, and that this data just accumulates over time. I want to know whether retaining previous query results in S3 has any impact on the performance of my queries.
For background, I have AWS services (Glue and Lambda) that use Athena to return data, and most of my query results change frequently. I noticed that there is now 200 GB of data in my S3 bucket. Currently, it has only an archive configuration. I'm thinking of adding a lifecycle rule that retains only the last 7 or 30 days of results. Do query results really need to stay in S3 if we are not using them?
2 Answers
These are two completely different things. Query results are stored in the S3 results location, while the Glue Crawler runs over the source files. There is NO performance impact from keeping a history of query results.
Athena can use query results for a limited amount of time if you take advantage of the query result reuse feature, or of caching in the AWS Data Wrangler library. In all other scenarios there is no impact on performance.
Query results older than a few hours are useful only for auditing/debugging purposes.
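For reference, here is a minimal sketch of how result reuse can be requested through boto3's `start_query_execution`, assuming an Athena engine version that supports the feature. The helper names, the database, the output location, and the 60-minute maximum age are illustrative assumptions, not values from the question.

```python
def build_reuse_query_args(sql, database, output_location, max_age_minutes=60):
    """Build keyword arguments for Athena's start_query_execution with result reuse on.

    database, output_location and max_age_minutes are placeholder values
    chosen for illustration.
    """
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


def run_query(args):
    """Submit the query and return its execution id (needs AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    client = boto3.client("athena")
    return client.start_query_execution(**args)["QueryExecutionId"]
```

A call such as `run_query(build_reuse_query_args("SELECT 1", "my_db", "s3://my-athena-results/"))` would submit the query; the database and bucket names there are hypothetical.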
I definitely recommend adding a lifecycle rule to clean up objects older than x days, where x can be something like 3 or 7.
Doing so will reduce your S3 storage cost.
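A lifecycle rule like this can be set in the S3 console, or scripted. The sketch below uses boto3's `put_bucket_lifecycle_configuration`; the bucket name, the `athena-results/` prefix, the rule ID format, and the helper names are assumptions made for illustration. Note that this call replaces the bucket's entire lifecycle configuration, so any existing rules (such as archive transitions) would need to be included in the list that is passed in.

```python
def build_expiration_rule(prefix, days):
    """Return an S3 lifecycle rule that deletes objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "ID": f"expire-athena-results-{days}d",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def apply_lifecycle_rules(bucket, rules):
    """Write the rules to the bucket (needs AWS credentials).

    This overwrites the bucket's whole lifecycle configuration, so `rules`
    should also contain any existing rules you want to keep.
    """
    import boto3  # imported here so the builder above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration={"Rules": rules},
    )
```

For example, `apply_lifecycle_rules("my-athena-results", [build_expiration_rule("athena-results/", 7)])` would expire results after 7 days, assuming the bucket has no other rules to preserve.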