Background
Is there a way to get a list of all the files in an S3 bucket that are newer than a specific timestamp? For example, I am trying to get a list of all the files that were modified yesterday afternoon.
In particular, I have a bucket called foo-bar, and inside it I have a folder called prod, where the files I am trying to parse through live.
What I am trying so far
I referred to the boto3 documentation and came up with the following.
from boto3 import client
conn = client('s3')
conn.list_objects(Bucket='foo-bar', Prefix='prod/')['Contents']
Issues
There are two issues with this solution: first, it only lists 1000 files even though I have over 10,000; second, I am not sure how to filter by time.
3 Answers
You can try to use S3.Paginator.ListObjects, which will return
'LastModified': datetime(2015, 1, 1)
as part of the object metadata in the Contents array. You can then save the object Keys into a local list based on the LastModified condition.
You can filter based on timestamp doing this:
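A minimal sketch of such a timestamp filter, assuming each object is a dict with Key and LastModified entries as in the Contents array. The helper name modified_after, the sample data, and the cutoff values are illustrative and not taken from the original answer. boto3 returns LastModified as a timezone-aware datetime, so the cutoffs are made timezone-aware too.

```python
from datetime import datetime, timedelta, timezone


def modified_after(objects, cutoff):
    """Return the keys of objects whose LastModified is later than cutoff."""
    return [obj["Key"] for obj in objects if obj["LastModified"] > cutoff]


# Option 1: dynamic cutoff, relative to the current time (here: 24 hours ago).
dynamic_cutoff = datetime.now(timezone.utc) - timedelta(hours=24)

# Option 2: fixed cutoff at a specific point in time.
fixed_cutoff = datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc)

# Hypothetical stand-in for response["Contents"] from a list call.
sample = [
    {"Key": "prod/old.csv",
     "LastModified": datetime(2014, 12, 31, tzinfo=timezone.utc)},
    {"Key": "prod/new.csv",
     "LastModified": datetime(2015, 1, 2, tzinfo=timezone.utc)},
]

recent_keys = modified_after(sample, fixed_cutoff)
```

With real data, the objects argument would be the Contents list of a response such as the one in the question.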
Note that I give you two options to create your condition based on a timestamp: dynamic (a given amount of time before now) or fixed (a specific datetime).
Since the AWS S3 API doesn’t support server-side filtering by attributes such as modification time, you’ll need to filter the returned objects yourself.
Further, the list_objects and list_objects_v2 APIs only support returning 1000 objects at a time, so you’ll need to paginate the results, calling the API again and again to get all of the objects in a bucket. There is a helper method, get_paginator, that handles this for you.
So, you can put these two together to get the list of all objects in a bucket and filter them based on whatever criteria you see fit:
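One way this combination could look, as a sketch rather than the original answer's code: the function name keys_modified_between and the time window are assumptions for illustration. It takes an S3 client as a parameter, pages through list_objects_v2 with get_paginator, and keeps only keys whose LastModified falls inside the window.

```python
from datetime import datetime, timezone


def keys_modified_between(client, bucket, prefix, start, end):
    """Collect keys under prefix with start <= LastModified < end."""
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    # Each page holds at most 1000 objects; the paginator requests
    # the following pages until the listing is exhausted.
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is missing when a page has no matching objects.
        for obj in page.get("Contents", []):
            if start <= obj["LastModified"] < end:
                keys.append(obj["Key"])
    return keys


# Example window (illustrative): noon to 18:00 UTC on a given day.
window_start = datetime(2015, 1, 1, 12, 0, tzinfo=timezone.utc)
window_end = datetime(2015, 1, 1, 18, 0, tzinfo=timezone.utc)

# Usage, which requires boto3 and AWS credentials:
#   import boto3
#   s3 = boto3.client("s3")
#   keys = keys_modified_between(s3, "foo-bar", "prod/",
#                                window_start, window_end)
```

Passing the client in keeps the function independent of how credentials are configured and makes it easy to exercise with a stub.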