
Background

Is there a way to get a list of all the files in an S3 bucket that are newer than a specific timestamp? For example, I am trying to get a list of all the files that were modified yesterday afternoon.

In particular, I have a bucket called foo-bar, and inside it a folder called prod where the files I am trying to parse through live.

What I am trying so far

I referred to the boto3 documentation and came up with the following so far.

from boto3 import client
conn = client('s3')
conn.list_objects(Bucket='foo-bar', Prefix='prod/')['Contents']

Issues

There are two issues with this solution: first, it only lists 1,000 files even though I have over 10,000, and second, I am not sure how to filter by time.

Answers


  1. You can try S3.Paginator.ListObjects, which returns a 'LastModified' datetime (e.g. datetime(2015, 1, 1)) as part of each object's metadata in the Contents array. You can then save the object Keys into a local list based on your LastModified condition.

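    A minimal sketch of that approach, assuming the bucket/prefix names from the question and a placeholder cutoff date (the helper names `filter_recent` and `list_recent_keys` are illustrative, not part of boto3):

```python
from datetime import datetime, timezone

def filter_recent(contents, cutoff):
    """Keep the keys of objects whose LastModified is after the cutoff."""
    return [obj['Key'] for obj in contents if obj['LastModified'] > cutoff]

def list_recent_keys(bucket, prefix, cutoff):
    """Paginate list_objects_v2 and collect keys modified after the cutoff."""
    import boto3  # deferred so the filtering helper works without boto3 installed
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no 'Contents' key
        keys.extend(filter_recent(page.get('Contents', []), cutoff))
    return keys

# Example call (requires AWS credentials and access to the bucket):
# recent = list_recent_keys('foo-bar', 'prod/',
#                           datetime(2015, 1, 1, tzinfo=timezone.utc))
```

    Note that the cutoff must be timezone-aware, since the LastModified values S3 returns are UTC datetimes and comparing them to a naive datetime raises a TypeError.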
  2. You can filter based on timestamp like this:

    import boto3
    from datetime import datetime, timedelta
    from dateutil.tz import tzutc
    
    condition_timestamp = datetime.now(tz=tzutc()) - timedelta(days=2, hours=12)  # dynamic condition
    # condition_timestamp = datetime(2023, 2, 17, tzinfo=tzutc())  # fixed condition
    
    s3 = boto3.client('s3')
    
    paginator = s3.get_paginator('list_objects_v2')
    
    # page.get('Contents', []) avoids a KeyError when a page has no objects
    s3_filtered_list = [
        obj
        for page in paginator.paginate(Bucket="foo-bar", Prefix="prod/")
        for obj in page.get("Contents", [])
        if obj['LastModified'] > condition_timestamp
    ]
    
    print(s3_filtered_list)
    

    Note that I gave you two options to create your condition based on a timestamp: dynamic (a delta back from now) or fixed (a specific datetime).

  3. Since the AWS S3 API doesn’t support server-side filtering by modification time, you’ll need to filter the returned objects yourself.

    Further, the list_objects and list_objects_v2 APIs only return up to 1000 objects per call, so you’ll need to paginate, calling them again and again to get all of the objects in a bucket. There is a helper method get_paginator that handles this for you.

    So, you can put these two together, and get the list of all objects in a bucket, and filter them based on whatever criteria you see fit:

    import boto3
    from datetime import datetime, timezone
    
    BUCKET = 'foo-bar'
    
    # Pick a target timestamp to filter objects on or after
    # Note, it must be timezone-aware: S3 returns LastModified in UTC
    target_timestamp = datetime(2023, 2, 1, tzinfo=timezone.utc)
    found_objects = []
    
    # Create and use a paginator to list more than 1000 objects in the bucket
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        # Pull out each list of objects from each page
        for cur in page.get('Contents', []):
            # Check each object to see if it matches the target criteria
            if cur['LastModified'] >= target_timestamp:
                # If so, add it to the final list
                found_objects.append(cur)
    
    # Just show the number of found objects in this example
    print(f"Found {len(found_objects)} objects")
    