
Background

Is there a way to get a list of all the files in an S3 bucket that are newer than a specific timestamp? For example, I am trying to get a list of all the files that were modified yesterday afternoon.

In particular, I have a bucket called foo-bar, and inside it a folder called prod where the files I am trying to parse through live.

What I am trying so far

I referred to the boto3 documentation and came up with the following so far.

from boto3 import client
conn = client('s3')
conn.list_objects(Bucket='foo-bar', Prefix='prod/')['Contents']

Issues

There are two issues with this solution: first, it only lists 1,000 files even though I have over 10,000, and second, I am not sure how to filter by time.

Answers


  1. You can try S3.Paginator.ListObjects, which returns a 'LastModified' datetime (e.g. datetime(2015, 1, 1)) as part of each object's metadata in the Contents array. You can then save the object Keys into a local list based on your LastModified condition.

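    A minimal sketch of that approach, assuming the bucket/prefix names from the question and a placeholder cutoff date (the helper names `filter_recent` and `list_recent_keys` are illustrative, not part of boto3):

```python
from datetime import datetime, timezone

def filter_recent(contents, cutoff):
    """Keep the keys of objects whose LastModified is after the cutoff."""
    return [obj['Key'] for obj in contents if obj['LastModified'] > cutoff]

def list_recent_keys(bucket, prefix, cutoff):
    """Paginate list_objects_v2 and collect keys modified after the cutoff."""
    import boto3  # deferred so the filtering helper works without boto3 installed
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching objects have no 'Contents' key
        keys.extend(filter_recent(page.get('Contents', []), cutoff))
    return keys

# Example call (requires AWS credentials and access to the bucket):
# recent = list_recent_keys('foo-bar', 'prod/',
#                           datetime(2015, 1, 1, tzinfo=timezone.utc))
```

    Note that the cutoff must be timezone-aware, since the LastModified values S3 returns are UTC datetimes and comparing them to a naive datetime raises a TypeError.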
  2. You can filter based on timestamp like this:

    import boto3
    from datetime import datetime, timedelta
    from dateutil.tz import tzutc
    
    condition_timestamp = datetime.now(tz=tzutc()) - timedelta(days=2, hours=12)  # dynamic condition
    # condition_timestamp = datetime(2023, 2, 17, tzinfo=tzutc())  # fixed condition
    
    s3 = boto3.client('s3')
    
    paginator = s3.get_paginator('list_objects_v2')
    
    # page.get('Contents', []) avoids a KeyError when a page has no objects
    s3_filtered_list = [
        obj
        for page in paginator.paginate(Bucket="foo-bar", Prefix="prod/")
        for obj in page.get("Contents", [])
        if obj['LastModified'] > condition_timestamp
    ]
    
    print(s3_filtered_list)
    

    Note that I gave you two options to create your condition based on a timestamp: dynamic (a delta back from now) or fixed (a specific datetime).

  3. Since the AWS S3 API doesn’t support server-side filtering by modification time, you’ll need to filter the returned objects yourself.

    Further, the list_objects and list_objects_v2 APIs only return up to 1000 objects per call, so you’ll need to paginate, calling them again and again to get all of the objects in a bucket. There is a helper method get_paginator that handles this for you.

    So, you can put these two together, and get the list of all objects in a bucket, and filter them based on whatever criteria you see fit:

    import boto3
    from datetime import datetime, timezone
    
    BUCKET = 'foo-bar'
    
    # Pick a target timestamp to filter objects on or after
    # Note, it must be timezone-aware: S3 returns LastModified in UTC
    target_timestamp = datetime(2023, 2, 1, tzinfo=timezone.utc)
    found_objects = []
    
    # Create and use a paginator to list more than 1000 objects in the bucket
    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        # Pull out each list of objects from each page
        for cur in page.get('Contents', []):
            # Check each object to see if it matches the target criteria
            if cur['LastModified'] >= target_timestamp:
                # If so, add it to the final list
                found_objects.append(cur)
    
    # Just show the number of found objects in this example
    print(f"Found {len(found_objects)} objects")
    