
I have this path in S3: object1/object2/object3/object4/

Inside object4/ I have a list of objects, for example:

directory1/directory2/directory3/directory4/2022-30-09-15h21/

directory1/directory2/directory3/directory4/2023-20-12-12h30/

directory1/directory2/directory3/directory4/2022-31-12-09h34/

directory1/directory2/directory3/directory4/2023-12-08-14h56/

I would like to select the most recently created directory in directory4/ and then download all the files inside it.

I wrote this script to do it:

import boto3
from datetime import datetime 

session_root = boto3.Session(region_name='eu-west-3', profile_name='my_profile')
s3_client = session_root.client('s3') 

bucket_name = 'my_bucket' 

prefix = 'object1/object2/object3/object4/'

# List objects in the bucket 
response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix) 

# Extract the object names and convert them to datetime objects 

# 'LastModified' is already a datetime object, so no string round-trip is needed
objects_with_dates = [(obj['Key'], obj['LastModified']) for obj in response.get('Contents', [])]
# Find the latest created object 

latest_object = max(objects_with_dates, key=lambda x: x[1]) 

print("Last created S3 object:", latest_object[0]) # the returned value is: object1/object2/object3/object4/2023-20-12-12h30/my_file.csv

My script finds the last created directory in directory4/ but only picks the last created file inside it. The result of my script is: directory1/directory2/directory3/directory4/2023-20-12-12h30/my_file.csv

But I would like to download all files inside.

Do you have any idea how I can modify my script to select the last created directory in directory4/ and download all the files inside it?

Thanks

2 Answers


  1. One way to find the last created object in your S3 bucket is to create a DynamoDB table and use a Lambda function with S3 Object Lambda to save a catalog of objects into DynamoDB, with an index on the modified/changed time.

    Of course, you can use a database other than DynamoDB, but DynamoDB is very cheap to start with, and you can decide later whether switching databases makes sense. With that option, DynamoDB only costs money when you use it.

    It's a little more complex than what you asked for, but if you have 100,000,000 objects in S3 you pay for each list scan and object lookup, so mistakes can become very expensive. That is why I recommend S3 Object Lambda ( https://aws.amazon.com/s3/features/object-lambda/ )

  2. It appears that your requirement is:

    • List all sub-directories for a given prefix (e.g. all sub-directories under directory1/directory2/directory3/directory4/)
    • Of those sub-directories, find the sub-directory that represents the latest date by using the name of the subdirectory that includes a timestamp in YYYY-DD-MM-HHhmm format
    • Download all the objects in that sub-directory
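The second step can also be done by parsing each sub-directory name into a real datetime instead of comparing strings. This is a minimal sketch, assuming every name follows the YYYY-DD-MM-HHhmm pattern shown in the question; the helper names `parse_folder_date` and `latest_prefix` are illustrative and not part of boto3.

```python
from datetime import datetime

# Assumed folder-name pattern: year-day-month-hour'h'minute, e.g. 2023-20-12-12h30
FOLDER_FORMAT = '%Y-%d-%m-%Hh%M'

def parse_folder_date(prefix):
    """Parse the last path component of an S3 prefix into a datetime."""
    name = prefix.rstrip('/').split('/')[-1]
    return datetime.strptime(name, FOLDER_FORMAT)

def latest_prefix(prefixes):
    """Return the prefix whose folder name encodes the most recent date."""
    return max(prefixes, key=parse_folder_date)

# Sample prefixes taken from the question
prefixes = [
    'directory1/directory2/directory3/directory4/2022-30-09-15h21/',
    'directory1/directory2/directory3/directory4/2023-20-12-12h30/',
    'directory1/directory2/directory3/directory4/2022-31-12-09h34/',
    'directory1/directory2/directory3/directory4/2023-12-08-14h56/',
]
print(latest_prefix(prefixes))
```

Parsing with `strptime` also raises a `ValueError` on any folder whose name does not match the pattern, which may be preferable to silently sorting it into the wrong position.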

    Here is a sample program that uses the list of CommonPrefixes returned by S3, which is effectively a list of sub-directories.

    import boto3
    
    BUCKET = 'my-bucket'
    PREFIX = 'directory1/directory2/directory3/directory4/'
    
    # Custom date sorter to handle YYYY-DD-MM-HHhmm format
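    def date_sorter(prefix):
        # Use only the last path component (e.g. '2023-20-12-12h30') so hyphens elsewhere in the prefix cannot shift the parts
        date_parts = prefix.rstrip('/').split('/')[-1].split('-')
        # Reorder to (year, month, day, time) so the tuples sort chronologically
        return (date_parts[0], date_parts[2], date_parts[1], date_parts[3])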
    
    
    # Obtain a list of CommonPrefixes in the given Bucket and Prefix
    # Use a paginator in case there are more than 1000 objects
    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    result = paginator.paginate(Bucket=BUCKET, Delimiter='/', Prefix=PREFIX)
    
    # Get the 'latest' CommonPrefix but it is in the format YYYY-DD-MM-HHhmm
    # (a page without CommonPrefixes yields None, so skip those entries)
    prefixes = [item['Prefix'] for item in result.search('CommonPrefixes') if item]
    latest_prefix = sorted(prefixes, key=date_sorter)[-1]
    
    # Download all objects from that prefix
    s3_resource = boto3.resource('s3')
    for obj in s3_resource.Bucket(BUCKET).objects.filter(Prefix=latest_prefix):
        # Download to local directory using just the filename
        filename = obj.key.split('/')[-1]
        if not filename:
            continue  # Skip "folder" placeholder keys that end in '/'
        print(f'Downloading {obj.key}')
        obj.Object().download_file(filename)
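The sample program above saves every object under its bare filename, so two objects with the same name in different sub-folders of the chosen prefix would overwrite each other. If that matters, the local path can mirror the key's path below the prefix. This is a rough sketch only; `local_path_for` and the `dest_dir` parameter are hypothetical names, and it assumes every key starts with the given prefix.

```python
import os

def local_path_for(key, prefix, dest_dir='.'):
    """Map an S3 key to a local path that keeps the sub-folders below prefix.

    Returns None for placeholder keys (the prefix itself or keys ending in '/').
    Assumes key starts with prefix.
    """
    relative = key[len(prefix):]
    if not relative or relative.endswith('/'):
        return None
    return os.path.join(dest_dir, *relative.split('/'))

# Example with a key shaped like the one in the question
prefix = 'directory1/directory2/directory3/directory4/2023-20-12-12h30/'
print(local_path_for(prefix + 'my_file.csv', prefix, 'downloads'))
print(local_path_for(prefix + 'sub/other.csv', prefix, 'downloads'))
```

Before passing the returned path to `download_file`, the parent directory has to exist, for example via `os.makedirs(os.path.dirname(path), exist_ok=True)`.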
    