
I have the following code that downloads files from S3 to a local directory. However, I cannot figure out how to download a file only if the S3 object is different from, and more recently updated than, the local copy. What is the best way to do this? Should it be based on the modified time, ETags, MD5 checksums, or all of these?

import boto3
import os

BUCKET_NAME = 'testing'
KEY = ''  # prefix to filter on (left undefined in my original snippet)

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=KEY)

if 'Contents' in response:
    for obj in response['Contents']:
        file_key = obj['Key']
        file_name = os.path.basename(file_key)  # Get the file name from the key
        local_file_path = os.path.join('test_dir', file_name)
        # Download the file
        s3_client.download_file(BUCKET_NAME, file_key, local_file_path)
        print(f"Downloaded {file_name}")

2 Answers


  1. Based on the official documentation, list_objects_v2 returns a wealth of information about the objects stored in your bucket. The response contains elements such as the total number of keys returned and the common prefixes, but the most important is the Contents list, which holds the metadata for each individual object in the bucket.

    From what I see, Contents has a field called LastModified of type datetime. I think you could use it to check whether a file has been updated, which avoids comparing the actual content of the local object against the remote one (something I really don’t recommend).

    I’d suggest keeping a (local) database of metadata about your S3 objects, containing elements such as Key and LastModified. Store your files locally in a predefined folder, and make sure each file has a name that can be deduced from the database information (perhaps name them after the Key).

    You won’t even need to read through the contents of the files to check whether a file was updated. Just query your database, query the S3 API using list_objects_v2, and compare the dates for each file. If they don’t match, download the newer version of the file (see the sketch at the end of this answer).

    In this manner you can also check for files missing from your local repository: if the API returns keys that don’t exist in your database, you can easily spot and retrieve them.

    P.S. A deleted answer suggested using the operating system’s library functions to check a file’s last-modified date. It’s a great idea, but if you need performance, iterating over metadata stored in a database can be faster than stat-ing every file in a directory.
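
    As promised above, here is a minimal sketch of the database idea, assuming SQLite as the metadata store (DB_PATH and the objects table are made-up names) and the bucket and prefix from the question:

    import boto3
    import os
    import sqlite3

    BUCKET_NAME = 'testing'       # as in the question
    KEY = ''                      # prefix to sync, as in the question
    TARGET_DIR = 'test_dir'
    DB_PATH = 'sync_metadata.db'  # hypothetical location of the metadata database

    os.makedirs(TARGET_DIR, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        'CREATE TABLE IF NOT EXISTS objects (object_key TEXT PRIMARY KEY, last_modified TEXT)'
    )

    s3_client = boto3.client('s3')
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=KEY):
        for obj in page.get('Contents', []):
            key = obj['Key']
            # Store LastModified as an ISO 8601 string so equality checks are simple
            remote_stamp = obj['LastModified'].isoformat()
            row = conn.execute(
                'SELECT last_modified FROM objects WHERE object_key = ?', (key,)
            ).fetchone()
            if row is not None and row[0] == remote_stamp:
                continue  # the database says we already have this version
            # Missing or stale locally: download, then record the new timestamp
            local_path = os.path.join(TARGET_DIR, key.replace('/', '_'))  # named after the Key
            s3_client.download_file(BUCKET_NAME, key, local_path)
            conn.execute(
                'INSERT OR REPLACE INTO objects (object_key, last_modified) VALUES (?, ?)',
                (key, remote_stamp),
            )
            conn.commit()
            print(f"Downloaded {local_path}")
    conn.close()

    Keys that appear in the listing but not in the database fall through to the download branch, which also covers the missing-files case mentioned above.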

  2. If you’re just concerned with downloading new files and files that have been modified remotely, you can check whether each file exists locally and, if it does, whether it has a different size or a newer remote timestamp than the local copy. That should cover most cases where a remote file is added or changed.

    import boto3
    import os
    from datetime import datetime, timezone
    
    # BUCKET_NAME, KEY and TARGET_DIR are assumed to be defined beforehand
    s3_client = boto3.client('s3')
    # Use a paginator to handle listings with more than 1000 objects
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET_NAME, Prefix=KEY):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith("/"):
                # Skip zero-byte "directory" placeholder objects
                continue
            # Turn the key into a name, using '/' on S3 as the path delimiter locally
            local_name = os.path.join(*key[len(KEY):].split("/"))
            local_name = os.path.join(TARGET_DIR, local_name)
    
            changed = False
            if not os.path.isfile(local_name):
                # The file does not exist locally
                changed = True
            elif os.path.getsize(local_name) != obj['Size']:
                # The local file is a different size
                changed = True
            elif datetime.fromtimestamp(os.path.getmtime(local_name), tz=timezone.utc) < obj['LastModified']:
                # The local file is older than the remote file (compared in UTC,
                # since LastModified is timezone-aware UTC)
                changed = True
    
            if changed:
                # Create any directories needed to mirror the prefix of the object
                os.makedirs(os.path.dirname(local_name), exist_ok=True)
                # Download the file
                s3_client.download_file(BUCKET_NAME, key, local_name)
                print(f"Downloaded {local_name}")
    
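    One caveat with the timestamp check: download_file stamps the local file with the time of the download, not with the object’s LastModified. That still works with the comparison above, but if you want the local mtime to mirror S3 exactly, you could copy the remote timestamp onto the file right after each download; a small sketch, using the same variables as the loop above:

    # After the s3_client.download_file(...) call, copy the remote timestamp
    # onto the local file so later mtime comparisons reflect S3 time.
    remote_ts = obj['LastModified'].timestamp()
    os.utime(local_name, (remote_ts, remote_ts))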