I have the following code that downloads files from S3 to local disk. However, I cannot figure out how to download only the S3 files that differ from, and are newer than, the local ones. What is the best way to do this? Should it be based on modified time, ETags, MD5, or all of these?
import os

import boto3

BUCKET_NAME = 'testing'
KEY = 'my/prefix/'  # placeholder prefix; the real value is not shown in this snippet

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=KEY)
if 'Contents' in response:
    for obj in response['Contents']:
        file_key = obj['Key']
        file_name = os.path.basename(file_key)  # Get the file name from the key
        local_file_path = os.path.join('test_dir', file_name)
        # Download the file
        s3_client.download_file(BUCKET_NAME, file_key, local_file_path)
        print(f"Downloaded {file_name}")
2 Answers
Based on the official documentation, list_objects_v2 returns an ample amount of information about the objects stored in your bucket. The response contains elements such as the total number of keys returned and the common prefixes, but the most important is the Contents list, which stores data for each individual object in the bucket.

Each Contents entry has a field called LastModified, of type datetime. You could use it to check whether a file has been updated, and so avoid comparing the actual content of the local object against the remote one (which I really don't recommend).

I'd suggest keeping a (local) database of metadata about your S3 objects, containing elements such as Key and LastModified. Store your files locally in a predefined folder, and make sure their names can be deduced from the database information (maybe name them after the Key).

You won't even need to read the contents of the files to check whether a file was updated. Just query your database, query the S3 API using list_objects_v2, and compare the dates for each file. If they do not correspond, download the newer version of the file.

In this manner, you can also check for files missing from your local repository: if the API returns any additional keys, you can easily see which of them don't exist in your database and retrieve them.
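The date-comparison logic above can be sketched without touching AWS at all, which also makes it easy to test. This is a minimal sketch, assuming a JSON file (the name `s3_metadata.json` and both helper functions are my own invention, not from boto3) serves as the local "database" mapping each Key to its last-known LastModified timestamp:

```python
import json
from datetime import datetime
from pathlib import Path

METADATA_FILE = Path("s3_metadata.json")  # hypothetical local metadata "database"

def load_metadata(path=METADATA_FILE):
    """Load the Key -> LastModified (ISO-8601 string) mapping, if present."""
    if path.exists():
        return json.loads(path.read_text())
    return {}

def keys_to_download(s3_objects, metadata):
    """Return the keys that are missing locally or newer on S3.

    s3_objects is the 'Contents' list from list_objects_v2: each item
    has a 'Key' string and a 'LastModified' timezone-aware datetime.
    """
    stale = []
    for obj in s3_objects:
        key, remote_mtime = obj["Key"], obj["LastModified"]
        local_iso = metadata.get(key)
        # Missing from the database, or remote copy is newer -> download it
        if local_iso is None or datetime.fromisoformat(local_iso) < remote_mtime:
            stale.append(key)
    return stale
```

After downloading a key, you would record `obj["LastModified"].isoformat()` in the mapping and write it back with `METADATA_FILE.write_text(json.dumps(metadata))`, so the next run only fetches what changed.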
P.S. A deleted answer suggested using the operating system's library functions to check a file's last-modified date. That is a fine idea too, but if you need performance, iterating over metadata stored in a database can be faster than using operating system calls to stat files in a directory.
If you're just concerned with downloading new files and files that have been modified remotely, you can check whether the file exists locally and, if it does, whether the remote object has a different size or a newer timestamp than the local copy. That covers most cases where a remote file is added or changed.
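That check can be written as a small predicate and dropped into the question's loop. This is a sketch, not a definitive implementation; the function name is my own, and it assumes you pass in the `Size` and `LastModified` fields from a list_objects_v2 Contents entry:

```python
import os
from datetime import datetime, timezone

def needs_download(local_path, remote_size, remote_last_modified):
    """Decide whether an S3 object should be (re)downloaded.

    remote_size is the 'Size' field and remote_last_modified the
    'LastModified' datetime from a list_objects_v2 'Contents' entry.
    """
    if not os.path.exists(local_path):
        return True  # file is new remotely
    stat = os.stat(local_path)
    if stat.st_size != remote_size:
        return True  # sizes differ, so the contents almost certainly differ
    # Compare timestamps in UTC; S3's LastModified is timezone-aware
    local_mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return remote_last_modified > local_mtime
```

In the loop from the question, the download line would become conditional: `if needs_download(local_file_path, obj['Size'], obj['LastModified']): s3_client.download_file(...)`. Note that comparing the local mtime against S3's LastModified only works reliably if local files are not modified after download; otherwise the database approach from the first answer is more robust.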