I have the following code that downloads files from S3 to local disk. However, I cannot figure out how to download only the S3 files that differ from, and are newer than, the local ones. What is the best way to do this? Should it be based on modified time, ETags, MD5, or all of these?
import os

import boto3

BUCKET_NAME = 'testing'
KEY = 'my/prefix/'  # placeholder prefix; the real value is not shown in this snippet

s3_client = boto3.client('s3')
response = s3_client.list_objects_v2(Bucket=BUCKET_NAME, Prefix=KEY)
if 'Contents' in response:
    for obj in response['Contents']:
        file_key = obj['Key']
        file_name = os.path.basename(file_key)  # Get the file name from the key
        local_file_path = os.path.join('test_dir', file_name)
        # Download the file
        s3_client.download_file(BUCKET_NAME, file_key, local_file_path)
        print(f"Downloaded {file_name}")
2 Answers
Based on the official documentation, list_objects_v2 returns an ample amount of information about the objects stored in your bucket. The response contains elements such as the total number of keys returned and the common prefixes, but the most important is the Contents list, which stores data for each individual object in the bucket.

Each Contents entry has a field called LastModified, of type datetime. You could use it to check whether a file has been updated, and so avoid comparing the actual content of the local object against the remote one (which I really don't recommend).

I'd suggest keeping a (local) database of metadata about your S3 objects, containing elements such as Key and LastModified. Store your files locally in a predefined folder, and make sure their names can be deduced from the database information (maybe name them after the Key).

You won't even need to read the contents of the files to check whether a file was updated. Just query your database, query the S3 API using list_objects_v2, and compare the dates for each file. If they do not correspond, download the newer version of the file.

In this manner, you can also check for files missing from your local repository: if the API returns any additional keys, you can easily see which of them don't exist in your database and retrieve them.
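The date-comparison logic above can be sketched without touching AWS at all, which also makes it easy to test. This is a minimal sketch, assuming a JSON file (the name `s3_metadata.json` and both helper functions are my own invention, not from boto3) serves as the local "database" mapping each Key to its last-known LastModified timestamp:

```python
import json
from datetime import datetime
from pathlib import Path

METADATA_FILE = Path("s3_metadata.json")  # hypothetical local metadata "database"

def load_metadata(path=METADATA_FILE):
    """Load the Key -> LastModified (ISO-8601 string) mapping, if present."""
    if path.exists():
        return json.loads(path.read_text())
    return {}

def keys_to_download(s3_objects, metadata):
    """Return the keys that are missing locally or newer on S3.

    s3_objects is the 'Contents' list from list_objects_v2: each item
    has a 'Key' string and a 'LastModified' timezone-aware datetime.
    """
    stale = []
    for obj in s3_objects:
        key, remote_mtime = obj["Key"], obj["LastModified"]
        local_iso = metadata.get(key)
        # Missing from the database, or remote copy is newer -> download it
        if local_iso is None or datetime.fromisoformat(local_iso) < remote_mtime:
            stale.append(key)
    return stale
```

After downloading a key, you would record `obj["LastModified"].isoformat()` in the mapping and write it back with `METADATA_FILE.write_text(json.dumps(metadata))`, so the next run only fetches what changed.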
P.S. A deleted answer suggested using the operating system's library functions to check a file's last-modified date. That is a fine idea too, but if you need performance, iterating over metadata stored in a database can be faster than using operating system calls to stat files in a directory.
If you're just concerned with downloading new files and files that have been modified remotely, you can check whether the file exists locally and, if it does, whether the remote object has a different size or a newer timestamp than the local copy. That covers most cases where a remote file is added or changed.
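That check can be written as a small predicate and dropped into the question's loop. This is a sketch, not a definitive implementation; the function name is my own, and it assumes you pass in the `Size` and `LastModified` fields from a list_objects_v2 Contents entry:

```python
import os
from datetime import datetime, timezone

def needs_download(local_path, remote_size, remote_last_modified):
    """Decide whether an S3 object should be (re)downloaded.

    remote_size is the 'Size' field and remote_last_modified the
    'LastModified' datetime from a list_objects_v2 'Contents' entry.
    """
    if not os.path.exists(local_path):
        return True  # file is new remotely
    stat = os.stat(local_path)
    if stat.st_size != remote_size:
        return True  # sizes differ, so the contents almost certainly differ
    # Compare timestamps in UTC; S3's LastModified is timezone-aware
    local_mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return remote_last_modified > local_mtime
```

In the loop from the question, the download line would become conditional: `if needs_download(local_file_path, obj['Size'], obj['LastModified']): s3_client.download_file(...)`. Note that comparing the local mtime against S3's LastModified only works reliably if local files are not modified after download; otherwise the database approach from the first answer is more robust.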