
I am trying to use the boto3 Python SDK. I have a bucket titled "tmp" with keys that look like "my_test1/logABC1.json", "my_test1/logABC2.json", "my_test1/logABC3.json", etc., plus gobs of other stuff that is meaningless to me. What I want is to download all of the files in my my_test1 directory. This is what I tried:

    import boto3

    counter = 1
    client = boto3.client("s3")  # access_keys/secrets and endpoints omitted for brevity
    abc = client.list_objects(Bucket="tmp")
    for x in abc["Contents"]:
        key = x["Key"]
        if "my_test1" in key:
            location = "logABC" + str(counter) + ".json"
            client.download_file("tmp", key, location)
            counter += 1

And this was "working" as long as my tmp bucket had fewer than 1000 items in it. Beyond that it doesn't work at all, since list_objects returns a maximum of 1000 elements per the boto3 [documentation][1], and anything after that is stuck in the cloud. My question is: how do I work around this limitation? I see there is a list_objects_v2 that (technically) can start after the first 1000 keys (with some work), but am I missing something, or is this my best bet? If it is my best bet, do I just write a while loop that terminates once len(abc["Contents"]) is less than 1000?
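
For what it's worth, this is roughly the while loop I have in mind (an untested sketch, assuming the IsTruncated/NextContinuationToken fields behave the way the list_objects_v2 docs describe):

    import boto3

    client = boto3.client("s3")
    kwargs = {"Bucket": "tmp"}
    keys = []
    while True:
        resp = client.list_objects_v2(**kwargs)
        # Each response carries at most 1000 objects under "Contents"
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # Ask for the next page, starting after the last key already returned
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]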

As a side note, even if I make a direct call of

    client.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")

This fails to find the file as long as "my_test1/logABC2.json" is a key beyond the first 1000. I see there is such a thing as a resource, and if I define

    rsce = boto3.resource("s3") #access_keys/secrets and endpoints omitted for brevity
    rsce.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")

This works even if "my_test1/logABC2.json" is not in the first 1000 keys (or at least my sample test worked anyway), but since I do not know the exact file names I am looking for, this does not seem like a good option.

My question is: how do you download all files in a sub_directory if the bucket is very large? I feel like I must be missing something or doing something wrong, as this must come up for other people. (sub_directory is used loosely; I know there is no such thing as a sub_directory of a bucket, but with proper use of delimiters you can synthetically get away with it.)
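
(To be concrete, by "proper use of delimiters" I mean prefix-style keys, i.e. a filtered listing along the lines of this rough sketch, which as far as I can tell is still capped at 1000 results per call:)

    import boto3

    client = boto3.client("s3")
    resp = client.list_objects_v2(Bucket="tmp", Prefix="my_test1/")
    for obj in resp.get("Contents", []):
        print(obj["Key"])  # only keys that start with "my_test1/" come back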

Thanks for any pointers
[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects.html

2 Answers


  1. You’ll need to make multiple calls to list_objects to get each "page" of up to 1000 items. boto3 provides paginators to make this easier. For instance:

    #!/usr/bin/env python3
    
    import boto3
    
    counter = 1
    # Access Keys/Secrets should not be in code, use "aws configure" or instance profiles
    # See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    client = boto3.client("s3")
    # Create a paginator to page through the responses
    paginator = client.get_paginator('list_objects')
    for page in paginator.paginate(Bucket='tmp'):
        # Operate on each page, technically it's possible for a page
        # to not return any contents, so use .get() here to handle
        # the case where a different response occurs
        for x in page.get('Contents', []):
            # From here out, the code is much the same as before
            key = x['Key']
            if "my_test1" in key:
                location = "logABC" + str(counter) + ".json"
                client.download_file("tmp", key, location)
                counter += 1
    
  2. If you use the resource interface of boto3, it handles pagination for you:

    import boto3
    
    s3_resource = boto3.resource('s3')
    
    bucket = s3_resource.Bucket('example-bucket')
    
    for obj in bucket.objects.filter(Prefix='my_test1/'):
        target_filename = obj.key[obj.key.rfind('/')+1:]  # Remove path
        bucket.download_file(obj.key, target_filename)
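
    The objects.filter(Prefix='my_test1/') collection makes the paginated list calls lazily as you iterate, so it is not capped at the first 1000 keys, and the prefix means only the my_test1/ objects are listed at all.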
    