
I am trying to use the boto3 Python SDK. I have a bucket titled "tmp" with keys that look like "my_test1/logABC1.json", "my_test1/logABC2.json", "my_test1/logABC3.json", etc., plus gobs of other stuff that is meaningless to me. What I want is to download all of the files in my my_test1 directory. This is what I tried:

    import boto3

    counter = 1
    client = boto3.client("s3")  # access_keys/secrets and endpoints omitted for brevity
    abc = client.list_objects(Bucket="tmp")
    for x in abc["Contents"]:
        key = x["Key"]
        if "my_test1" in key:
            location = "logABC" + str(counter) + ".json"
            client.download_file("tmp", key, location)
            counter += 1

And this was "working" as long as my tmp bucket had fewer than 1000 items in it. Beyond that it doesn't work at all, since list_objects returns a maximum of 1000 elements per the boto3 [documentation][1], and anything after that is stuck in the cloud. My question is: how do I work around this limitation? I see there is a list_objects_v2 that (technically) can start after the first 1000 keys (with some work), but am I missing something, or is this my best bet? If it is my best bet, do I just write a while loop that terminates once len(abc["Contents"]) is less than 1000?
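
For what it's worth, this is roughly the while loop I have in mind (an untested sketch, assuming the IsTruncated/NextContinuationToken fields behave the way the list_objects_v2 docs describe):

    import boto3

    client = boto3.client("s3")
    kwargs = {"Bucket": "tmp"}
    keys = []
    while True:
        resp = client.list_objects_v2(**kwargs)
        # Each response carries at most 1000 objects under "Contents"
        keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        # Ask for the next page, starting after the last key already returned
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]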

As a side note, even if I make a direct call of

    client.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")

This fails to find the file as long as "my_test1/logABC2.json" is a key beyond the first 1000. I see there is such a thing as a resource, and if I define

    rsce = boto3.resource("s3") #access_keys/secrets and endpoints omitted for brevity
    rsce.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")

This works even if "my_test1/logABC2.json" is not in the first 1000 keys (or at least my sample test worked anyway), but since I do not know the exact file names I am looking for, this does not seem like a good option.

My question is: how do you download all files in a sub_directory if the bucket is very large? I feel like I must be missing something or doing something wrong, as this must come up for other people. (sub_directory is used loosely; I know there is no such thing as a sub_directory of a bucket, but with proper use of delimiters you can synthetically get away with it.)
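
(To be concrete, by "proper use of delimiters" I mean prefix-style keys, i.e. a filtered listing along the lines of this rough sketch, which as far as I can tell is still capped at 1000 results per call:)

    import boto3

    client = boto3.client("s3")
    resp = client.list_objects_v2(Bucket="tmp", Prefix="my_test1/")
    for obj in resp.get("Contents", []):
        print(obj["Key"])  # only keys that start with "my_test1/" come back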

Thanks for any pointers
[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects.html

2 Answers


  1. You’ll need to make multiple calls to list_objects to get each "page" of up to 1000 items. boto3 provides paginators to make this easier. For instance:

    #!/usr/bin/env python3
    
    import boto3
    
    counter = 1
    # Access Keys/Secrets should not be in code, use "aws configure" or instance profiles
    # See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
    client = boto3.client("s3")
    # Create a paginator to page through the responses
    paginator = client.get_paginator('list_objects')
    for page in paginator.paginate(Bucket='tmp'):
        # Operate on each page, technically it's possible for a page
        # to not return any contents, so use .get() here to handle
        # the case where a different response occurs
        for x in page.get('Contents', []):
            # From here out, the code is much the same as before
            key = x['Key']
            if "my_test1" in key:
                location = "logABC" + str(counter) + ".json"
                client.download_file("tmp", key, location)
                counter += 1
    
  2. If you use the resource interface of boto3, it handles pagination for you:

    import boto3
    
    s3_resource = boto3.resource('s3')
    
    bucket = s3_resource.Bucket('example-bucket')
    
    for obj in bucket.objects.filter(Prefix='my_test1/'):
        target_filename = obj.key[obj.key.rfind('/')+1:]  # Remove path
        bucket.download_file(obj.key, target_filename)
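
    The objects.filter(Prefix='my_test1/') collection makes the paginated list calls lazily as you iterate, so it is not capped at the first 1000 keys, and the prefix means only the my_test1/ objects are listed at all.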
    