I am trying to use the boto3 Python SDK. I have a bucket titled "tmp" and keys that look like "my_test1/logABC1.json", "my_test1/logABC2.json", "my_test1/logABC3.json", etc., plus gobs of other stuff that is meaningless to me. What I want is to download all of the files in my my_test1 directory. This is what I tried:
import boto3

counter = 1
client = boto3.client("s3")  # access keys/secrets and endpoints omitted for brevity
abc = client.list_objects(Bucket="tmp")
for obj in abc["Contents"]:  # list_objects returns the objects under "Contents"
    key = obj["Key"]
    if "my_test1" in key:
        location = "logABC" + str(counter) + ".json"
        client.download_file("tmp", key, location)
        counter += 1
And this was "working" as long as my tmp bucket had fewer than 1000 items in it. Beyond that it doesn't work at all, since list_objects returns a maximum of 1000 elements per the boto3 [documentation][1], and anything after that is stuck in the cloud. My question is: how do I work around this limitation? I see there is a list_objects_v2 that can (with some work) start after the first 1000 keys, but am I missing something, or is this my best bet? If it is my best bet, do I just write a while loop that terminates once a page comes back with fewer than 1000 keys?
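For concreteness, the kind of loop I have in mind would look something like this, using the IsTruncated flag and NextContinuationToken from the list_objects_v2 response (which seems more reliable than counting keys):

resp = client.list_objects_v2(Bucket="tmp", Prefix="my_test1/")
keys = [obj["Key"] for obj in resp.get("Contents", [])]
while resp.get("IsTruncated"):
    # pick up where the previous page left off
    resp = client.list_objects_v2(
        Bucket="tmp",
        Prefix="my_test1/",
        ContinuationToken=resp["NextContinuationToken"],
    )
    keys.extend(obj["Key"] for obj in resp.get("Contents", []))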
As a side note, even if I make a direct call of
client.download_file("tmp", "my_test1/logABC2.json", "my_loc.json")
This fails to find the object when "my_test1/logABC2.json" is a key beyond the first 1000. I see there is such a thing as a resource, and if I define
rsce = boto3.resource("s3") #access_keys/secrets and endpoints omitted for brevity
rsce.Bucket("tmp").download_file("my_test1/logABC2.json", "my_loc.json")
This works even if "my_test1/logABC2.json" is not in the first 1000 keys (or at least my sample test worked, anyway), but since I do not know the exact file names I am looking for, this does not seem like a good option.
My question is: how do you download all files in a sub_directory if the bucket is very large? I feel like I must be missing something or doing something wrong, as this must come up for other people. (sub_directory is used loosely; I know there is no such thing as a sub_directory of a bucket, but with proper use of delimiters you can synthetically get away with it.)
Thanks for any pointers
[1]: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3/client/list_objects.html
2 Answers
You'll need to make multiple calls to list_objects to get each "page" of up to 1000 items. boto3 provides paginators to make this easier.
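For instance, a sketch using the client paginator for list_objects_v2, assuming the bucket name "tmp" and the "my_test1/" prefix from your question:

import boto3
import os

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")

# Prefix limits the listing to the "my_test1/" pseudo-directory;
# the paginator transparently issues follow-up requests past 1000 keys.
for page in paginator.paginate(Bucket="tmp", Prefix="my_test1/"):
    for obj in page.get("Contents", []):
        # save each object under its original file name, e.g. logABC1.json
        client.download_file("tmp", obj["Key"], os.path.basename(obj["Key"]))

The paginator tracks the continuation tokens for you, so there is no need to count keys or write the while loop by hand.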
If you use the resource interface of boto3, then it handles the pagination for you.
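A sketch, again assuming the bucket name "tmp" and the "my_test1/" prefix from your question:

import boto3
import os

s3 = boto3.resource("s3")
bucket = s3.Bucket("tmp")

# objects.filter() returns an iterable that pages through the full
# listing automatically, so the 1000-key limit is not an issue here.
for obj in bucket.objects.filter(Prefix="my_test1/"):
    if obj.key.endswith("/"):
        continue  # skip any zero-byte "folder" placeholder objects
    bucket.download_file(obj.key, os.path.basename(obj.key))

This also explains why your rsce sample worked: the resource classes hit the object directly by key rather than scanning a 1000-item listing.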