I am aware this can be done via notification for new objects, but in this case, there are already a lot of objects in the bucket (>4bn, under ~21M prefixes).
These objects have different sets of tags including some tags which are unnecessary.
I have written a script that checks the tags on an object and removes any that should not be there. It enumerates one prefix at a time and then checks every object under it, but the object count is high enough that even running it on many fast instances would take months or more.
Surprisingly, there appears to be no way to have Lambda invoked for every existing object. I was looking at S3 Batch Operations, but the documentation doesn’t explain well how objects are specified: it seems to want a CSV of all objects in the bucket to create the job, which would seem too computationally expensive to produce.
Is there anything I’m overlooking here that can do this painlessly at the required scale?
2 Answers
S3 event notifications cannot invoke Lambda for objects that already exist; they only fire for newly uploaded objects.
For the existing objects, you need to build the object list yourself: write a script that enumerates the objects you care about (filtered by tag or whatever criteria apply), save that list as a CSV manifest, then create an S3 Batch Operations job from it.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html
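As a rough illustration of the manifest step, here is a minimal sketch that writes `bucket,key` rows in the CSV layout Batch Operations accepts, with keys URL-encoded. The function name `build_manifest_csv`, the bucket name, and the keys are hypothetical; how you enumerate the keys in the first place is left out.

```python
import csv
import io
from urllib.parse import quote


def build_manifest_csv(bucket, keys):
    """Return CSV text with one `bucket,key` row per object key."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for key in keys:
        # Batch Operations expects URL-encoded object keys in CSV manifests.
        writer.writerow([bucket, quote(key, safe="/")])
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical bucket name and keys, for illustration only.
    print(build_manifest_csv("example-bucket", ["a/b.txt", "dir/my file.txt"]), end="")
```

At billions of objects you would stream rows to a file rather than build one string in memory; the row format stays the same.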
Amazon S3 Inventory can provide a daily or weekly CSV file listing all objects in a bucket.
You could write a program that reads the inventory and invokes an AWS Lambda function for each object. Given the large number of existing objects, this is probably best done by sending messages to an Amazon SQS queue and then triggering the Lambda function from the queue.
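A minimal sketch of that fan-out, assuming the inventory CSV has already been downloaded and decompressed, and that Bucket and Key are its first two columns (check the inventory's `manifest.json` for the actual column order). The helper names, the message shape, and the queue URL are all hypothetical. Messages are grouped in tens because `send_message_batch` accepts at most 10 entries per call.

```python
import csv
import json
from itertools import islice


def sqs_batches(rows, size=10):
    """Group (bucket, key) rows into SendMessageBatch entry lists (max 10 each)."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield [
            {"Id": str(i), "MessageBody": json.dumps({"bucket": b, "key": k})}
            for i, (b, k) in enumerate(chunk)
        ]


def enqueue_inventory(csv_lines, queue_url):
    """Send every object listed in an inventory CSV to SQS; return failed entries."""
    import boto3  # imported here so sqs_batches works without AWS access

    sqs = boto3.client("sqs")
    # Assumption: column 0 is the bucket and column 1 is the key.
    # Keys in inventory CSVs are URL-encoded, so the consumer must decode them.
    rows = ((r[0], r[1]) for r in csv.reader(csv_lines))
    failed = []
    for entries in sqs_batches(rows):
        resp = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed.extend(resp.get("Failed", []))  # real code should retry these
    return failed
```

Usage would look like `enqueue_inventory(open("inventory.csv", newline=""), queue_url)`, run once per inventory data file, ideally from several workers in parallel. The Lambda function subscribed to the queue then performs the tag check for each message.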