There is this S3 notification feature described here:
Amazon S3 event notifications are designed to be delivered at least once. Typically, event notifications are delivered in seconds but can sometimes take a minute or longer.
and discussed here.
I thought I could mitigate the duplicates a bit by deleting files I have already processed. The problem is that when a second event for the same file arrives (a minute later) and I try to access the file, I don’t get an HTTP 404, I get an ugly AccessDenied:
[ERROR] ClientError: An error occurred (AccessDenied) when calling the GetObject operation: Access Denied
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 111, in lambda_handler
raise e
File "/var/task/lambda_function.py", line 104, in lambda_handler
response = s3.get_object(Bucket=bucket, Key=key)
File "/var/runtime/botocore/client.py", line 391, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/var/runtime/botocore/client.py", line 719, in _make_api_call
raise error_class(parsed_response, operation_name)
which is unexpected and not acceptable.
I don’t want my lambda to suppress AccessDenied errors, for obvious reasons. Is there an easy way to find out whether the file has already been processed in the past, or whether the notification service is playing tricks?
EDIT:
For those who think this is "an indication of some bug in my application", here is the relevant piece of code:
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
logger.info(f'Requesting file from bucket {bucket} with key {key}')
try:
    response = s3.get_object(Bucket=bucket, Key=key)
except ClientError as e:
    error_code = e.response["Error"]["Code"]
    if error_code == 'NoSuchKey':
        logger.info('Object does not exist any more')
        return
    else:
        raise e
It rather smells like an ugly issue on AWS side to me.
2 Answers
You will need to inspect the error code by loading the object using the s3 resource Object to see whether it’s a 404. That way you can distinguish between a 404 and a 403, for instance, and conclude whether the file has already been deleted in the meantime.
EDIT:
Apologies, I misread the question.
In that case I would just implement idempotency in the processor to make sure you only process each file once.
On the duplicate delivery of notifications: yes, this can happen as documented, but it is relatively rare.
One possible mechanism to deal with this is to build an idempotent workflow, for example that utilizes DynamoDB to record actions against an object at a given time that can be queried to prevent duplicate workflow on the same object. There are a number of idempotency features in the AWS Lambda PowerTools suite or third-party options that you might consider.
More discussion on the duplicate event topic can be found here.
On the AccessDenied error when attempting to download an absent object that you have GetObject permission for: this is actually a security feature designed to prevent the leakage of information. If you have ListBucket permission, then you will get a 404 Not Found response indicating the absence of the object; if you don’t have ListBucket, then you will get a 403 Forbidden response. To correct this, add s3:ListBucket on arn:aws:s3:::mybucket to your IAM policy. More discussion on the AccessDenied topic can be found here.