
I have millions of S3 objects, and I want to add tags to them. If a prefix contains the keyword 111, the related tags should be added to all objects under that prefix, as shown below:

keys               tags
-----             ------
abc/111/          key=1year, value=yes
def/111/          key=1year, value=yes
def/222/          key=2year, value=yes
....               ..........
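
To make the mapping in the table concrete, here is a minimal Python sketch of the rule, assuming the keyword is matched as a whole path segment of the object key. `SEGMENT_TAGS` and `tagging_for_key` are hypothetical names, not part of any AWS library. The returned dict has the shape of the `Tagging` argument that Boto3's `put_object_tagging` accepts.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from a path segment to the tag it implies,
# copied from the table in the question.
SEGMENT_TAGS: Dict[str, Dict[str, str]] = {
    "111": {"Key": "1year", "Value": "yes"},
    "222": {"Key": "2year", "Value": "yes"},
}


def tagging_for_key(key: str) -> Optional[Dict[str, List[Dict[str, str]]]]:
    """Return a Tagging dict for the first matching path segment, or None."""
    for segment in key.split("/"):
        tag = SEGMENT_TAGS.get(segment)
        if tag is not None:
            # Copy the tag so callers cannot mutate the shared mapping.
            return {"TagSet": [dict(tag)]}
    return None
```

Matching on whole segments means a key such as `abc/1111/file.txt` is not tagged; whether that is the desired behaviour is an assumption.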

I implemented this with a Lambda function and it works, but because the bucket contains millions of objects, it fails with a timeout error. I then looked at S3 Batch Operations, but I don't see any option to create and run an S3 Batch Operations job as code.

Finally, I turned to Step Functions. I don't know much about Step Functions. Is there a way to achieve this with them? I have a requirement to complete this with Step Functions.

Any help would be appreciated.

2 Answers


  1. Good morning. I don't know much about Step Functions (I will check and reply soon).

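    Regarding S3 Batch Operations: the tagging operations it offers are "Replace all object tags" and "Delete all object tags", so a tagging job overwrites an object's existing tag set rather than appending to it. Batch Operations can also invoke a Lambda function for each object listed in a manifest file, which can be generated by an S3 Inventory rule.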

    You can list a small number of objects in each manifest file and invoke the Lambda function on it, which should avoid the timeout.

    The drawback is the manual intervention, which can itself be automated with Lambda: write a function with the Boto3 library that creates the manifest files. Creating a manifest may take up to 24 hours. Then trigger the tagging Lambda with these manifests, one manifest at a time.

    aws s3 batch operation
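
As a rough illustration of the "small manifests" idea above, the following Python sketch splits a list of keys into chunks and renders each chunk as a two-column CSV (`bucket,key`), the basic layout S3 Batch Operations accepts for CSV manifests. The helper names `chunked` and `manifest_csv` are hypothetical, and leaving `/` unencoded when URL-encoding keys is an assumption. Uploading each manifest (for example with Boto3's `put_object`) is left out.

```python
import csv
import io
from typing import Iterable, Iterator, List
from urllib.parse import quote


def chunked(keys: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` keys."""
    batch: List[str] = []
    for key in keys:
        batch.append(key)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def manifest_csv(bucket: str, keys: Iterable[str]) -> str:
    """Render one manifest as CSV text, one 'bucket,key' row per object."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for key in keys:
        # Keys are URL-encoded; keeping "/" literal is an assumption.
        writer.writerow([bucket, quote(key, safe="/")])
    return buf.getvalue()
```

For example, `[manifest_csv("my-bucket", c) for c in chunked(all_keys, 1000)]` would produce one manifest per 1,000 keys, where `my-bucket` and `all_keys` are placeholders.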

  2. To do this with AWS Step Functions, you would want to use the Distributed Map feature. See the example below that demonstrates doing this.

    The state machine uses Distributed Map to list the objects in S3 and send them to an Express Workflows Item Processor in batches of 1,000, with a max concurrency of 5. Using Express for the Item Processor is key here: in a Standard workflow, the state transitions needed to make all of these S3 calls would add up to a significant cost, and that cost is unnecessary because the operation is idempotent and safe to retry. That makes it a great fit for Express Workflows. It also shows how Distributed Map lets you compose Standard (for the outer part of the workflow) with Express to get the right price-performance characteristics.

    In the Express Workflows Item Processor, the state machine uses an Inline Map to process each batch of items, calling PutObjectTagging for each item. The max concurrency here is set to 5, so combined with the max concurrency of 5 on the Distributed Map, you will have at most about 25 concurrent calls to S3 (5 × 5). That should keep your call rate within acceptable limits, and you can tune both values up or down as required.
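
To get a feel for how the two concurrency settings and the batch size interact, here is a small back-of-the-envelope Python sketch. `estimate_fanout` is a hypothetical helper, and the child-execution count is a lower bound, since a byte limit on batches can make them smaller than the item limit.

```python
from typing import Dict


def estimate_fanout(
    object_count: int,
    max_items_per_batch: int = 1000,
    outer_concurrency: int = 5,
    inner_concurrency: int = 5,
) -> Dict[str, int]:
    """Estimate child executions and peak concurrent S3 calls."""
    # Integer ceiling division: one child execution per (possibly partial) batch.
    child_executions = -(-object_count // max_items_per_batch)
    return {
        "child_executions": child_executions,
        "peak_concurrent_calls": outer_concurrency * inner_concurrency,
    }
```

With the defaults above, 5,000,000 objects would need at least 5,000 child executions, with no more than 25 PutObjectTagging calls in flight at once.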


    {
      "Comment": "A state machine to bulk tag objects from S3 using Distributed Map",
      "StartAt": "Confirm Bucket Provided",
      "States": {
        "Confirm Bucket Provided": {
          "Type": "Choice",
          "Choices": [
            {
              "Not": {
                "Variable": "$.bucket",
                "IsPresent": true
              },
              "Next": "Fail - No Bucket"
            }
          ],
          "Default": "Check for Prefix"
        },
        "Check for Prefix": {
          "Type": "Choice",
          "Choices": [
            {
              "Not": {
                "Variable": "$.prefix",
                "IsPresent": true
              },
              "Next": "Generate Parameters - Without Prefix"
            }
          ],
          "Default": "Generate Parameters - With Prefix"
        },
        "Generate Parameters - Without Prefix": {
          "Type": "Pass",
          "Parameters": {
            "Bucket.$": "$.bucket",
            "Prefix": ""
          },
          "ResultPath": "$.list_parameters",
          "Next": "Tag Objects in S3 Bucket"
        },
        "Fail - No Bucket": {
          "Type": "Fail",
          "Error": "InsufficientArguments",
          "Cause": "No Bucket was provided"
        },
        "Generate Parameters - With Prefix": {
          "Type": "Pass",
          "Next": "Tag Objects in S3 Bucket",
          "Parameters": {
            "Bucket.$": "$.bucket",
            "Prefix.$": "$.prefix"
          },
          "ResultPath": "$.list_parameters"
        },
        "Tag Objects in S3 Bucket": {
          "Type": "Map",
          "ItemProcessor": {
            "ProcessorConfig": {
              "Mode": "DISTRIBUTED",
              "ExecutionType": "EXPRESS"
            },
            "StartAt": "Tag Objects",
            "States": {
              "Tag Objects": {
                "Type": "Map",
                "ItemProcessor": {
                  "ProcessorConfig": {
                    "Mode": "INLINE"
                  },
                  "StartAt": "PutObjectTagging",
                  "States": {
                    "PutObjectTagging": {
                      "Type": "Task",
                      "Parameters": {
                        "Bucket.$": "$.Bucket",
                        "Key.$": "$.Key",
                        "Tagging": {
                          "TagSet": [
                            {
                              "Key": "mykey",
                              "Value": "myvalue"
                            }
                          ]
                        }
                      },
                      "Resource": "arn:aws:states:::aws-sdk:s3:putObjectTagging",
                      "End": true,
                      "Retry": [
                        {
                          "ErrorEquals": [
                            "States.ALL"
                          ],
                          "BackoffRate": 2,
                          "IntervalSeconds": 1,
                          "MaxAttempts": 3
                        }
                      ]
                    }
                  }
                },
                "ItemsPath": "$.Items",
                "ResultPath": "$.object_identifiers",
                "Next": "Clear Output",
                "ItemSelector": {
                  "Key.$": "$$.Map.Item.Value.Key",
                  "Bucket.$": "$.BatchInput.bucket"
                },
                "MaxConcurrency": 5
              },
              "Clear Output": {
                "Type": "Pass",
                "End": true,
                "Result": {}
              }
            }
          },
          "ItemReader": {
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {
              "Bucket.$": "$.list_parameters.Bucket",
              "Prefix.$": "$.list_parameters.Prefix"
            }
          },
          "MaxConcurrency": 5,
          "Label": "S3objectkeys",
          "ItemBatcher": {
            "MaxInputBytesPerBatch": 204800,
            "MaxItemsPerBatch": 1000,
            "BatchInput": {
              "bucket.$": "$.list_parameters.Bucket"
            }
          },
          "ResultSelector": {},
          "End": true
        }
      }
    }
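
The state machine above reads `bucket` and an optional `prefix` from its execution input. A minimal Python sketch of starting it could look like the following, assuming a Boto3 Step Functions client is passed in (for example `boto3.client("stepfunctions")`). The helper names and the state machine ARN are hypothetical placeholders.

```python
import json
from typing import Any, Optional


def build_execution_input(bucket: str, prefix: Optional[str] = None) -> str:
    """Build the JSON input the state machine expects."""
    payload = {"bucket": bucket}
    if prefix:
        # Omitting "prefix" sends the workflow down its "Without Prefix" branch.
        payload["prefix"] = prefix
    return json.dumps(payload)


def start_tagging(
    sfn_client: Any,
    state_machine_arn: str,
    bucket: str,
    prefix: Optional[str] = None,
) -> str:
    """Start one execution and return its ARN."""
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=build_execution_input(bucket, prefix),
    )
    return response["executionArn"]
```

Because the tag in the definition is fixed (`mykey`/`myvalue`), applying a different tag per prefix, as the question asks, would need the tag values passed in through the input or a separate state machine per tag. That adaptation is not shown here.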
    