
Context: I have 2 buckets, bucket A and bucket B. Bucket A had all of its contents placed in Bucket B via the aws s3 sync CLI command.

Problem: I want to delete all the items in bucket B that also exist in Bucket A, without deleting anything in Bucket A.

E.g.

Bucket A (Source):

  1. File R
  2. File G
  3. File C

Bucket B (Destination):

  1. File A
  2. File R
  3. File G
  4. File C
  5. File O

^^ I need to delete every file in the destination (Bucket B) that also exists in the source (Bucket A), so only files R, G, and C should be deleted from Bucket B.

Attempted Solution: The aws s3 sync CLI command includes the flag --delete. However, this flag only deletes files in the destination that aren’t in the source, which is the opposite of what I need.

Is there any way to do this using aws s3 sync?

2 Answers


  1. Chosen as BEST ANSWER

    I ended up solving this with S3 bucket lifecycle rules, where I specified a filter that matched the necessary files in both buckets. Note that lifecycle rule filters match on key prefix or object tags rather than on full regular expressions, so this approach works when the files to delete share a prefix or tag.

    Using S3 lifecycle rules allows any number of files (even millions) to be deleted at midnight UTC and does not incur a cost, unlike the aws s3 CLI, which needs to list objects in order to delete them programmatically (in batches of up to 1000 at a time).
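
    A minimal sketch of what such a rule could look like, assuming the copies in Bucket B all live under one key prefix. The prefix copied/, the rule ID expire-synced-copies, and the helper name lifecycle_json are hypothetical placeholders, not taken from the answer above.

    ```shell
    # Print a lifecycle configuration that expires every object under a key prefix.
    # $1: key prefix to expire (hypothetical, e.g. "copied/")
    # $2: days after object creation before expiry (defaults to 1)
    lifecycle_json() {
      printf '{"Rules":[{"ID":"expire-synced-copies","Status":"Enabled","Filter":{"Prefix":"%s"},"Expiration":{"Days":%d}}]}\n' "$1" "${2:-1}"
    }

    # Review the JSON first, then apply it to the bucket that holds the copies:
    # lifecycle_json "copied/" > lifecycle.json
    # aws s3api put-bucket-lifecycle-configuration \
    #   --bucket bucket_b --lifecycle-configuration file://lifecycle.json
    ```

    Be aware that put-bucket-lifecycle-configuration replaces any lifecycle configuration already set on the bucket, so existing rules would need to be merged into the JSON first.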


  2. For a one-off cleanup, if the number of files to delete is small, use the sequence of shell commands below. Run it with the --dryrun flag first and then, if the output looks as expected, without the flag.

    aws s3 ls s3://bucket_a/ | tr -s ' ' | cut -d' ' -f 4 | xargs -t -I % aws s3 rm --dryrun s3://bucket_b/%
    

    Explanation of each command is below.

    aws s3 ls s3://bucket_a/ # List files in the original location.
    tr -s ' ' # Squeeze repeated spaces so that cut sees single-space delimiters.
    cut -d' ' -f 4 # Extract file names (the fourth field; names containing spaces are truncated).
    xargs -t -I % aws s3 rm --dryrun s3://bucket_b/% # Execute rm for each file name in the location with the copies.
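
    If some file names contain spaces, a variant along these lines may be safer. This is only a sketch: it assumes the usual "date time size name" layout of aws s3 ls output, and the helper name keys_from_ls is made up for illustration.

    ```shell
    # Read `aws s3 ls` output on stdin and print one full file name per line.
    keys_from_ls() {
      # Strip the leading "date time size " columns; lines that do not match
      # (for example "PRE logs/" prefix rows) are skipped.
      sed -E -n 's/^[0-9]{4}-[0-9]{2}-[0-9]{2} +[0-9]{2}:[0-9]{2}:[0-9]{2} +[0-9]+ //p'
    }

    # Dry run first; drop --dryrun once the output looks right.
    # aws s3 ls s3://bucket_a/ | keys_from_ls | while IFS= read -r key; do
    #   aws s3 rm --dryrun "s3://bucket_b/${key}"
    # done
    ```

    Quoting "s3://bucket_b/${key}" keeps a name such as File R in one argument instead of splitting it at the space.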
    

    Alternatively, if the number of files is large, use the batch delete command below. Note that the batch command does not have a --dryrun option.

    aws s3 ls s3://bucket_a/ | tr -s ' ' | cut -d' ' -f 4 | xargs -I % printf '{Key=%}\n' | xargs -n1000 echo | tr ' ' , | xargs -t -I % aws s3api delete-objects --bucket bucket_b --delete 'Objects=[%]'
    

    Explanation of the command is below.

    xargs -I % printf '{Key=%}\n' # Format S3 keys for the batch command, one per line.
    xargs -n1000 echo # Group the keys in batches, 1000 keys each.
    tr ' ' , # Format the groups for the batch command. 
    xargs -t -I % aws s3api delete-objects --bucket bucket_b --delete 'Objects=[%]' # Delete the files batch by batch. 
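
    The shorthand Objects=[...] syntax can break on keys that contain spaces or commas. One possible workaround, sketched below, is to pass JSON to --delete instead. The helper name delete_payloads is hypothetical, and the sketch assumes one key per line with no control characters in the keys.

    ```shell
    # Read one key per line on stdin; print one JSON delete payload per batch.
    # $1: batch size (defaults to 1000, the maximum per delete-objects request)
    delete_payloads() {
      batch_size="${1:-1000}"
      # Escape backslashes and double quotes so each key is a valid JSON string,
      # then join the keys into batches of at most batch_size entries.
      sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk -v n="$batch_size" '
          { buf = buf (count ? "," : "") "{\"Key\":\"" $0 "\"}"; count++ }
          count == n { print "{\"Objects\":[" buf "],\"Quiet\":true}"; buf = ""; count = 0 }
          END { if (count) print "{\"Objects\":[" buf "],\"Quiet\":true}" }
        '
    }

    # There is no dry run here, so inspect the payloads before deleting:
    # <command that prints one key per line> | delete_payloads | while IFS= read -r payload; do
    #   aws s3api delete-objects --bucket bucket_b --delete "$payload"
    # done
    ```

    Printing the payloads to the terminal first is a cheap substitute for the missing --dryrun option.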
    