
I am building an online editing tool. It allows users to upload different types of media and files, which are stored in S3. I’m struggling to come up with a robust, future-proof way to manage these files, particularly once they are no longer being used. I’m currently using Postgres for my database, with a table that stores all the different elements that can be used in the editor. It has regular columns for IDs, foreign keys, and other data common to all elements, plus a JSONB column that stores all the data unique to an element. Every element type has a different JSON structure, and as the application grows and more element types are added, there will be even more variation in structure. File locations/keys can be nested anywhere in the JSON object.
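
For illustration, here are two hypothetical JSONB values (the element types and field names are made up, but they show how S3 keys can end up at arbitrary depths):

    # Hypothetical JSONB payloads for two element types.
    # Note the S3 keys appear at different, arbitrary depths.
    image_element = {
        "src": "uploads/abc123/photo.png",  # key at the top level
        "alt": "A beach photo",
    }
    gallery_element = {
        "title": "Vacation",
        "slides": [
            {"image": {"key": "uploads/abc123/beach.jpg"}},   # nested key
            {"image": {"key": "uploads/abc123/sunset.jpg"}},  # nested key
        ],
    }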

Here’s an example from my application: a user uploads an image, but then decides not to use it and deletes the element that contains it. I could remove the image from S3 at that moment without any issues. However, if the user changes their mind and undoes the deletion, the image will no longer exist in S3. One potential solution would be to store the image locally and re-upload it on undo, but that seems complex to implement and could add latency and redundant uploads. On the other hand, if I don’t delete the file, and the user deletes the element and never undoes that action, the file is now orphaned in S3 and my application has no way of knowing it isn’t being used, or even that it exists.

I’m curious about how applications like Figma handle this problem.

I’ve considered various solutions, but I’m not satisfied with the one I’m currently using. In my current approach, every element in the database has a field containing an array of all the keys that have been uploaded for it. When a file is uploaded, its key is added to this array, and when it is deleted, the key is removed. Whenever the element is saved, the backend diffs this array against the previous version; if any keys have been removed, a record of those keys is created so that a job can run later and delete the unused files if they haven’t been accessed for over a week (see the sketch below). I believe this approach is quite fragile: every time a new element type is added, we need to remember to wire up the adding and removing of uploaded file keys, which could easily be forgotten. There are numerous touchpoints and processes that have to be followed for this approach to work properly. I need a simple, robust way to manage these uploaded files so that unnecessary files don’t accumulate in S3.
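
Roughly, the diff step looks like this (a simplified sketch; the real code runs inside the save endpoint and persists the queue in Postgres rather than in memory):

    from datetime import datetime, timezone

    # Stand-in for the "pending deletions" table the cleanup job reads later.
    pending_deletions: list[tuple[str, datetime]] = []

    def on_element_save(old_keys: set[str], new_keys: set[str]) -> None:
        """Diff the per-element key arrays and queue removed keys for cleanup."""
        for key in old_keys - new_keys:
            pending_deletions.append((key, datetime.now(timezone.utc)))

    # Example: the user removed one of two previously uploaded files.
    on_element_save({"uploads/a.png", "uploads/b.png"}, {"uploads/a.png"})
    print(pending_deletions)  # [("uploads/b.png", <timestamp>)]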

Please let me know if you need any clarifications. Thanks.

2 Answers


  1. This looks like a very good use case for leveraging AWS S3 Lifecycle Policies along with object tagging. When a user uploads an object for a canvas/element, you can upload it to S3 and tag it (say, Referenced = True) using the PutObject API, and continue to store the JSON as you do currently. When the user deletes the element, you can use a similar API to update the object’s tag (re-tagging it as Referenced = False); if they undo the deletion, simply flip the tag back to True.
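
    A minimal boto3 sketch of that flow (the bucket and key names here are placeholders):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-editor-uploads"  # placeholder

    # On upload: store the object and mark it as referenced.
    s3.put_object(
        Bucket=BUCKET,
        Key="uploads/abc123/photo.png",
        Body=b"...image bytes...",
        Tagging="Referenced=True",
    )

    # When the containing element is deleted: flip the tag so the
    # lifecycle rule below will expire the object.
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key="uploads/abc123/photo.png",
        Tagging={"TagSet": [{"Key": "Referenced", "Value": "False"}]},
    )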

    Now, based on your application requirements, you can set up an S3 lifecycle policy that says: after 7 days, delete all objects in the bucket that have the tag Referenced = False.

    Here’s a sample S3 Lifecycle Configuration you can set up on your bucket:

    
    <LifecycleConfiguration>
      <Rule>
        <ID>Rule 1</ID>
        <Filter>
          <Tag>
             <Key>Referenced</Key>
             <Value>False</Value>
          </Tag>
        </Filter>
        <Status>Enabled</Status>
        <Expiration>
          <Days>7</Days>
        </Expiration>
      </Rule>
    </LifecycleConfiguration>
    

    With this setup, you no longer need to maintain a separate array/state for files to be deleted; object lifecycle management is delegated to AWS S3.
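
    If you manage the bucket from code rather than the console, the same rule can be applied with boto3 (a sketch; the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Equivalent of the XML rule above: expire objects tagged
    # Referenced=False after 7 days (measured from object creation).
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-editor-uploads",  # placeholder
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "Rule 1",
                    "Filter": {"Tag": {"Key": "Referenced", "Value": "False"}},
                    "Status": "Enabled",
                    "Expiration": {"Days": 7},
                }
            ]
        },
    )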

  2. It might be easier to:

    • Perform a weekly or monthly scan of the data to make a list of all referenced S3 objects
    • Then, delete any objects older than a month if they are not referenced in the list
      • Or, keep the list and compare it to the next list that is produced. Only delete objects if they are not referenced on both lists (this week/month and last week/month).

    You can obtain a list of objects currently in the bucket by using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. This way, you don’t need to scan all the objects in S3 — just compare the Inventory report against the list of referenced objects.
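
    A sketch of the compare-and-delete step, assuming the referenced keys have already been extracted from the JSONB data and the Inventory reports downloaded as CSV (the file names, column layout, and bucket name are placeholders; a real Inventory report is split across gzipped files listed in a manifest):

    import csv
    import boto3

    def unreferenced(inventory_csv: str, referenced: set[str]) -> set[str]:
        """Keys present in an S3 Inventory report but not referenced in the DB."""
        with open(inventory_csv, newline="") as f:
            in_bucket = {row[1] for row in csv.reader(f)}  # assumed columns: bucket, key, ...
        return in_bucket - referenced

    # Placeholder sets; in practice, walk every element's JSONB for keys.
    referenced_now = {"uploads/a.png"}
    referenced_before = {"uploads/a.png"}

    # Only delete objects that are unreferenced in two consecutive reports.
    to_delete = (
        unreferenced("inventory_now.csv", referenced_now)
        & unreferenced("inventory_before.csv", referenced_before)
    )

    if to_delete:
        boto3.client("s3").delete_objects(
            Bucket="my-editor-uploads",  # placeholder
            Delete={"Objects": [{"Key": k} for k in sorted(to_delete)[:1000]]},
        )  # delete_objects accepts at most 1,000 keys per call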

    The cost of storing "unreferenced" objects in S3 is not high, so there is no urgency to delete objects. Waiting a month to delete them will not be a large cost burden (compared with never deleting objects).
