
I need to name a file after its contents as a way to prevent duplicates. The files are images, and prior to saving a new file I check my media library table for duplicate names.

So I use hash_file, but I am unsure which hash algorithm to pick.

md5, sha1, sha256, something else?

I know md5 is quickest and sha256 is less likely to collide, but sha256 produces very long filenames (64 hex characters versus 32 for md5).

In the real-world scenario of user-uploaded images on a website, which one is best?
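
For concreteness, the approach described above could be sketched roughly as follows. The question mentions hash_file (a PHP function); this sketch uses Python's standard hashlib instead. The names `content_name`, `save_if_new` and the in-memory `library` set are hypothetical stand-ins, with the set playing the role of the media library table.

```python
import hashlib

def content_name(data: bytes, extension: str, algorithm: str = "sha256") -> str:
    """Build a filename from a hash of the file's bytes."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{digest}.{extension}"

# Hypothetical stand-in for the media library table of stored names.
library = set()

def save_if_new(data: bytes, extension: str) -> bool:
    """Record the name and return True, or return False for a duplicate."""
    name = content_name(data, extension)
    if name in library:
        return False  # identical bytes were already stored under this name
    library.add(name)
    return True
```

Because the name depends only on the bytes, uploading the same image twice yields the same name, and the second upload is rejected by the lookup.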

3 Answers


  1. md5 is fine as long as you’re not worried about malicious users generating collisions on purpose, and (using a birthday paradox calculator) the risk of a natural collision seems low enough.

    If you want to reduce this risk, use a stronger hash such as sha256. Not sure why you would care about filename length. It’s not like you’re going to manually type these.

  2. Well, I would not allow users to upload files unless they are registered and logged in. If that is the case, create a subfolder for each user, named after the username, which has to be unique. Then a combination of a timestamp and md5 would do it: join the timestamp and the md5 with an underscore between them, so it’s easy to chop off the timestamp and compare the hash to see whether the image is already present. At least, that’s how I would do it.

  3. Good start, panthro.

    You could salt the hash in a couple of ways:

    1. As Aranxo suggests, add the date_created to the filename and then the hash after it, with or without some delimiter.
    2. Keep the filename, create some sort of hash (MD5, sha256, etc), and then truncate the hash to a length where it and the filename are constant across all records.

    Point two could be more interesting, since it would stop the collision case in Aranxo’s answer where two files with colliding MD5 hashes arrive at the same moment, so that both the timestamps (down to their finest resolution) and the MD5 hashes are equal.

    HPC systems might see this case when handling a large number (n) of file storage requests.

    Also, point two would implicitly enforce a character limit on the filename length, which could prevent malicious or accidental failures on the system.

    You could salt the hash further with additional metadata, such as the IP address the request came from, or even the username if the user agrees to this in a Privacy Policy.
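
The birthday-paradox estimate mentioned in the first answer, and the cost of truncating a hash as in the third, can be put into rough numbers. This is only a sketch under the usual assumption that hash outputs behave like uniformly random values and nobody is crafting collisions on purpose; the function name `collision_probability` and the file counts are hypothetical.

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = n_files * (n_files - 1) / (2 * 2 ** hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # Compare full md5 (128 bits), full sha256 (256 bits) and a hash
    # truncated to 64 bits (16 hex characters) for one million files.
    for label, bits in (("md5", 128), ("sha256", 256), ("truncated", 64)):
        print(label, collision_probability(1_000_000, bits))
```

Under these assumptions the accidental-collision risk for full md5 stays negligible at image-library scale, while each bit removed by truncation roughly doubles it, so a truncated hash trades filename length for a higher (though possibly still acceptable) collision chance.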

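
The timestamp-plus-md5 naming from the second answer might look something like the sketch below. It is illustrative only: the helper names `stamped_name`, `hash_part` and `already_present` are hypothetical, and the list of existing names stands in for the contents of a user's subfolder.

```python
import hashlib

def stamped_name(data: bytes, timestamp: int) -> str:
    """Join a timestamp and the md5 of the bytes with an underscore."""
    return f"{timestamp}_{hashlib.md5(data).hexdigest()}"

def hash_part(name: str) -> str:
    """Chop off the timestamp, keeping everything after the first underscore."""
    return name.split("_", 1)[1]

def already_present(data: bytes, existing_names) -> bool:
    """Check whether any stored name carries the same md5 as the new bytes."""
    digest = hashlib.md5(data).hexdigest()
    return any(hash_part(name) == digest for name in existing_names)
```

Note that the duplicate check compares only the hash part, so the timestamp makes names unique across uploads but does not by itself prevent storing the same image twice.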