I need to name a file after its contents as a way to prevent duplicates. The files are images, and prior to saving a new file I check my media library table for duplicate names.
So I use hash_file, but I am wondering which hashing algorithm to choose: md5, sha1, sha256, or something else?
I know md5 is quickest and sha256 is less likely to collide, but sha256 produces very long filenames.
In the real world scenario of user uploaded images on a website, which one is best?
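For reference, the filename lengths in question can be checked directly. This is a minimal sketch: `digest_lengths` is a hypothetical helper name, and it only assumes that `hash_file` is called with its default hex output.

```php
<?php
// Hypothetical helper: report how long the hex digest (and so the
// filename stem) is for each candidate algorithm.
function digest_lengths(string $path, array $algos = ['md5', 'sha1', 'sha256']): array
{
    $lengths = [];
    foreach ($algos as $algo) {
        // hash_file() returns a lowercase hex string by default.
        $lengths[$algo] = strlen(hash_file($algo, $path));
    }
    return $lengths;
}
```

For any readable file this gives 32 characters for md5, 40 for sha1 and 64 for sha256, since the digests are 128, 160 and 256 bits and each hex character encodes 4 bits.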
3 Answers
md5 is fine as long as you’re not worried about malicious users generating collisions on purpose; going by a birthday paradox calculator, the risk of a natural collision seems low enough. If you want to reduce this risk, use a stronger hash such as sha256. I’m not sure why you would care about filename length; it’s not like you’re going to type these names manually.
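The birthday paradox estimate can be reproduced in a few lines. This is a rough sketch, assuming hash outputs are uniformly distributed and that nobody is crafting collisions on purpose; `collision_probability` is a hypothetical name, and the formula is the usual approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)).

```php
<?php
// Approximate probability that at least two of $files random digests
// of $bits bits are equal (birthday bound).
function collision_probability(float $files, int $bits): float
{
    // Number of distinct pairs of files.
    $pairs = $files * ($files - 1) / 2;
    // Expected number of colliding pairs; 2.0 ** $bits stays a float.
    $x = $pairs / (2.0 ** $bits);
    // -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -expm1(-$x);
}
```

With a million images and md5 (128 bits), `collision_probability(1e6, 128)` comes out at roughly 1.5e-27, which is why an accidental collision is not a practical concern at that scale.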
Well, I would not allow users to upload files unless they are registered and logged in. If that is the case, create a subfolder for each user, named after the username, which has to be unique. Then a combination of a timestamp and the md5 hash would do it: join the timestamp and the md5 with an underscore, so it’s easy to chop off the timestamp and compare the hash to check whether the image is already present. At least, that’s how I would do it.
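The timestamp-plus-md5 naming could look something like this. It is only a sketch under assumptions: the function names, the per-user directory path and the way the extension is passed in are all made up for illustration.

```php
<?php
// Build "<timestamp>_<md5>.<ext>" for an uploaded file.
function build_upload_name(string $tmpPath, string $extension, ?int $timestamp = null): string
{
    $timestamp = $timestamp ?? time();
    return $timestamp . '_' . md5_file($tmpPath) . '.' . $extension;
}

// Chop off the timestamp and extension, leaving only the md5 part.
function hash_from_name(string $filename): string
{
    $base = pathinfo($filename, PATHINFO_FILENAME);
    $pos = strpos($base, '_');
    return $pos === false ? '' : substr($base, $pos + 1);
}

// Return the name of an existing file in the user's folder with the
// same content hash, or null if the upload is new.
function find_duplicate(string $userDir, string $tmpPath): ?string
{
    $hash = md5_file($tmpPath);
    foreach (scandir($userDir) as $entry) {
        if ($entry === '.' || $entry === '..') {
            continue;
        }
        if (hash_from_name($entry) === $hash) {
            return $entry;
        }
    }
    return null;
}
```

Scanning the folder is fine for small libraries; with a media library table, as in the question, storing the hash in its own column and querying that would avoid the directory scan.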
Good start, panthro.
You could salt the hash in a couple of ways:
Point two could be more interesting, since it would prevent the collision case in Aranxo’s answer, where two files with the same MD5 hash are uploaded at the same moment, so that both the timestamps (down to the whole second) and the MD5 hashes are equal.
HPC (high-performance computing) systems might see this case when handling a large number (n) of file storage requests.
Also, point two would implicitly enforce a character limit on the filename length, which could prevent malicious or accidental failures on the system.
You could salt the hash further with additional metadata, such as the IP address the request came from, or even the username if the user agrees to this in a Privacy Policy.
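One possible way to mix such metadata into the hash is sketched below. The function name `salted_name` and the metadata keys are hypothetical, and sha256 is just a default picked for the example; it uses PHP's incremental hashing functions so the file is streamed rather than loaded into memory.

```php
<?php
// Hash the metadata first, then the file contents, into one digest.
function salted_name(string $path, array $metadata, string $algo = 'sha256'): string
{
    // Sort keys so the same metadata always yields the same digest.
    ksort($metadata);
    $ctx = hash_init($algo);
    foreach ($metadata as $key => $value) {
        hash_update($ctx, $key . '=' . $value . "\n");
    }
    // Stream the file contents into the same hashing context.
    hash_update_file($ctx, $path);
    return hash_final($ctx);
}
```

Something like `salted_name($tmpPath, ['username' => 'panthro'])` would then give the filename stem. Keep in mind that once the username or IP is part of the salt, the same image uploaded by two different users gets two different names, so duplicates are only detected among uploads that share the same metadata.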