I need to store some chunks of data remotely and compare them to see if there are duplicates.
I will compile a dedicated C program, and I would like to compress these chunks with GZIP.
My question is: if I compress the same chunk of data with the same C program, using a gzip library, on different computers, will it give exactly the same result, or could the compressed results differ?
The target PCs/servers could run different Linux distributions such as Ubuntu, CentOS, Debian, etc.
Can I force the same result by statically linking a specific gzip library?
3 Answers
That’s not enough: at the very least you also need the same compression level, plus the same values for any other options your particular library exposes (usually it’s just the level).
If you use the same version of the library and the same compression level, then the output is likely to be identical (or stable, as you call it). That’s not a very strong guarantee, however. I’d recommend using a hash function instead; that’s what they’re meant for.
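To make the point about the compression level concrete, here is a minimal sketch in Python (chosen only for brevity; the question is about a C program). It uses Python's zlib module, which wraps the zlib library, and the sample data is made up for illustration. The same input compressed at two different levels decompresses to the same content, but the compressed bytes are not the same.

```python
import zlib

# Made-up sample chunk, for illustration only.
data = b"the same chunk of data, repeated. " * 200

fast = zlib.compress(data, 1)  # level 1: fastest
best = zlib.compress(data, 9)  # level 9: best compression

# Both streams carry the same content...
same_content = zlib.decompress(fast) == zlib.decompress(best) == data
# ...but comparing the compressed bytes directly tells a different story.
same_bytes = fast == best

print("same content after decompression:", same_content)
print("same compressed bytes:", same_bytes)
```

So comparing compressed chunks only makes sense if every machine is pinned to the same level, and even then it is only as reliable as the library version behind it.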
While it may be true in the majority of cases, I don’t think you can safely make this assumption. The compressed output can differ depending on the default compression level and the coding used by the library. For example, the GNU gzip tool uses LZ77 and OpenBSD gzip uses compress (according to Wikipedia). I don’t know whether this difference comes from different libraries or from different configurations of the same library, but nonetheless I would really avoid assuming that a generic chunk of gzipped data is exactly the same when compressed using different implementations.
Yes, static linking could be a solution. Using the same version of the same library with the same configuration across different systems would give you the same compressed output.
You could also avoid this problem in other ways:
No, not unless you are 100% certain you are using exactly the same version of the same source code with the same settings, and that you have disabled the modified timestamp in the gzip header.
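As a rough illustration of the timestamp point, here is a minimal Python sketch (Python only for brevity; the question is about C). It assumes Python 3.8 or later, where gzip.compress accepts an mtime argument, and gzip_chunk and header_mtime are hypothetical helper names. It pins the compression level and zeroes the 4-byte MTIME field that sits at offset 4 of the gzip header.

```python
import gzip
import struct

def gzip_chunk(chunk: bytes, level: int = 6) -> bytes:
    """Compress with a pinned level and the header timestamp set to zero."""
    return gzip.compress(chunk, compresslevel=level, mtime=0)

def header_mtime(blob: bytes) -> int:
    """Read the little-endian MTIME field at offset 4 of a gzip member."""
    return struct.unpack_from("<I", blob, 4)[0]

# Made-up sample chunk, for illustration only.
chunk = b"example chunk of data " * 100

first = gzip_chunk(chunk)
second = gzip_chunk(chunk)

print("MTIME in header:", header_mtime(first))
print("two runs identical:", first == second)
```

This only shows that two runs against the same library build agree once the timestamp is out of the way. It says nothing about what a different library version or implementation would produce.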
It’s not clear what you’re doing with these compressed chunks, but if the idea is to have less to transmit and compare, then you can do far better with a hash. Compute a SHA-256 hash of each uncompressed chunk, and then you can transmit and compare those in no time. The probability of an accidental match is so infinitesimally small that you’d have to wait for all the stars to go out to see one such occurrence.
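A minimal sketch of the hash-based comparison, again in Python for brevity and using only the standard hashlib module. The names chunk_digest and find_duplicates are hypothetical, and the in-memory list of chunks stands in for whatever storage you actually use.

```python
import hashlib

def chunk_digest(chunk: bytes) -> str:
    """Return the SHA-256 digest of an uncompressed chunk as a hex string."""
    return hashlib.sha256(chunk).hexdigest()

def find_duplicates(chunks):
    """Return (first_index, duplicate_index) pairs for chunks with equal digests."""
    seen = {}        # digest -> index of the first chunk that produced it
    duplicates = []
    for index, chunk in enumerate(chunks):
        digest = chunk_digest(chunk)
        if digest in seen:
            duplicates.append((seen[digest], index))
        else:
            seen[digest] = index
    return duplicates

# Made-up chunks, for illustration only.
chunks = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
print(find_duplicates(chunks))
```

The digest depends only on the uncompressed bytes, so it is the same on every machine regardless of which compression library, version, or level is used afterwards. You only need to transmit 32 bytes per chunk to check for a duplicate.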