
I need to store some chunks of data remotely and compare them to see if there are duplicates.
I will compile a specific C program, and I would like to compress these chunks with gzip.

My doubt is: if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?

Target PCs/servers could run different Linux distributions like Ubuntu/CentOS/Debian, etc.

Can I force the same result by statically linking a specific gzip library?

3 Answers


  1. Can I force the same result by statically linking a specific gzip library?

    That’s not enough: you also need the same compression level at the very least, as well as any other options your particular library might have (usually it’s just the level).

    If you use the same version of the library and the same compression level, then it’s likely that the output is identical (or stable, as you call it). That’s not a very strong guarantee, however; I’d recommend using a hash function instead, since that’s exactly what they’re meant for.
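    As a minimal sketch of pinning the compression level, assuming zlib is the gzip library in question (note that compress2() emits the zlib format rather than the gzip format; it is only meant to illustrate never relying on a default level):

    ```c
    #include <stdio.h>
    #include <zlib.h>

    /* Minimal sketch: compress a chunk with an explicitly pinned level
     * instead of relying on a library default. compress2() emits the
     * zlib format, not gzip; it only illustrates fixing the level. */
    int main(void)
    {
        const unsigned char chunk[] = "the same bytes on every machine";
        unsigned char out[128];     /* large enough for this small demo */
        uLongf out_len = sizeof out;

        /* Level hard-coded to 6: defaults may differ between builds. */
        if (compress2(out, &out_len, chunk, sizeof chunk - 1, 6) != Z_OK)
            return 1;

        printf("compressed %u -> %lu bytes\n",
               (unsigned)(sizeof chunk - 1), (unsigned long)out_len);
        return 0;
    }
    ```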

  2. if I compress the same chunk of data with the same C program using a gzip library on different computers, will it give the exact same result or could it give different compressed results?

    While it may be true in the majority of cases, I don’t think you can safely make this assumption. The compressed output can differ depending on the default compression level and the coding used by the library. For example, GNU gzip uses LZ77 while OpenBSD’s gzip is based on compress (according to Wikipedia). I don’t know whether this difference comes from different libraries or from different configurations of the same library, but either way I would avoid assuming that a generic chunk of gzipped data comes out exactly the same when compressed with different implementations.

    Can I force the same result by statically linking a specific gzip library?

    Yes, this could be a solution. Using the same version of the same library with the same configuration across different systems would give you the same compressed output.


    You could also avoid this problem in other ways:

    1. Perform the compression on the server and only send uncompressed data (probably not a good solution, since sending uncompressed data is slow).
    2. Use hashes of the uncompressed data: store them on the server and have the client send a hash first, then the compressed data only if the server says the hash doesn’t match (i.e. the chunk is not a duplicate). This has the advantage of only needing to check the hash, avoiding compression altogether when the hash matches (see the sketch after this list).
    3. Similar to option 2, use hashes of the uncompressed data, but always send compressed data to the server. The server then decompresses it (which can easily be done in memory with a relatively small buffer) and hashes the uncompressed data to check whether the received chunk is a duplicate before storing it.
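    A minimal sketch of the client-side hashing in option 2, assuming OpenSSL is available for the digest (the one-shot SHA256() helper is used for brevity; it is deprecated in OpenSSL 3.0 in favour of the EVP API but still works):

    ```c
    #include <stdio.h>
    #include <openssl/sha.h>

    /* Option 2, client side: hash the *uncompressed* chunk and send
     * the digest ahead of any payload, so the server can detect a
     * duplicate before compressed data is ever transferred. */
    int main(void)
    {
        const unsigned char chunk[] = "chunk contents";
        unsigned char digest[SHA256_DIGEST_LENGTH];

        SHA256(chunk, sizeof chunk - 1, digest);

        /* Hex-encode the 32-byte digest for transmission. */
        for (int i = 0; i < SHA256_DIGEST_LENGTH; i++)
            printf("%02x", digest[i]);
        putchar('\n');
        return 0;
    }
    ```

    Compile with -lcrypto. Because the hash is taken over the uncompressed bytes, it is identical across machines regardless of which gzip implementation or settings each one uses.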
  3. No, not unless you are 100% certain you are using exactly the same version of the same source code with the same settings, and that you have disabled the modified timestamp in the gzip header (see the sketch below).

    It’s not clear what you’re doing with these compressed chunks, but if the idea is to have less to transmit and compare, then you can do far better with a hash. Use SHA-256 on your uncompressed chunks, and then you can transmit and compare those in no time. The probability of an accidental match is so infinitesimally small that you’d have to wait for all the stars to go out to see one such occurrence.
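    If you do still need byte-identical gzip output, here is a sketch of pinning the gzip header fields with zlib (assuming zlib 1.2 or later, which provides deflateSetHeader(); the level, window size, memory level and strategy are also fixed explicitly so no build-specific default can leak in):

    ```c
    #include <string.h>
    #include <zlib.h>

    /* Sketch: gzip-compress a buffer deterministically with zlib by
     * pinning every parameter and zeroing the header fields that can
     * vary (MTIME and the OS byte). Assumes the same zlib version on
     * every host; different versions may still emit different bytes. */
    int gzip_deterministic(const unsigned char *in, unsigned in_len,
                           unsigned char *out, unsigned long *out_len)
    {
        z_stream strm;
        gz_header head;
        memset(&strm, 0, sizeof strm);
        memset(&head, 0, sizeof head);

        /* windowBits 15 + 16 selects the gzip wrapper. */
        if (deflateInit2(&strm, 6, Z_DEFLATED, 15 + 16, 8,
                         Z_DEFAULT_STRATEGY) != Z_OK)
            return -1;

        head.time = 0;  /* MTIME = 0 instead of a wall-clock timestamp */
        head.os   = 3;  /* always claim "Unix", whatever the host is   */
        deflateSetHeader(&strm, &head);

        strm.next_in   = (Bytef *)in;
        strm.avail_in  = in_len;
        strm.next_out  = out;
        strm.avail_out = (uInt)*out_len;

        if (deflate(&strm, Z_FINISH) != Z_STREAM_END) {
            deflateEnd(&strm);
            return -1;
        }
        *out_len = strm.total_out;
        return deflateEnd(&strm) == Z_OK ? 0 : -1;
    }
    ```

    Even with all of this pinned, treat byte-identical output as fragile across zlib versions; the hash-based approaches above are the robust way to detect duplicates.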
