I am aware that some others have asked nearly the same question as this, but those discussions don’t have exactly the same scenario, nor provide the answer that I am looking for — so here we go:
We move files to Azure Blob Storage and compute the MD5 of each file before sending. To verify that the file arrived in Azure unchanged, we want to compare the original MD5 with the MD5 calculated by Azure. I am doing this in Python, and I get close, yet not quite there.
Three examples of MD5s, which I calculated before sending the files to Azure:
ORIGINAL MD5 1: 85e94a2f598c05f844976fe55b0ccfd7
ORIGINAL MD5 2: f545d96c459da41b4fcdf93d25d40612
ORIGINAL MD5 3: 92f9f9c3a9728869483ff3d4db9a3606
When I read the MD5 field in Azure Blob Storage with Python, the values look promisingly similar, but on closer inspection there are always some differences:
print(blob.content_settings.content_md5)
AZURE CONTENT_MD5 1 : bytearray(b'\x85\xe9J/Y\x8c\x05\xf8D\x97o\xe5[\x0c\xcf\xd7')
AZURE CONTENT_MD5 2 : bytearray(b'\xf5E\xd9lE\x9d\xa4\x1bO\xcd\xf9=%\xd4\x06\x12')
AZURE CONTENT_MD5 3 : bytearray(b'\x92\xf9\xf9\xc3\xa9r\x88iH?\xf3\xd4\xdb\x9a6\x06')
The differences become easier to see if I clean up the value a bit:
print(str(base64.b64decode(
base64.b64encode(blob.content_settings.content_md5)
.decode())).replace("'", "").replace("\\x", "")[1:100])
PRINT OUTPUT 1: 85e9J/Y8c05f8D97oe5[0ccfd7
PRINT OUTPUT 2: f5Ed9lE9da41bOcdf9=%d40612
PRINT OUTPUT 3: 92f9f9c3a9r88iH?f3d4db9a606
These print outputs look very similar to the original MD5s, but they are 5 or 6 characters shorter and differ in a few places.
With some manual work I can make the values match perfectly, however:
print(base64.b64encode(blob.content_settings.content_md5))
ENCODED 1: b'helKL1mMBfhEl2/lWwzP1w=='
ENCODED 2: b'9UXZbEWdpBtPzfk9JdQGEg=='
ENCODED 3: b'kvn5w6lyiGlIP/PU25o2Bg=='
If I take the values inside the ‘quotes’ above and manually paste them into the left column of this web page (option: Base64 RFC 3548, RFC 4648), the page gives output that perfectly matches my original MD5s:
https://cryptii.com/pipes/base64-to-hex
WEB PAGE OUTPUT 1: 85 e9 4a 2f 59 8c 05 f8 44 97 6f e5 5b 0c cf d7
WEB PAGE OUTPUT 2: f5 45 d9 6c 45 9d a4 1b 4f cd f9 3d 25 d4 06 12
WEB PAGE OUTPUT 3: 92 f9 f9 c3 a9 72 88 69 48 3f f3 d4 db 9a 36 06
When spaces are removed, these are 100% matches to the values ORIGINAL MD5 1 – 3. So the MD5 value given by Azure can clearly be converted to a perfect match with my original MD5 values. But how to do this conversion in Python (or any conversion after which the two values demonstrably match) remains a mystery to me.
2 Answers
Thanks to the suggestions given by Venkatesan, I found this code, which produces MD5 values that exactly match my original MD5s:
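One way to get there, as a minimal sketch and not necessarily the exact code referred to above: it assumes `content_md5` holds the raw 16-byte digest (which the `bytearray` output in the question suggests), so the hex string is just those bytes rendered in hexadecimal. The helper name `md5_bytes_to_hex` is made up for illustration, and the sample values are copied from the question.

```python
import base64


def md5_bytes_to_hex(content_md5):
    """Render a raw MD5 digest (bytes or bytearray) as a lowercase hex string.

    Hypothetical helper; assumes content_md5 is the raw 16-byte digest.
    """
    return bytes(content_md5).hex()


# Sample value 1 from the question, as read from blob.content_settings.content_md5
azure_md5 = bytearray(b'\x85\xe9J/Y\x8c\x05\xf8D\x97o\xe5[\x0c\xcf\xd7')
print(md5_bytes_to_hex(azure_md5))  # 85e94a2f598c05f844976fe55b0ccfd7

# The same conversion starting from the base64 text (what the web page does)
print(base64.b64decode("helKL1mMBfhEl2/lWwzP1w==").hex())
```

Printable bytes such as 0x4a show up as `J` in the `bytearray` representation, which is why stripping the escape prefixes by hand drops characters; converting the bytes to hex avoids that.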
Initially, I got the same format when reading the md5 field in Azure blob storage using Python. You can use the below Python code to read the md5 with an exact match.
Code:
Output:
Update: