How to decode Azure blob MD5 in Python: I get so near, yet not quite there

Jomitt
June 14, 2023

I am aware that others have asked nearly the same question, but those discussions don't cover exactly the same scenario, nor do they provide the answer I am looking for, so here we go:

We move files to Azure Blob Storage and compute the MD5 of each file before sending. To verify that a file landed in Azure unchanged, we want to compare the original MD5 with the MD5 calculated by Azure. I am doing this in Python, and I get so close, yet not quite there.

Three examples of MD5s, which I computed before sending the files to Azure:

ORIGINAL MD5 1: 85e94a2f598c05f844976fe55b0ccfd7
ORIGINAL MD5 2: f545d96c459da41b4fcdf93d25d40612
ORIGINAL MD5 3: 92f9f9c3a9728869483ff3d4db9a3606

When I read the MD5 field from Azure Blob Storage with Python, the values look promisingly similar, but on closer inspection there are always some differences:
print(blob.content_settings.content_md5)

AZURE CONTENT_MD5 1 : bytearray(b'\x85\xe9J/Y\x8c\x05\xf8D\x97o\xe5[\x0c\xcf\xd7')
AZURE CONTENT_MD5 2 : bytearray(b'\xf5E\xd9lE\x9d\xa4\x1bO\xcd\xf9=%\xd4\x06\x12')
AZURE CONTENT_MD5 3 : bytearray(b'\x92\xf9\xf9\xc3\xa9r\x88iH?\xf3\xd4\xdb\x9a6\x06')

The differences become easier to see if I clean up the value a bit:

print(str(base64.b64decode(
    base64.b64encode(blob.content_settings.content_md5)
    .decode())).replace("'", "").replace("\\x", "")[1:100])

PRINT OUTPUT 1: 85e9J/Y8c05f8D97oe5[0ccfd7
PRINT OUTPUT 2: f5Ed9lE9da41bOcdf9=%d40612
PRINT OUTPUT 3: 92f9f9c3a9r88iH?f3d4db9a606

These outputs look very similar to the original MD5s, but they are 5 to 6 characters shorter and differ in a few places.

With some manual work I can make the values match perfectly, however:
print(base64.b64encode(blob.content_settings.content_md5))

ENCODED 1: b'helKL1mMBfhEl2/lWwzP1w=='
ENCODED 2: b'9UXZbEWdpBtPzfk9JdQGEg=='
ENCODED 3: b'kvn5w6lyiGlIP/PU25o2Bg=='

If I take the values inside the quotes above and paste them manually into the left column of this web page (option: Base64 RFC 3548, RFC 4648), the page gives output that perfectly matches my original MD5s:
https://cryptii.com/pipes/base64-to-hex

WEB PAGE OUTPUT 1: 85 e9 4a 2f 59 8c 05 f8 44 97 6f e5 5b 0c cf d7
WEB PAGE OUTPUT 2: f5 45 d9 6c 45 9d a4 1b 4f cd f9 3d 25 d4 06 12
WEB PAGE OUTPUT 3: 92 f9 f9 c3 a9 72 88 69 48 3f f3 d4 db 9a 36 06

With the spaces removed, these are 100% matches to ORIGINAL MD5 1 to 3. So the MD5 value given by Azure can clearly be converted into a perfect match with my original MD5 values. But how to do this conversion in Python (or any conversion after which the two values provably match) is the mystery that still haunts me.

Answers

Chosen as BEST ANSWER
- Jomitt
- June 7, 2023 at 12:52 pm
Thanks to the suggestions given by Venkatesan, I found this code, which produces MD5 values identical to my original MD5s:
```
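import binascii

# content_md5 holds the raw 16-byte digest; hexlify renders it as 32 hex characters
blobmd5 = bytearray(blob.content_settings.content_md5)
md5_hex = binascii.hexlify(blobmd5).decode('utf-8')  # avoid shadowing the built-in hex()
print(md5_hex)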
```
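As a quick sanity check, the conversion can be tried without any Azure connection, using the Base64 strings and original MD5s posted in the question. This is only a sketch: `md5_bytes_to_hex` is a hypothetical helper name, and the decoded bytearray merely stands in for `blob.content_settings.content_md5`.

```python
import base64
import binascii

# Base64 form of content_md5 -> locally computed hex MD5 (both copied from the question)
samples = {
    "helKL1mMBfhEl2/lWwzP1w==": "85e94a2f598c05f844976fe55b0ccfd7",
    "9UXZbEWdpBtPzfk9JdQGEg==": "f545d96c459da41b4fcdf93d25d40612",
    "kvn5w6lyiGlIP/PU25o2Bg==": "92f9f9c3a9728869483ff3d4db9a3606",
}


def md5_bytes_to_hex(raw):
    """Render a raw MD5 digest (bytes or bytearray) as lowercase hex."""
    return binascii.hexlify(bytes(raw)).decode("ascii")


results = []
for encoded, expected_hex in samples.items():
    # Stand-in for blob.content_settings.content_md5 (an assumption for this demo)
    raw = bytearray(base64.b64decode(encoded))
    converted = md5_bytes_to_hex(raw)
    # bytearray.hex() is an equivalent shortcut to hexlify + decode
    results.append(converted == expected_hex and raw.hex() == expected_hex)
    print(converted, "matches" if results[-1] else "differs")
```

The same comparison should work in the other direction too: `bytes.fromhex(original_md5)` turns the locally computed hex string into raw bytes that can be compared with the value from Azure directly.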


- Venkatesan
- June 7, 2023 at 10:33 am
Initially, I got the same format when reading the md5 field in Azure blob storage using Python.

Blob name: abcd.vhd
bytearray(b'\xf3\x97\xbd\xe1\xad\x9f\xe7\xdf\x1c\xd3\x97\xfc\xe3\x8f{\xe9\xf7\xb9\xe5\xbd\x1cq\xf7{')
Blob name: student1.json
bytearray(b'\x7f\x9e9w\xde\x9c\xe3\x9f]k\x8d[\xe1\xf7\x1d\x7f\xdd\xdd\xdb\x97x\xd3\xadv')
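
As far as I can tell, this format is just how Python displays raw bytes: bytes that happen to be printable ASCII are shown as characters, and everything else as `\xNN` escapes, which is why the value looks almost like the hex digest but shorter. A small sketch, using ORIGINAL MD5 1 from the question as sample data (no Azure call involved):

```python
# Sample digest taken from the question (ORIGINAL MD5 1)
raw = bytearray.fromhex("85e94a2f598c05f844976fe55b0ccfd7")

# repr() shows printable bytes as ASCII characters and the rest as \xNN escapes
print(raw)

# The same 16 bytes, each rendered as two hex digits
spaced = " ".join(f"{b:02x}" for b in raw)
print(spaced)

# Bytes that repr() printed as plain characters: 0x4a is "J", 0x2f is "/", 0x59 is "Y"
print([hex(b) for b in b"J/Y"])
```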

But how to make this conversion in Python (or any conversion where the two values provenly match)?

You can use the Python code below to read the MD5 in a form that matches exactly.

Code:
```
from azure.storage.blob import BlobServiceClient
import base64

connection_string = 'Your connection string'
blob_service_client = BlobServiceClient.from_connection_string(connection_string)


container_name = 'test1'
container_client = blob_service_client.get_container_client(container_name)

blobs = container_client.list_blobs()

for blob in blobs:
    blob_client = container_client.get_blob_client(blob.name)
    content_settings = blob_client.get_blob_properties().content_settings
    print('Blob name:', blob.name)
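    # content_md5 is the raw 16-byte digest, not text
    blobmd5 = bytearray(content_settings.content_md5)
    # Base64 text, the form used by the Content-MD5 header
    encoded = base64.b64encode(blobmd5).decode('utf-8')
    # Decode the Base64 text back to bytes, then render as hex
    md5_hex = base64.b64decode(encoded).hex()
    print(md5_hex)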
```
Output:
```
Blob name: abcd.vhd
85e94a2f598c05f844976fe55b0ccfd7
Blob name: student1.json
f545d96c459da41b4fcdf93d25d40612
```
Update:
```
import binascii

blobmd5 = bytearray(content_settings.content_md5)
md5_hex = binascii.hexlify(blobmd5).decode('utf-8')  # avoid shadowing the built-in hex()
print(md5_hex)
```
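
For the actual verification step, it may be simpler to compare raw bytes and skip the hex conversion altogether. Below is a minimal sketch under a few assumptions: `file_md5` and `md5_matches` are hypothetical helper names, the "remote" value is simulated rather than read from Azure, and `content_md5` is assumed to possibly be `None` when no MD5 is stored for a blob.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the raw 16-byte MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def md5_matches(local_digest, remote_md5):
    """Compare a local digest with a value such as content_settings.content_md5.

    A missing remote value (None) is treated as a mismatch here.
    """
    if remote_md5 is None:
        return False
    return bytes(remote_md5) == bytes(local_digest)


# Demo with a temporary file instead of a real upload
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    tmp_path = tmp.name

try:
    local = file_md5(tmp_path)
    simulated_remote = bytearray(local)  # stands in for content_settings.content_md5
    print(local.hex())
    print(md5_matches(local, simulated_remote))
    print(md5_matches(local, None))
finally:
    os.remove(tmp_path)
```

In the loop above, `simulated_remote` would be replaced by `content_settings.content_md5`, and `tmp_path` by the path of the file that was uploaded.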