
I am aware that some others have asked nearly the same question as this, but those discussions don’t have exactly the same scenario, nor provide the answer that I am looking for — so here we go:

We move files to Azure Blob Storage and compute the MD5 of each file before sending it. To verify that a file arrived in Azure unchanged, we want to compare the original MD5 with the MD5 calculated by Azure. I am doing this in Python, and I get close, but not quite there.

Three examples of MD5s that I computed before sending the files to Azure:

ORIGINAL MD5 1: 85e94a2f598c05f844976fe55b0ccfd7
ORIGINAL MD5 2: f545d96c459da41b4fcdf93d25d40612
ORIGINAL MD5 3: 92f9f9c3a9728869483ff3d4db9a3606

When I read the MD5 field of a blob in Azure Blob Storage with Python, the values look promisingly similar, but a closer look always shows some differences:
print(blob.content_settings.content_md5)

AZURE CONTENT_MD5 1 : bytearray(b'\x85\xe9J/Y\x8c\x05\xf8D\x97o\xe5[\x0c\xcf\xd7')
AZURE CONTENT_MD5 2 : bytearray(b'\xf5E\xd9lE\x9d\xa4\x1bO\xcd\xf9=%\xd4\x06\x12')
AZURE CONTENT_MD5 3 : bytearray(b'\x92\xf9\xf9\xc3\xa9r\x88iH?\xf3\xd4\xdb\x9a6\x06')

The differences become easier to see, if I clean up the value a bit:

print(str(base64.b64decode(
base64.b64encode(blob.content_settings.content_md5)
.decode())).replace("'", "").replace("\\x", "")[1:100])

PRINT OUTPUT 1: 85e9J/Y8c05f8D97oe5[0ccfd7
PRINT OUTPUT 2: f5Ed9lE9da41bOcdf9=%d40612
PRINT OUTPUT 3: 92f9f9c3a9r88iH?f3d4db9a606

These print outputs look very similar to the original MD5s, but they are 5 to 6 characters shorter and differ in a few places.

With some manual work I can make the values match perfectly, however:
print(base64.b64encode(blob.content_settings.content_md5))

ENCODED 1: b'helKL1mMBfhEl2/lWwzP1w=='
ENCODED 2: b'9UXZbEWdpBtPzfk9JdQGEg=='
ENCODED 3: b'kvn5w6lyiGlIP/PU25o2Bg=='

If I use the value inside the ‘quotes’ above, and manually copy them in the left column of this web page (option: Base64 RFC 3548, RFC 4648), that web page gives output that perfectly matches my original MD5s:
https://cryptii.com/pipes/base64-to-hex

WEB PAGE OUTPUT 1: 85 e9 4a 2f 59 8c 05 f8 44 97 6f e5 5b 0c cf d7
WEB PAGE OUTPUT 2: f5 45 d9 6c 45 9d a4 1b 4f cd f9 3d 25 d4 06 12
WEB PAGE OUTPUT 3: 92 f9 f9 c3 a9 72 88 69 48 3f f3 d4 db 9a 36 06

When the spaces are removed, these match ORIGINAL MD5 1 to 3 exactly. So the MD5 value given by Azure can clearly be converted into a perfect match with my original MD5 values. But how to do this conversion in Python (or any conversion after which the two values provably match) is the mystery that still haunts me.

2 Answers


  1. Chosen as BEST ANSWER

    Thanks to the suggestions given by Venkatesan, I found this code, which produces exactly the same MD5 values as my original MD5s:

     import binascii

     # content_md5 holds the raw digest bytes; hexlify renders them as hex text
     blobmd5 = bytearray(blob.content_settings.content_md5)
     md5_hex = binascii.hexlify(blobmd5).decode('utf-8')
     print(md5_hex)
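
As a rough illustration of why this works (a sketch, not part of the original answer): `content_md5` appears to hold the raw 16-byte digest, so the hex string, the `bytearray` and the Base64 text are three spellings of the same value. The snippet below uses the first MD5 from the question and builds a local stand-in for `blob.content_settings.content_md5`, so it runs without an Azure account; the variable names are illustrative.

```python
import base64
import binascii

# First MD5 from the question, in the hex form that hexdigest() produces.
original_hex = "85e94a2f598c05f844976fe55b0ccfd7"

# Stand-in (assumption) for blob.content_settings.content_md5: the raw digest.
content_md5 = bytearray(bytes.fromhex(original_hex))

# repr() shows printable ASCII bytes as characters (0x4a is shown as 'J'),
# which is why the printed value looks shorter than 32 hex digits.
print(repr(content_md5))

# Raw bytes -> hex text, two equivalent spellings.
azure_hex = binascii.hexlify(content_md5).decode("ascii")
assert azure_hex == bytes(content_md5).hex() == original_hex

# Raw bytes -> Base64 text, the form the question pasted into the web page.
azure_b64 = base64.b64encode(content_md5).decode("ascii")
print(azure_b64)

# Base64 text -> hex, the conversion the web page performed.
assert base64.b64decode(azure_b64).hex() == original_hex
```

Comparing the raw bytes or the hex strings is equivalent; hex is simply easier to read in logs.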
    

  2. Initially, I got the same format when reading the md5 field in Azure blob storage using Python.

    Blob name: abcd.vhd
    bytearray(b'\xf3\x97\xbd\xe1\xad\x9f\xe7\xdf\x1c\xd3\x97\xfc\xe3\x8f{\xe9\xf7\xb9\xe5\xbd\x1cq\xf7{')
    Blob name: student1.json
    bytearray(b'\x7f\x9e9w\xde\x9c\xe3\x9f]k\x8d[\xe1\xf7\x1d\x7f\xdd\xdd\xdb\x97x\xd3\xadv')


    But how to make this conversion in Python (or any conversion where the two values provenly match)?

    You can use the Python code below to read the MD5 so that it matches exactly.

    Code:

    from azure.storage.blob import BlobServiceClient

    connection_string = 'Your connection string'
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    container_name = 'test1'
    container_client = blob_service_client.get_container_client(container_name)

    for blob in container_client.list_blobs():
        blob_client = container_client.get_blob_client(blob.name)
        content_settings = blob_client.get_blob_properties().content_settings
        print('Blob name:', blob.name)
        if content_settings.content_md5 is None:
            # No Content-MD5 is stored for this blob
            continue
        # content_md5 is the raw MD5 digest; render it as a hex string.
        # (Base64-encoding it and passing the result to bytes.fromhex()
        # raises ValueError for a raw digest, because Base64 text is not hex.)
        md5_hex = bytes(content_settings.content_md5).hex()
        print(md5_hex)
    

    Output:

    Blob name: abcd.vhd
    85e94a2f598c05f844976fe55b0ccfd7
    Blob name: student1.json
    f545d96c459da41b4fcdf93d25d40612
    


    Update:

      import binascii

      # Convert the raw digest bytes straight to a hex string
      blobmd5 = bytearray(content_settings.content_md5)
      md5_hex = binascii.hexlify(blobmd5).decode('utf-8')
      print(md5_hex)
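
To close the loop on the original goal (checking that an uploaded file arrived unchanged), the comparison could look roughly like this. It is a minimal sketch under assumptions: `local_md5` and `md5_matches` are hypothetical helper names, and `content_md5` is assumed to be whatever `blob.content_settings.content_md5` returns (a `bytearray` with the raw digest, or `None` when no MD5 is stored). The demo fakes the remote value so it runs without Azure.

```python
import hashlib
import os
import tempfile


def local_md5(path, chunk_size=1024 * 1024):
    """Return the raw 16-byte MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def md5_matches(path, content_md5):
    """Compare a local file with the digest reported for the blob."""
    if content_md5 is None:
        # Nothing to compare against, so treat it as "not verified".
        return False
    return local_md5(path) == bytes(content_md5)


# Demo without Azure: write a small file and fake the remote digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello blob")
demo_path = tmp.name

fake_remote = bytearray(hashlib.md5(b"hello blob").digest())
matches = md5_matches(demo_path, fake_remote)     # same content
mismatch = md5_matches(demo_path, bytearray(16))  # all-zero digest
os.remove(demo_path)

print(matches, mismatch)
```

Comparing raw bytes avoids any dependence on hex casing or Base64 padding; if the original MD5s are only available as hex strings, `bytes.fromhex(original_hex)` converts them to the same form first.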
    
