
If I encode a string using utf-16be and decode the resulting bytes using utf-8, I don't get any error, and the output even appears to print correctly on the screen, yet I can't parse the decoded string into a Python object with the json module.

import json

str = '{"foo": "bar"}'                          # note: shadows the built-in str
encoded_str = str.encode("utf-16be")            # UTF-16BE bytes
decoded_str = encoded_str.decode('utf-8')       # decoded with a *different* encoding
print(decoded_str)                              # looks fine on screen
print(json.JSONDecoder().decode(decoded_str))   # raises json.JSONDecodeError

I know that an encoded string should be decoded using the same encoding; what I'm trying to understand is why this happens. Specifically:

  1. Why doesn't encoding str with utf-16be and decoding encoded_str with utf-8 result in an error?

  2. Since encoding and decoding don't raise an error and decoded_str looks like valid JSON (as the print statement shows), why does decode(decoded_str) raise an error?

  3. Why does writing the output to a file and viewing the file with the less command show it as a binary file?

    with open("data.txt", 'w') as file:
        file.write(decoded_str)
    

    When using less to view data.txt:

    "data.txt" may be a binary file.  See it anyway?
    
  4. If decoded_str is invalid JSON or something else, how can I view it in its original form? (print() prints it as valid JSON.)

I'm using Python 3.10.12 on Ubuntu 22.04.4 LTS.

2 Answers


  1. Taking each question in turn:

    1. Why doesn't encoding str with utf-16be and decoding encoded_str with utf-8 result in an error?

    Because in this case, the bytes produced by str.encode("utf-16be") happen to also be valid UTF-8. That is always true for pure-ASCII input: each character becomes a 0x00 byte followed by its ASCII byte, and both are valid single-byte UTF-8 sequences. You need characters above U+007F to trigger an error here (e.g. the string str = '{"foo": "！"}', which uses a full-width exclamation mark, U+FF01).
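
    A minimal sketch of that failure mode (the string literal below is just an illustration):

    s = '{"foo": "！"}'        # contains U+FF01, the full-width '!'
    b = s.encode("utf-16be")   # '！' encodes to the bytes ff 01
    try:
        b.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)               # 0xff can never start a valid UTF-8 sequence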

    2. Since encoding and decoding don't raise an error and decoded_str looks like valid JSON (as the print statement shows), why does decode(decoded_str) raise an error?

    Just because you can print a string does not make it valid JSON. Because of the round trip through UTF-16, a bunch of null characters ended up in the string. For example, f in UTF-16BE is 0x0066; decoding those two bytes as UTF-8 yields two characters, the null character U+0000 followed by f. The JSON grammar does not allow raw null characters, neither as whitespace between tokens nor unescaped inside strings, and that is why decode(decoded_str) fails.
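
    As a quick sketch, feeding the mismatched string straight to the json module reproduces the failure:

    import json

    decoded_str = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")
    try:
        json.loads(decoded_str)
    except json.JSONDecodeError as e:
        print(e)               # Expecting value: line 1 column 1 (char 0) -- the leading NUL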

    3. Why does writing the output to a file and viewing the file with the less command show it as a binary file?

    Probably those null bytes again. With that many embedded NULs, less flags the file as possibly binary, since null bytes are very uncommon in text files (and Linux strongly favours UTF-8 over UTF-16).
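
    A sketch that checks what actually lands in the file (the file name is only illustrative):

    decoded_str = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")
    with open("data.txt", "w", encoding="utf-8") as f:
        f.write(decoded_str)
    with open("data.txt", "rb") as f:
        raw = f.read()
    print(raw.count(b"\x00"))  # 14 -- one NUL byte per original character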

    4. If decoded_str is invalid JSON or something else, how can I view it in its original form? (print() prints it as valid JSON.)

    Too many possible answers here; it really depends on the actual use case. The quickest fix is simply not to encode and decode with different encodings. The next quickest is to reverse the encode/decode process, though this is not lossless for all strings or encoding combinations, in particular around the surrogate range when dealing with a UTF-16 + UTF-8 mix-up.
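
    A sketch of that reversal for the string in the question (it works here because every character survives the round trip):

    import json

    decoded_str = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")
    original = decoded_str.encode("utf-8").decode("utf-16be")
    print(original)              # {"foo": "bar"}
    print(json.loads(original))  # {'foo': 'bar'}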

  2. Print the resulting encoding and you will see the issue:

    import json
    
    str = '{"foo": "bar"}'
    encoded_str = str.encode("utf-16be")
    print(encoded_str)
    print(encoded_str.hex(' '))
    decoded_str = encoded_str.decode('utf-8')
    print(decoded_str)
    

    Output:

    b'\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'
    00 7b 00 22 00 66 00 6f 00 6f 00 22 00 3a 00 20 00 22 00 62 00 61 00 72 00 22 00 7d
     { " f o o " :   " b a r " }
    

    Note that on my terminal, U+0000 (NUL) prints as a space, so you can see that something is off in the decoded string. From the OP's description, their terminal doesn't print anything for nulls, hence the JSON string probably still looked like {"foo": "bar"}.
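    A terminal-independent way to make the nulls visible is repr(), as a quick sketch:

    decoded_str = '{"foo": "bar"}'.encode("utf-16be").decode("utf-8")
    print(repr(decoded_str))
    # '\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'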

    UTF-16BE produces a lot of null bytes when encoding ASCII text. Since your original string was pure ASCII, all of the resulting bytes, including the nulls, are valid ASCII. UTF-8 is a superset of ASCII, so the bytes decode without error, but the decoded string keeps all those null characters. Null characters are not allowed in JSON, hence the error when decoding it as JSON; null bytes are also a common heuristic for detecting binary files.
