If I encode a string using utf-16be and decode the encoded bytes using utf-8, I don't get any error, and the output even seems to print correctly on the screen, but I'm still not able to convert the decoded string into a Python object using the json module.
```python
import json

str = '{"foo": "bar"}'
encoded_str = str.encode("utf-16be")
decoded_str = encoded_str.decode('utf-8')
print(decoded_str)
print(json.JSONDecoder().decode(decoded_str))
```
I know that an encoded string should be decoded using the same encoding, but this behaviour is what I'm trying to understand. I want to know:

- Why does encoding `str` with utf-16be and then decoding `encoded_str` with utf-8 not result in an error?
- As encoding and decoding do not result in an error and `decoded_str` is valid JSON (as can be seen through the print statement), why does `decode(decoded_str)` result in an error?
- Why does writing the output to a file and viewing the file through the `less` command show it as a binary file?

  ```python
  file = open("data.txt", 'w')
  file.write(decoded_str)
  ```

  When using the `less` command to view `data.txt`: `"data.txt" may be a binary file. See it anyway?`
- If `decoded_str` is invalid JSON or something else, how can I view it in its original form? (`print()` is printing it as valid JSON.)

I'm using Python 3.10.12 on Ubuntu 22.04.4 LTS.
2 Answers
Because in this case, the resulting bytes of `str.encode("utf-16be")` are also valid UTF-8. This is in fact always the case with ASCII characters; you really need to go above U+007F to trigger possible errors here (e.g. use the string `str = '{"foo": "！"}'`, which uses a full-width exclamation mark, U+FF01).

Just because you can print a string does not make it valid JSON. In particular, because of the encoding to UTF-16, a bunch of null bytes got added. For example, `f` in UTF-16BE is `0x0066`. Those bytes, when re-encoded in UTF-8, actually constitute two characters: `f` and the null character `0x00`. Based on my reading of the JSON spec, null characters are not allowed, and that is why `decode(decoded_str)` fails.

Probably those null bytes again. With a lot of null bytes, `less` is probably flagging that it might be a binary file, as this is relatively uncommon in UTF-8 (and Linux much prefers UTF-8 over UTF-16).

Too many possible answers here; it really depends on what the real use case is. The quickest fix is: just don't encode/decode with different encodings. The next quickest is to reverse the encode/decode process, though this is not lossless for all strings or encoding possibilities, in particular the surrogate range when dealing with a UTF-16 + UTF-8 mix-up.
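To make the first point concrete, here is a minimal sketch (my own example, not from the answer): a character above U+007F produces a byte sequence that is not valid UTF-8, so the mismatched decode fails immediately.

```python
# U+FF01 (full-width '!') encodes to b'\xff\x01' in UTF-16BE,
# and 0xff is never a valid start byte in UTF-8.
s = '{"foo": "\uff01"}'
encoded = s.encode("utf-16be")
print(b'\xff\x01' in encoded)  # True

try:
    encoded.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # invalid start byte, at the full-width '!'
```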
Print the repr of the resulting string and you will see the issue.
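For instance, a sketch along the same lines (variable names are mine):

```python
s = '{"foo": "bar"}'
decoded = s.encode("utf-16be").decode("utf-8")
print(repr(decoded))
# '\x00{\x00"\x00f\x00o\x00o\x00"\x00:\x00 \x00"\x00b\x00a\x00r\x00"\x00}'
```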
Note that on my terminal, U+0000 (NUL) prints as a space, so you can see that there is something off in the decoded string. From the OP's description, their terminal doesn't print anything for NULs, hence the JSON string probably still looked like `{"foo": "bar"}`.

UTF-16BE produces a lot of null bytes when encoding ASCII. Since your original string was ASCII, all the bytes, including the nulls, are valid ASCII. UTF-8 is a superset of ASCII, so it decodes correctly, but the result includes all those null bytes. Null bytes are not allowed in JSON, hence the error when decoding it as JSON. Null bytes are also used to detect a binary file.
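The null-byte failure and the reverse-the-steps fix can be sketched like this (my own snippet, assuming the all-ASCII case, where the reversal is lossless):

```python
import json

s = '{"foo": "bar"}'
decoded = s.encode("utf-16be").decode("utf-8")  # valid UTF-8, but full of NULs

try:
    json.loads(decoded)
except json.JSONDecodeError as e:
    print(e)  # the leading NUL already makes it invalid JSON

# Re-encode with the codec used for decoding, then decode with the
# codec used for encoding, to undo the mix-up:
recovered = decoded.encode("utf-8").decode("utf-16be")
print(json.loads(recovered))  # {'foo': 'bar'}
```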