I use Python 3.9.1 and Linux (CentOS 7). I want to print unicode characters to the console. I want to do everything in UTF-8. If I open the python interactive console and write:
print("├")
all goes well and it prints:
├
Now I put the same line
print("├")
in a file, then save the file with UTF-8 encoding (the default on Linux).
I then get the following error:
UnicodeEncodeError: 'latin-1' codec can't encode character '\u251c' in position 0: ordinal not in range(256)
Where does that "latin-1" come from?
I also tried to force UTF-8 in the first line (which should be the default anyway in Python 3)
# coding: utf8
but it does not change anything.
More info on what does work and what doesn’t:
s = "├"
#print(s) # FAIL
s2 = s.encode('utf8')
print(s2) # prints b'\xe2\x94\x9c'
print(s2.decode('latin-1')) # prints the right thing
What is happening here? Can I get the same behavior in the script as in the interactive console?
2 Answers
The reason was that my LANG environment variable was set to en_US, whereas it should have been en_US.UTF-8.
Another way to solve the problem is to set PYTHONIOENCODING to UTF-8 (it was empty for me).
I still don't fully understand why Python is confused by this only for non-interactive scripts, though...
More details: https://simulrpi.readthedocs.io/en/latest/display_problems.html
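To see which of these settings is actually in effect, a small diagnostic can be run both interactively and as a script. This is only a sketch: the helper name describe_io_encoding is made up for illustration, and the set of variables it reports (LANG, LC_ALL, PYTHONIOENCODING) is my assumption about what is worth checking.

```python
import locale
import os
import sys


def describe_io_encoding():
    """Collect the settings that influence how print() encodes text.

    Hypothetical helper for illustration; not part of any library.
    """
    return {
        # Encoding Python will use when writing str objects to stdout.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale (LANG / LC_ALL / LC_CTYPE).
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


if __name__ == "__main__":
    for name, value in describe_io_encoding().items():
        print(f"{name}: {value!r}")
```

If the script and the interactive console report different values for stdout_encoding, the difference in the environment variables listed next to it is the first place to look.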
s = "├" (in your UTF-8 encoded source file) assigns the character '\u251c' to the first position of s, a Python str (a sequence of Unicode code points, not yet encoded as bytes).

print(s) fails because print here tries to send the bytes representing s to standard output, which expects latin-1 encoding. Effectively, something like s.encode('latin-1') fails, as the first character in the string has no latin-1 encoding. If you literally run that statement (s.encode('latin-1')) instead, you’ll find that it causes the same error.

s2 = s.encode('utf8') works just fine: it tells Python to explicitly encode the contents of s into a sequence of bytes. s2 now holds the byte encoding of s, using the UTF-8 encoding. (Perhaps ‘b’ would have been a better variable name; it’s not a string, after all.)

print(s2) does indeed print b'\xe2\x94\x9c', since it simply prints the Python representation of a byte sequence. It’s not a string, so you get the representation of the value printed. As it should be, it’s the literal you could have used to define s2, i.e. writing s2 = b'\xe2\x94\x9c' wouldn’t change anything.

print(s2.decode('latin-1')) printing the right thing is a bit of a mystery at first. s2 is the correct UTF-8 byte sequence for the U+251C character (https://www.fileformat.info/info/unicode/char/251c/index.htm). Apparently your Python takes the result of s2.decode('latin-1'), encodes it as a latin-1 byte sequence again (which gives back the same three bytes), and writes it to the output stream, where it renders correctly for you because the terminal interprets those bytes as UTF-8. Since Python does the same latin-1 encoding step for the earlier print statements, that explains why those don’t display correctly (or not at all).
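The behaviour described above can be reproduced without touching the real terminal. The following is a rough sketch, assuming an in-memory stream wrapped with io.TextIOWrapper is a fair stand-in for a stdout configured as latin-1; the helper name bytes_printed is hypothetical.

```python
import io


def bytes_printed(text, encoding):
    """Return the bytes print() would write to a stream using `encoding`.

    Hypothetical helper: an in-memory buffer stands in for the terminal.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    print(text, end="", file=stream)
    stream.flush()
    return raw.getvalue()


s = "\u251c"            # the box-drawing character
s2 = s.encode("utf-8")  # its UTF-8 byte sequence

# Printing the str to a latin-1 stream raises, just like in the question.
try:
    bytes_printed(s, "latin-1")
except UnicodeEncodeError as exc:
    print("print(s) fails:", exc.reason)

# Decoding the UTF-8 bytes as latin-1 and printing them to a latin-1
# stream sends the original UTF-8 bytes through unchanged.
round_trip = bytes_printed(s2.decode("latin-1"), "latin-1")
print("same bytes as UTF-8:", round_trip == s2)

# A UTF-8 stream produces those bytes directly from the str.
print("utf-8 stream matches:", bytes_printed(s, "utf-8") == s2)
```

A terminal that decodes its input as UTF-8 cannot tell the last two cases apart, which is why the latin-1 round trip appears to work.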
The solution would be to tell Python explicitly to override the encoding for standard out as UTF-8, so you can print the string without Python trying to encode it as a latin-1 byte sequence (which will fail). As documented here https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING you can do that by setting PYTHONIOENCODING to UTF-8: SET PYTHONIOENCODING=UTF-8 in a Windows command prompt, or export PYTHONIOENCODING=UTF-8 in a Linux shell such as yours. Conversely, if you want to replicate the problem in the interactive environment, you can probably get that behaviour with PYTHONLEGACYWINDOWSSTDIO (note that this variable only has an effect on Windows).

Where and when to set this depends on your system environment. Do other applications rely on older scripts or other versions of Python not doing this? If not, you can consider setting a global system environment variable. Alternatively, you can set it just before executing a script, e.g. in a batch file or shell script that runs it.
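To check the effect of the environment variable before changing anything system-wide, one option is to start a child interpreter with a modified environment and ask it what encoding its stdout uses. This is a sketch under assumptions: the helper name child_stdout_encoding is invented here, and the child is launched with sys.executable so that it uses the same Python as the parent.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Report sys.stdout.encoding of a child Python process.

    Hypothetical helper. `io_encoding` is placed in PYTHONIOENCODING
    for the child only; None leaves the variable unset.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("unset:  ", child_stdout_encoding())
    print("latin-1:", child_stdout_encoding("latin-1"))
    print("UTF-8:  ", child_stdout_encoding("UTF-8"))
```

Because the variable is set only in the child's environment, this has no effect on other applications, which matches the "set it just before executing a script" approach.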