
I use Python 3.9.1 on Linux (CentOS 7). I want to print Unicode characters to the console, and I want to do everything in UTF-8. If I open the Python interactive console and write:

print("├")

all goes well and it prints:

├

Now I put the same line, print("├"), in a file, save the file with UTF-8 encoding (the default on Linux), and run it.
I then get the following error:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u251c' in position 0: ordinal not in range(256)

Where does that "latin-1" come from?

I also tried to force UTF-8 in the first line (which should be the default anyway in Python 3):

# coding: utf8

but it does not change anything.

More info on what does work and what doesn’t:

s = "├"
#print(s) # FAIL
s2 = s.encode('utf8')
print(s2) # prints b'\xe2\x94\x9c'
print(s2.decode('latin-1')) # prints the right thing

What is happening here? Can I get the same behavior in the script as in the interactive console?

2 Answers


  1. Chosen as BEST ANSWER

    The reason was that my LANG environment variable was set to en_US, whereas it should have been en_US.UTF-8.

    Another way to solve the problem is to set PYTHONIOENCODING to UTF-8 (it was empty for me).

    I still don't fully understand why Python is confused by this only for non-interactive scripts though...

    More details: https://simulrpi.readthedocs.io/en/latest/display_problems.html
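A quick way to see which encoding Python picked for standard output, and which settings may have influenced it, is to run a small diagnostic both interactively and as a script. This is only a sketch: the helper name describe_encodings is made up for illustration, and the set of variables it reports is an assumption about what is relevant here.

```python
import locale
import os
import sys


def describe_encodings():
    """Collect the settings that commonly decide the stdout encoding."""
    return {
        # Encoding Python will use when print() writes to stdout.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the current locale settings.
        "preferred": locale.getpreferredencoding(False),
        # Environment variables that influence the two values above.
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


for name, value in describe_encodings().items():
    print(f"{name}: {value!r}")
```

If the "stdout" value differs between the interactive console and the script, the two are not running with the same environment.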


  2. s = "├" (in your UTF-8 encoded source file) assigns a one-character string containing U+251C to s. In Python 3 a str is a sequence of Unicode code points; it has no encoding of its own until it is converted to bytes.

    print(s) fails because print here tries to send the bytes representing s to standard output, which expects latin-1 encoding. Effectively, something like s.encode('latin-1') happens and fails, because the character cannot be represented in latin-1.

    If you literally run that statement (s.encode('latin-1')) instead, you’ll find that it causes the same error.
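The failure can be reproduced without touching the real console. The sketch below assumes nothing about your system: it encodes the character directly, then prints it to an in-memory text stream configured for latin-1, which stands in for the misconfigured stdout. The variable names are illustrative.

```python
import io

s = "\u251c"  # the box-drawing character from the question

# Direct encoding fails: U+251C has no latin-1 representation.
try:
    s.encode("latin-1")
    encode_failed = False
    encode_error = ""
except UnicodeEncodeError as exc:
    encode_failed = True
    encode_error = str(exc)

# A stand-in for a latin-1 stdout: a text stream wrapping a byte buffer.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="latin-1")
try:
    print(s, file=fake_stdout)
    fake_stdout.flush()
    print_failed = False
except UnicodeEncodeError:
    print_failed = True

print("encode failed:", encode_failed)
print("print failed:", print_failed)
print(encode_error)
```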

    s2 = s.encode('utf8') works just fine. It tells Python to explicitly encode the contents of s into a sequence of bytes, so s2 now holds the UTF-8 byte encoding of s. (Perhaps 'b' would have been a better variable name; it's not a string, after all.)

    print(s2) does indeed print b'\xe2\x94\x9c', since it simply prints the Python representation of a byte sequence. It's not a string, so you get the representation of the value printed. As it should be, that is the literal you could have used to define s2, i.e. s2 = b'\xe2\x94\x9c' wouldn't change anything.

    print(s2.decode('latin-1')) printing the right thing is a bit of a mystery at first. s2 is the correct UTF-8 byte sequence for the U+251C character (https://www.fileformat.info/info/unicode/char/251c/index.htm).

    Apparently your Python takes the result of s2.decode('latin-1'), encodes it as a latin-1 byte sequence again, and writes those bytes to the output stream. They are the original UTF-8 bytes, so your terminal, which evidently interprets its input as UTF-8, renders the character correctly.

    Python does the same for the earlier print(s) call: it tries to encode the string as latin-1 before writing it, which explains why that call fails instead of displaying the character.
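The round trip described above can be checked in isolation. This sketch only works on bytes in memory and prints nothing non-ASCII; the variable names are illustrative.

```python
s = "\u251c"

utf8_bytes = s.encode("utf-8")            # the three bytes e2 94 9c
mojibake = utf8_bytes.decode("latin-1")   # three unrelated characters
written = mojibake.encode("latin-1")      # what a latin-1 stdout would emit

# latin-1 maps every byte to the code point with the same number,
# so decoding and re-encoding returns the original bytes unchanged.
round_trip_ok = written == utf8_bytes

print(len(mojibake), [hex(ord(ch)) for ch in mojibake], round_trip_ok)
```

The intermediate string is wrong as text (it holds three characters, not one), so this is an explanation of the behaviour rather than a fix to rely on.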

    The solution would be to tell Python explicitly to use UTF-8 as the encoding for standard output, so you can print the string without Python trying to encode it as a latin-1 byte sequence (which will fail).
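One way to override the encoding from inside the script, rather than through the environment, is the reconfigure() method that text streams gained in Python 3.7. The sketch below is an alternative to the environment-variable approach, not something the original post used; force_utf8 is a hypothetical helper, and the demonstration runs against an in-memory stand-in for a latin-1 stdout.

```python
import io


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it supports reconfigure()."""
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False  # e.g. stdout was replaced by a custom object
    reconfigure(encoding="utf-8")
    return True


# Demonstration on a stand-in for a latin-1 stdout.
buffer = io.BytesIO()
stream = io.TextIOWrapper(buffer, encoding="latin-1")
changed = force_utf8(stream)
print("\u251c", file=stream)  # would raise UnicodeEncodeError without the switch
stream.flush()
output_bytes = buffer.getvalue()
```

In a real script the call would be force_utf8(sys.stdout), after import sys, placed before the first print.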

    As documented here https://docs.python.org/3/using/cmdline.html#envvar-PYTHONIOENCODING you can do that by setting the PYTHONIOENCODING environment variable to UTF-8 (for example, export PYTHONIOENCODING=UTF-8 in a Linux shell, or SET PYTHONIOENCODING=UTF-8 on Windows). Conversely, if you want to replicate the problem in the interactive environment, you can probably get that behaviour with PYTHONLEGACYWINDOWSSTDIO, although that variable only has an effect on Windows.

    Where and when to set this depends on your system environment. Do other applications rely on older scripts or other versions of Python not doing this? If not, you can consider setting a global system environment variable. Alternatively, you can set it just before executing a script, e.g. in a shell script or batch file that runs it.
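Setting the variable for a single run, without changing anything globally, can also be done from a launcher script. The sketch below starts a child interpreter with PYTHONIOENCODING set and asks it which encoding it chose for stdout; the one-line child program is just a placeholder for the real script.

```python
import os
import subprocess
import sys

# Placeholder for the real script: report the child's stdout encoding.
child_code = "import sys; print(sys.stdout.encoding)"

# Copy the current environment and override only this one variable.
env = dict(os.environ, PYTHONIOENCODING="UTF-8")

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
reported = result.stdout.strip()
print(reported)
```

Because the override lives in the copied environment, other programs and other Python versions on the system are unaffected.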
