
Before getting into the problem: I have read many Stack Overflow questions and Python bug reports about this issue, but I still cannot find the root cause.

I am getting a UnicodeEncodeError on a CentOS machine. Python is not built on the machine; the virtual environment with the required Python version (3.6.7) is built elsewhere and copied over. To start the server, we activate the virtual environment and then launch it.

The issue is observed in two scenarios:

  1. Logging an input request parameter that contains Unicode characters.
  2. Printing the same Unicode string from code. We pipe print statements to a log file, and the error shows up there.

The error looks as follows:

print("\u6211\u7684\u7535\u8111\u603b\u662f\u51fa\u73b0Windows\u9700\u8981\u6fc0\u6d3b")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 56-63: ordinal not in range(128)

I verified the following in the Python terminal:

  • sys.getdefaultencoding() – utf-8
  • sys.getfilesystemencoding() – utf-8
  • sys.stdout.encoding
  • LANG is set to en_us.utf-8
  • LC_ALL is not set
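
The checks above can be collected into one small script and run the same way the server is started (same virtual environment, same output redirection), since the values seen in an interactive terminal may differ from those in the server process. This is only a minimal sketch; the function name `encoding_report` is illustrative.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the encoding-related settings of the current process."""
    return {
        "default_encoding": sys.getdefaultencoding(),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # May be None if stdout was replaced by an object without .encoding
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }


if __name__ == "__main__":
    # repr() of these values is plain ASCII, so this print is safe
    # even when stdout itself uses the 'ascii' codec.
    for name, value in encoding_report().items():
        print("{}: {!r}".format(name, value))
```

Comparing the output from an interactive run with the output written to the log file should show whether `stdout_encoding` differs between the two.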

Some solutions suggest modifying LC_ALL or adding PYTHONIOENCODING to the environment variables, but this is a production environment and I do not want to change those without knowing the side effects.

Edit:
I opened a Python terminal and printed the same characters that break the code in the attempts above. They print without any issue.
This is what I tried:

import sys
print("日本語")
sys.stdout.write("日本語\n")

But when the same thing runs from our code, it raises UnicodeEncodeError.

How can I resolve this?

Thanks

2 Answers


  1. Chosen as BEST ANSWER

    I finally got rid of this issue as follows.

    I observed the issue described in the question under two different circumstances.

    The first scenario: with all the settings posted in the question (every language-related encoding is UTF-8), it started working after our prod server was restarted, without any other changes. I still don't know why it failed before the restart and worked after it.

    The second scenario: all LC variables are set to POSIX in our client environment. Many solutions suggest changing LANG or LC_ALL to UTF-8, but changing every locale setting may cause problems elsewhere, such as date and time conversion, which is locale-based.

    Fix: I changed only LC_CTYPE to a UTF-8 locale, which in our case is en_US.UTF-8

    export LC_CTYPE="en_US.UTF-8"

    and it worked.
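
If the server is launched from a Python wrapper rather than a shell, the same idea (override only LC_CTYPE and leave every other locale variable alone) might look like the sketch below. The helper name `env_with_utf8_ctype` is hypothetical. This assumes the en_US.UTF-8 locale is installed on the machine and that LC_ALL is not set, since LC_ALL takes precedence over LC_CTYPE.

```python
import os


def env_with_utf8_ctype(base_env=None, ctype="en_US.UTF-8"):
    """Return a copy of the environment with only LC_CTYPE overridden.

    All other variables (LANG, LC_TIME, ...) are passed through unchanged,
    so locale-dependent behaviour such as date formatting is not affected.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LC_CTYPE"] = ctype
    return env


# Example with an explicit base environment instead of os.environ:
example = env_with_utf8_ctype({"LANG": "POSIX", "LC_TIME": "POSIX"})
print(sorted(example.items()))
```

The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when starting the server process.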


  2. Most ASCII terminals cannot render Unicode characters (you could try changing the font; maybe that would work), so even if you get past your encoding error, your
    print will probably look like �������Windows�������

    If you run it in IDLE it would work.

    I would strongly recommend just print(ascii(string_that_might_have_unicode)), as that guarantees an ASCII-printable representation (in Python 3, repr() leaves non-ASCII characters unescaped, so ascii() is the call that escapes them). Nothing is worse than crashing your application because you were trying to print some debug information.
    Printed this way, the string will look something like '\u6211\u7684\u7535\u8111\u603b\u662f\u51fa\u73b0Windows\u9700\u8981\u6fc0\u6d3b'
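
A small sketch of the escaped form in Python 3, using the built-in ascii() on a shortened version of the string from the question. The variable names are illustrative.

```python
text = "\u6211\u7684\u7535\u8111Windows"

# ascii() escapes every non-ASCII character. In Python 3, repr() would keep
# the Chinese characters as they are, so it is not ASCII-safe by itself.
safe = ascii(text)
print(safe)

# The escaped form can always be encoded with the 'ascii' codec,
# which is exactly the step that fails for the raw string.
encoded = safe.encode("ascii")
```

The escaped text is still a valid Python string literal, so the original value can be recovered from a log line if needed.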

    You could also try encoding it manually before printing it:

    print(my_unicode_string.encode("utf8"))

    That avoids the encoding error, though in Python 3 it prints the bytes literal (b'\xe6\x88\x91...') rather than the characters. Really, just print the escaped form unless you are showing the text to the user. Since you talk about a server, I imagine this is not a terminal client application but debug information that is being printed (and redirected to a log file?).

    If you really need to print the exact Unicode text to the terminal instead of the escaped form, then I think you need to do the encode step manually and send UTF-8 bytes to the actual terminal. But it is much easier to always print the escaped form when logging. This has the benefit of showing invisible and whitespace characters, though it is not great if the output is part of a client application.
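
One way to send UTF-8 bytes regardless of the locale is to wrap the underlying binary stream in a text wrapper with an explicit encoding. The sketch below is only an illustration: the helper name `utf8_text_stream` is hypothetical, and it writes to an in-memory buffer so the result can be inspected.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        # Escape anything that still cannot be encoded instead of raising.
        errors="backslashreplace",
        line_buffering=True,
    )


# Demonstration against an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
stream = utf8_text_stream(buffer)
stream.write("日本語\n")
stream.flush()
print(buffer.getvalue())
```

For real output, the same wrapper could be applied to `sys.stdout.buffer` once at startup. On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is an alternative, but it is not available on the 3.6.7 interpreter mentioned in the question.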
