In my code I used something like file = open(path + '/' + filename, 'wb') to write the file,
but in my attempt to support non-ASCII filenames, I now encode the path like this:
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path.iterdir():
    data = open(file.as_posix(), 'rb')
...
I keep getting this error: 'ascii' codec can't encode characters in position...
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb')
but I get: 'utf-8' codec can't encode characters in position...
I also tried file.as_posix().encode('utf-8', 'surrogateescape'). I found that both encode and print just fine, but with open() I still get the error: 'utf-8' codec can't encode characters in position...
How can I open a file with a utf-8 filename?
I’m using Python 3.9 on Ubuntu Linux.
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file, I give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to UTF-8, and it writes fine.
But when I find the file again by crawling the directory, str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt, so I get an error when I try to encode it with any codec.
Currently I’m investigating whether the issue is related to my Linux locale. It was set to POSIX; I changed it to C.UTF-8, but still no luck at the moment.
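One way to see which encodings the Python process itself has picked up, as opposed to what the shell reports, is to ask the interpreter directly. This is a minimal sketch using only standard-library calls; the helper name report_encodings is made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python uses for filenames and locale text."""
    return {
        # Encoding used to convert str paths to bytes for the OS.
        "filesystem": sys.getfilesystemencoding(),
        # Error handler paired with it ('surrogateescape' on POSIX systems).
        "filesystem_errors": sys.getfilesystemencodeerrors(),
        # Preferred text encoding as reported by the locale module.
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

It is worth running this inside the web application process rather than in an interactive shell, since the two do not necessarily share the same locale.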
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
2 Answers
So after being in a rabbit hole for the past few days, I figured out that the issue isn't with Python itself but with the locale that my web framework was using. Debugging this, I saw that the encoding check returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. I then discovered that, since I was running WSGI on Apache2, I had to add the locale to my WSGI daemon configuration as such
WSGIDaemonProcess my_app locale='C.UTF-8'
in the Apache configuration file, thanks to this post.
I don’t understand why you feel you need to recode filepaths.
Linux (Unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There’s no need to break astral characters into surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there’s no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you’re likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
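The surrogate point can be checked directly in the interpreter. The following is a minimal sketch (an illustration only, separate from the session mentioned next; the variable names are arbitrary). It shows that an astral character encodes to UTF-8 without trouble, that a lone surrogate does not, and that the surrogateescape handler only exists to round-trip bytes that could not be decoded.

```python
# An astral character (U+1D510) has an ordinary four-byte UTF-8 encoding.
astral = "\U0001D510"
astral_bytes = astral.encode("utf-8")

# A lone surrogate has no UTF-8 encoding, so a strict encode fails.
try:
    "\ud835".encode("utf-8")
    surrogate_encodable = True
except UnicodeEncodeError:
    surrogate_encodable = False

# The UTF-8 bytes of the Arabic letter ain (U+0639).
raw = "\u0639".encode("utf-8")

# Decoding those bytes as ASCII with 'surrogateescape' cannot fail; each
# undecodable byte becomes one lone surrogate in the resulting str.
escaped = raw.decode("ascii", "surrogateescape")

# Encoding back with the same handler restores the original bytes.
restored = escaped.encode("utf-8", "surrogateescape")

print(astral_bytes, surrogate_encodable, restored == raw)
```

If a process runs with an ASCII filesystem encoding, strings like `escaped` are what a directory listing hands back for non-ASCII names, and a strict encode of such a string fails with a "codec can't encode" error.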
Anyway, there’s no need to go to all that trouble. Before running this session, I created a directory called ñ with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
Note:
In a comment, OP says that they had previously tried:
and received the error
Without more details, it’s hard to know how to respond to that. It’s possible that open will raise that error for a filesystem which doesn’t allow non-ASCII characters, but that wouldn’t be normal on Linux.
However, it’s worth noting that the string literal is not the string you think it is.
\x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
But that’s not recommended.
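To make the difference between \x escapes in str and bytes literals concrete, here is a small sketch. The original byte-literal snippet is not shown here, so this is only a guess at its general shape. It writes into a temporary directory so nothing is left behind, and the file name and contents are placeholders rather than anything from the question.

```python
import os
import tempfile

# In a str literal, each \x escape is one code point: U+00D8 and U+00B9.
as_text = "\xd8\xb9"

# In a bytes literal, the same escapes are two raw bytes, which happen to be
# the UTF-8 encoding of the Arabic letter ain (U+0639).
as_bytes = b"\xd8\xb9"
decoded = as_bytes.decode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    # open() accepts a bytes path; os.fsencode turns the directory into bytes.
    target = os.path.join(os.fsencode(tmp), as_bytes + b".txt")
    with open(target, "wb") as handle:
        handle.write(b"binary data")

    # Listing with a bytes path returns the names as raw bytes.
    names = os.listdir(os.fsencode(tmp))

print(len(as_text), decoded == "\u0639", names)
```

Working with bytes paths sidesteps the locale entirely, but as noted above, a plain str path is the simpler choice once the process is running with a UTF-8 locale.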