I have a function where the arguments passed are 5 filepaths. However, the first path is to a csv.gz where there seems to be an undefined character inside of the file. How can I work around this?
I’m using Python version 3.11.1. Code and error message shown below.
function(r"filepath1", r"filepath2", r"filepath3", r"filepath4", r"filepath5")
Error Message:
Cell In[3], line 8, in function(filepath1, filepath2, filepath3, filepath4, filepath5)
6 file1DateMap = {}
7 infd = open(file1path1, 'r')
8 infd.readline()
9 for line in infd:
10 tokens = line.strip().split(',')
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined
I tried
file = open(filename, encoding="utf8")
but encoding was undefined in my version of Python.
I tried the "with open" method
file2 = r"file2path"
file3 = r"file3path"
file4 = r"file4path"
file5 = r"file5path"
file1name = r"file1path"
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)
but the function was expecting a string:
TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper
I am expecting the function to run and write the processed output to folders on my desktop.
UPDATE
I checked the encoding of the file in Visual Studio Code, which reported UTF-8. I then wrote the following code:
with open(r"path1", encoding="utf8") as openfile1:
    file1 = openfile1.read()
Received this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
UPDATE 2
Checked encoding with this code
with open(r"filepath1") as f:
    print(f)
encoding='cp1252'
However now when I pass the new encoding argument:
with open(r"path1", encoding="cp1252") as openfile1:
    file1 = openfile1.read()
I am back to square 1 with the following error message:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined
UPDATE 3
Gzip worked, sort of. I used the following code:
import gzip
with gzip.open(r"path1", mode="rb") as openfile1:
    file1 = openfile1.read()
I was also able to read the first 10 lines of the file. However, when I passed it back into the function, it now gives me this error:
FileNotFoundError: [Errno 2] No such file or directory
It then prints all of the fields in the file. Does this have to do with the compression option?
UPDATE 4
I checked the current wd and the absolute file path for the newly opened gzip file:
cwd = os.getcwd()
cwd
out: venvpath
dir_path = os.path.dirname(os.path.realpath(file1))
dir_path
Received this error message:
File :700, in realpath(path, strict)
ValueError: _getfinalpathname: path too long for Windows
Not sure if this is contributing to not being able to pass the file into the function.
2 Answers
There are several bits of confusion present in this source code.
Please understand that file1 is an open file handle, with a type(...) of TextIOWrapper. It is iterable; you can request lines of text from it. In contrast, file2 et al. are str pathnames; you cannot request lines of filesystem text from those objects. The parallel structure of naming you chose for them is likely to confuse yourself plus any hapless maintenance engineer who encounters this code in the coming months. Recommend you adopt names like path2 .. path5.
Your default encoding appears to be code page 1252 (cp1252). You requested that encoding with open(file1name, 'r') by leaving out the optional encoding= parameter. Note that mode='r' is the default, so you could have left that one out as well. In contrast, open(filename, encoding="utf8") opened for read access using quite a different encoding.
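The difference between a pathname and a file handle, and the effect of omitting encoding=, can be sketched as follows. This is only an illustration: the file name file1.csv and its contents are hypothetical stand-ins written to a temporary directory, not the asker's data.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for one of the five paths in the question.
    path1 = os.path.join(tmp, "file1.csv")
    with open(path1, "w", encoding="utf-8") as out:
        out.write("date,value\n2023-01-01,1\n")

    # No encoding= argument: Python falls back to a platform-dependent default.
    with open(path1) as handle:
        path_is_str = isinstance(path1, str)
        handle_is_wrapper = isinstance(handle, io.TextIOWrapper)
        default_encoding = handle.encoding  # reported as cp1252 in the question
        first_line = next(handle)  # a handle yields lines; a str path does not

    # Passing a handle where a pathname is expected raises TypeError.
    with open(path1, encoding="utf-8") as handle:
        try:
            open(handle)
            type_error_message = None
        except TypeError as exc:
            type_error_message = str(exc)
```

A function that calls open() on its argument needs the pathname (path1 here), not the handle returned by a surrounding with open(...) block.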
The encoding is a property of the underlying .CSV file, and not of your program. That is, you must know what the correct underlying encoding is, and you must tell open the correct encoding. You can do that by default or you can do it explicitly, as long as you get it right. I recommend doing it explicitly.
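A small sketch of why the encoding has to match the file. The sample text is hypothetical; it was chosen because a right double quotation mark (U+201D) is stored in UTF-8 as the bytes E2 80 9D, and 0x9d has no mapping in Python's cp1252 codec, which is one plausible way to arrive at the error quoted in the question.

```python
import os
import tempfile

# Hypothetical sample row containing curly quotes (U+201C and U+201D).
sample = "id,comment\n1,\u201cquoted\u201d\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", encoding="utf-8", newline="") as out:
        out.write(sample)

    # Wrong guess: cp1252 cannot decode the 0x9d byte of U+201D.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        wrong_guess_failed = False
    except UnicodeDecodeError:
        wrong_guess_failed = True

    # Right guess: stating the file's real encoding decodes it cleanly.
    with open(path, encoding="utf-8") as f:
        text = f.read()
```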
If you don’t know the encoding, use /usr/bin/file, /usr/local/bin/iconv, or a text editor to learn what it is, and perhaps to change it to UTF-8 if you’re unhappy with its current encoding. Most files on most modern machines should be UTF-8 encoded; to do otherwise is to invite trouble. But I digress.
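Where those command-line tools are not available, a rough check can be made in Python by looking at the first bytes of the file. The sketch below assumes only that gzip streams begin with the two-byte signature 1f 8b; the helper name looks_gzipped and the sample files are hypothetical. The 0x8b at position 1 in the asker's UTF-8 error is consistent with that signature, which would mean the bytes need decompressing before any text encoding applies.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def looks_gzipped(path):
    """Return True if the file at `path` starts with the gzip signature."""
    with open(path, "rb") as raw:
        return raw.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "plain.csv")
    gz_path = os.path.join(tmp, "packed.csv.gz")

    # Write the same hypothetical content once as plain text, once gzipped.
    with open(plain_path, "w", encoding="utf-8") as out:
        out.write("a,b\n1,2\n")
    with gzip.open(gz_path, "wt", encoding="utf-8") as out:
        out.write("a,b\n1,2\n")

    plain_is_gz = looks_gzipped(plain_path)
    packed_is_gz = looks_gzipped(gz_path)
```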
Once you’ve settled on some known encoding, pass it in to open via the encoding= parameter and you’re in business!
If you have a CSV file compressed into a gzip file, you should be able to read the gzip file as simply as:
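A minimal sketch of such a read, assuming a hypothetical file named input.csv.gz that holds UTF-8 text. The sample file is created in a temporary directory first so that the snippet runs on its own.

```python
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "input.csv.gz")  # hypothetical file name

    # Create a small gzip-compressed CSV to read back.
    with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as out:
        out.write("date,value\n2023-01-01,1\n")

    # "rt" decompresses and decodes in one step, yielding str lines.
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        header = f.readline()
        remaining = [line.strip().split(",") for line in f]
```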
I believe you’ll want rt to read it as text (and not rb, which will return non-decoded bytes); and of course pick the actual encoding of the file (I always use utf-8 for my examples).
To further decode the CSV in the text file f, I recommend using the standard library’s csv module:
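A sketch of that combination, again with a hypothetical file name and hypothetical sample rows, and assuming UTF-8 content:

```python
import csv
import gzip
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "input.csv.gz")  # hypothetical file name

    # Write a sample row whose second field contains a comma and curly quotes.
    with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as out:
        csv.writer(out).writerows([
            ["date", "comment"],
            ["2023-01-01", "said \u201chello, world\u201d"],
        ])

    # newline="" lets the csv module handle line endings itself.
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # skip the header row, as infd.readline() did
        rows = list(reader)
```

Unlike line.strip().split(','), csv.reader keeps a quoted field that contains a comma in one piece.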