Visual Studio Code - Opening files (csv.gz) containing undefined characters and passing files into function

JTA1618
August 14, 2023

I have a function that takes 5 file paths as arguments. The first path points to a csv.gz file that seems to contain an undefined character. How can I work around this?

I’m using Python version 3.11.1. Code and error message shown below.

function(r"filepath1", r"filepath2", r"filepath3", r"filepath4", r"filepath5")

Error Message:

Cell In[3], line 8, in function(filepath1, filepath2, filepath3, filepath4, filepath5)
 6 file1DateMap = {}
 7 infd = open(filepath1, 'r')
 8 infd.readline()
 9 for line in infd:
10     tokens = line.strip().split(',')
 
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]
 
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

I tried

file = open(filename, encoding="utf8")

but encoding was undefined in my version of Python.

I tried the "with open" method

file2 = r"file2path"
file3 = r"file3path"
file4 = r"file4path"
file5 = r"file5path"
file1name = r"file1path"
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)

but the function was expecting a string:

TypeError: expected str, bytes or os.PathLike object, not TextIOWrapper

I am expecting the function to run and write the processed output to folders on my desktop.

UPDATE

I checked the encoding of the file in Visual Studio Code; it reported UTF-8. I wrote the following code:

with open(r"path1", encoding="utf8") as openfile1:
    file1 = openfile1.read()

Received this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

UPDATE 2

Checked encoding with this code

with open(r"filepath1") as f:
    print(f)

encoding='cp1252'

However now when I pass the new encoding argument:

with open(r"path1", encoding="cp1252") as openfile1:
    file1 = openfile1.read()

I am back to square 1 with the following error message:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 94: character maps to undefined

UPDATE 3

Gzip worked, sort of. I used the following code:

import gzip
with gzip.open(r"path1", mode="rb") as openfile1:
    file1 = openfile1.read()

I was also able to read the first 10 lines of the file. However, when I passed it back into the function, it now gives me this error:

FileNotFoundError: [Errno 2] No such file or directory

It then prints all of the fields in the file. Does this have to do with the compression option?

UPDATE 4

I checked the current wd and the absolute file path for the newly opened gzip file:

import os
cwd = os.getcwd()
cwd

out: venvpath

dir_path = os.path.dirname(os.path.realpath(file1))
dir_path

Received this error message:

File <frozen ntpath>:700, in realpath(path, strict)

ValueError: _getfinalpathname: path too long for Windows

Not sure if this is contributing to not being able to pass the file into the function.

Answers

- J_H
- August 13, 2023 at 8:59 pm
- 0 votes
There are several bits of confusion present in this source code.
```
with open(file1name, 'r') as file1:
    function(file1, file2, file3, file4, file5)
```
Please understand that file1 is an open file handle,
with a type(...) of TextIOWrapper.
It is iterable, you can request lines of text from it.
In contrast file2 et al. are str pathnames;
you cannot request lines of filesystem text from those objects.
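To make the distinction concrete, here is a minimal sketch. The name `count_lines` and the temporary file are hypothetical stand-ins for the asker's function and data; the only point is that a function which calls open() on its argument has to be given a path, not an already-open handle.

```python
import os
import tempfile


def count_lines(path):
    """Hypothetical stand-in for a function that opens its argument itself."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("a,b\n1,2\n")
    path1 = tmp.name

# Passing the path (a str) works.
line_count = count_lines(path1)

# Passing an open handle raises the same kind of TypeError as in the question,
# because open() is then called on a TextIOWrapper instead of a path.
with open(path1, encoding="utf-8") as handle1:
    try:
        count_lines(handle1)
        handle_error = None
    except TypeError as exc:
        handle_error = exc

os.remove(path1)
```

The alternative design, changing the function so that it accepts open handles, would also work, but then every caller has to agree on who opens and closes the file.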

The parallel naming you chose for them
is likely to confuse you plus any hapless maintenance
engineer who encounters this code in the coming months.
I recommend you adopt names like path2 .. path5.

Your default encoding appears to be
code page 1252 (cp1252).
You requested that encoding with open(file1name, 'r')
by leaving out the optional encoding= parameter.
Note that mode='r' is the default,
so you could have left that one out, as well.

In contrast, open(filename, encoding="utf8")
opened for read access using quite a different encoding.

The encoding is a property of the underlying .CSV file,
and not of your program.
That is, you must know what the correct underlying encoding is,
and you must tell open the correct encoding.
You can do that by default or you can do it explicitly,
as long as you get it right.
I recommend doing it explicitly.
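As an illustration only (it does not diagnose the asker's file), the sketch below shows how the same bytes decode cleanly with one codec and fail with another. The sample character is my own choice: the right double quotation mark, whose UTF-8 encoding happens to end in byte 0x9d, a byte that Python's cp1252 codec leaves undefined.

```python
# U+201D (right double quotation mark) is E2 80 9D in UTF-8.
data = "\u201d".encode("utf-8")

# Decoding with the codec the bytes were written in round-trips cleanly.
as_utf8 = data.decode("utf-8")

# cp1252 maps 0xE2 and 0x80 to characters but has no entry for 0x9d,
# so decoding fails with a 'charmap' UnicodeDecodeError.
try:
    data.decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc
```

Here `cp1252_error.start` is the offset of the offending byte, which is the "position" number reported in the error message.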

If you don’t know the encoding,
use /usr/bin/file, /usr/local/bin/iconv,
or a text editor to learn what it is,
and perhaps to change it to UTF-8
if you’re unhappy with its current encoding.
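It can also be worth checking whether the file is plain text at all before guessing at encodings. Every gzip stream starts with the two bytes 0x1f 0x8b, which matches the "byte 0x8b in position 1" in the question's UTF-8 error. A minimal sketch, where `looks_like_gzip` and the sample files are hypothetical names of mine:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as raw:
        return raw.read(2) == GZIP_MAGIC


# Build one compressed and one plain file so the check can be demonstrated.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.csv.gz")
txt_path = os.path.join(tmp_dir, "sample.csv")

with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write("a,b\n1,2\n")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("a,b\n1,2\n")

gz_detected = looks_like_gzip(gz_path)
txt_detected = looks_like_gzip(txt_path)
```

If the check is positive, the text encoding question applies to the decompressed content, not to the .gz file itself.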

Most files on most modern machines
should be UTF-8 encoded — to do otherwise
is to invite trouble. But I digress.

Once you’ve settled on some known encoding,
pass it in to open via the encoding=
parameter and you’re in business!

- ZachYoung
- August 14, 2023 at 5:19 pm
- 0 votes
If you have a CSV file compressed into a gzip file, you should be able to read the gzip file as simply as:
```
with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:
```
I believe you’ll want rt to read it as text (and not rb which will return non-decoded bytes); and of course pick the actual encoding of the file (I always use utf-8 for my examples).
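A small sketch of the difference between the two modes, using a throwaway file with made-up contents (the file name and data are assumptions for the demo):

```python
import gzip
import os
import tempfile

# Write a tiny gzip-compressed CSV; \u00eb and \u00fc are non-ASCII letters.
path = os.path.join(tempfile.mkdtemp(), "input.csv.gz")
with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
    f.write("name,city\nZo\u00eb,Z\u00fcrich\n")

# "rb" decompresses but does not decode: the result is bytes.
with gzip.open(path, "rb") as f:
    raw = f.read()

# "rt" decompresses and decodes with the given encoding: the result is str.
with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
    text = f.read()
```

Reading with "rb" and calling `raw.decode("utf-8")` yourself gives the same string; "rt" just does both steps in one place.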

To further decode the CSV in the text file f, I recommend using the standard library’s csv module:
```
import csv
import gzip

with gzip.open("input.csv.gz", "rt", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```
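For the "passing files into a function" part of the question, one option is to keep passing the path string and move the gzip handling inside the function. My guess, and it is only a guess, is that the later FileNotFoundError and "path too long" errors came from passing the decompressed contents where a path was expected. The sketch below uses hypothetical names (`load_rows`, `demo_path`) and invented sample data:

```python
import csv
import gzip
import os
import tempfile


def load_rows(csv_gz_path, encoding="utf-8"):
    """Open the gzip file by path inside the function and parse it as CSV."""
    with gzip.open(csv_gz_path, "rt", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader)  # plays the role of the bare infd.readline()
        return header, list(reader)


# Build a small compressed CSV so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows([["date", "value"], ["2023-08-14", "1"]])

# The caller passes a path, never file contents or an open handle.
header, rows = load_rows(demo_path)
```

The encoding default is an assumption; pass whatever encoding the decompressed CSV actually uses.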
