I write out a CSV using pandas and apply bzip2 compression as follows:
df.to_csv('/home/user/file.bz2', index=False, mode=writemode, header=header)
According to the documentation, to_csv infers from the filename that it should compress the output using bzip2.
This shrinks my ~100 MB CSV to ~23 MB.
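For reference, the compression can also be spelled out explicitly rather than inferred from the filename (the compression keyword is standard to_csv):

df.to_csv('/home/user/file.bz2', index=False, compression='bz2')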
However, if I decompress that bz2 file and run the resulting CSV back through bzip2 on my Mac with:

bzip2 /home/user/file

I get a file of ~7 MB! I get the same result running bzip2 on Debian.
What can cause this difference?
2 Answers
It turns out this was not caused by an outdated pandas, but by incorrect expectations on my side.
I actually create the dataset by appending to the CSV over the course of a day (every minute, to be precise), like so:
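Something along these lines; the exact mode/header bookkeeping is an assumption on my part:

import os
import pandas as pd

def append_minute(df: pd.DataFrame, path: str = '/home/user/file.bz2') -> None:
    # Create the file with a header on the first write; append without
    # a header afterwards. In append mode, each call adds its own
    # independent bzip2 stream to the file.
    first = not os.path.exists(path)
    df.to_csv(path, index=False, mode='w' if first else 'a', header=first)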
This results in the larger ~23 MB file: every append is compressed as its own independent bzip2 stream, so the compressor never gets to exploit redundancy across the chunks.
If I instead keep the DataFrame in memory over the course of the day (or in an uncompressed CSV) and only write out to a compressed file once at the end, I get the smaller ~7 MB file, regardless of pandas version.
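A minimal sketch of the write-once approach, where minute_frames() is a hypothetical stand-in for however the per-minute data arrives:

import pandas as pd

# Accumulate the per-minute frames and compress the whole day in one go,
# producing a single bzip2 stream over all the data.
chunks = [df for df in minute_frames()]  # minute_frames() is hypothetical
pd.concat(chunks).to_csv('/home/user/file.bz2', index=False)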
Not sure if this is what is happening in your case, but bzip2 supports different compression levels that trade off speed against size, and it's possible that the level chosen via pandas differs from the CLI tool's default. Using the bz2 library:
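A quick way to compare levels directly (a sketch; the path to the uncompressed CSV is assumed):

import bz2

# Compress the same uncompressed CSV at two levels and compare sizes.
with open('/home/user/file.csv', 'rb') as f:  # path is assumed
    raw = f.read()

for level in (1, 9):
    compressed = bz2.compress(raw, compresslevel=level)
    print(f'compresslevel={level}: {len(compressed):,} bytes')

Recent pandas versions can also set the level explicitly, e.g. compression={'method': 'bz2', 'compresslevel': 9}, if you want to rule this out from the pandas side.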
This one is a little strange in that compression level 9 is generally supposed to be slower but smaller; the dataset I generated here is pretty simplistic, though, so it may just be a bit of a degenerate case.