I write out a CSV using pandas and apply bzip2 compression as follows:
df.to_csv('/home/user/file.bz2', index=False, mode=writemode, header=header)
According to the documentation, to_csv infers from the filename that it should compress the output using bzip2.
This shrinks my ~100 MB CSV to ~23 MB.
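For reference, the compression can also be spelled out explicitly rather than inferred from the filename (the compression keyword is standard to_csv):

df.to_csv('/home/user/file.bz2', index=False, compression='bz2')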
However, if I decompress that bz2 file and run the resulting CSV back through bzip2 on my Mac with:

bzip2 /home/user/file

I get a file of ~7 MB! I get the same result running bzip2 on Debian.
What can cause this difference?
2 Answers
It turns out this was not caused by an outdated pandas, but by incorrect expectations on my side.
I actually create the dataset by appending to the CSV over the course of a day (every minute, to be precise), like so:
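Something along these lines; the exact mode/header bookkeeping is an assumption on my part:

import os
import pandas as pd

def append_minute(df: pd.DataFrame, path: str = '/home/user/file.bz2') -> None:
    # Create the file with a header on the first write; append without
    # a header afterwards. In append mode, each call adds its own
    # independent bzip2 stream to the file.
    first = not os.path.exists(path)
    df.to_csv(path, index=False, mode='w' if first else 'a', header=first)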
This results in the larger ~23 MB file: every append is compressed as its own independent bzip2 stream, so the compressor never gets to exploit redundancy across the chunks.
If I instead keep the DataFrame in memory over the course of the day (or in an uncompressed CSV) and only write out to a compressed file once at the end, I get the smaller ~7 MB file, regardless of pandas version.
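A minimal sketch of the write-once approach, where minute_frames() is a hypothetical stand-in for however the per-minute data arrives:

import pandas as pd

# Accumulate the per-minute frames and compress the whole day in one go,
# producing a single bzip2 stream over all the data.
chunks = [df for df in minute_frames()]  # minute_frames() is hypothetical
pd.concat(chunks).to_csv('/home/user/file.bz2', index=False)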
Not sure if this is what is happening in your case, but bzip2 supports different compression levels that trade off speed against size, and it's possible that the level chosen via pandas differs from the CLI tool's default. Using the bz2 library:
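A quick way to compare levels directly (a sketch; the path to the uncompressed CSV is assumed):

import bz2

# Compress the same uncompressed CSV at two levels and compare sizes.
with open('/home/user/file.csv', 'rb') as f:  # path is assumed
    raw = f.read()

for level in (1, 9):
    compressed = bz2.compress(raw, compresslevel=level)
    print(f'compresslevel={level}: {len(compressed):,} bytes')

Recent pandas versions can also set the level explicitly, e.g. compression={'method': 'bz2', 'compresslevel': 9}, if you want to rule this out from the pandas side.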
This one is a little strange in that compression level 9 is generally supposed to be slower but smaller; the dataset I generated here is pretty simplistic, though, so it may just be a bit of a degenerate case.