I write out a CSV using pandas and apply bzip2 compression as follows:

df.to_csv('/home/user/file.bz2', index=False, mode=writemode, header=header)

According to the documentation, to_csv infers from the .bz2 extension in the filename that it should compress the output with bzip2.

This shrinks my ~100 MB CSV to ~23 MB.

However, if I decompress that bz2 file and run the resulting CSV file through bzip2 on my Mac with:

bzip2 /home/user/file

I get a file of ~7 MB! I get the same result if I run bzip2 on Debian.

What can cause this difference?

2 Answers


  1. Chosen as BEST ANSWER

    It turns out this was not caused by an outdated pandas, but by incorrect expectations on my side.

    I actually create the dataset by appending to the compressed csv over the course of a day (every minute, to be precise), like this:

    if first_data_of_the_day:
        df.to_csv('/home/user/file.bz2', index=False, mode='w', header=True)
    else:
        df.to_csv('/home/user/file.bz2', index=False, mode='a', header=False)
    

    This results in the larger ~23 MB file.

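    If I instead keep the dataframe in memory over the course of the day (or in an uncompressed csv) and write it out to a compressed file only once at the end, I get the smaller ~7 MB file, regardless of pandas version. Presumably each append is compressed on its own as a small, separate bzip2 stream, so the compressor never gets to exploit the redundancy across the whole day's data.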
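
The effect described above can be reproduced with the standard library alone. This is a minimal sketch, not what pandas does internally: it assumes that appending to a .bz2 file behaves like reopening it with `bz2.open` in append mode, and the column names and batch contents are made-up stand-in data.

```python
import bz2
import os
import tempfile

# Hypothetical stand-in data: 1,000 one-minute batches of 10 CSV rows each.
header = "timestamp,sensor,value\n"
batches = [
    "".join(
        f"2024-01-01T{m // 60:02d}:{m % 60:02d}:{s:02d},sensor_{s % 3},{(m * s) % 97}\n"
        for s in range(10)
    )
    for m in range(1000)
]

with tempfile.TemporaryDirectory() as tmp:
    appended = os.path.join(tmp, "appended.bz2")
    single = os.path.join(tmp, "single.bz2")

    # Append pattern: each open/close in append mode adds a new,
    # independently compressed bzip2 stream to the end of the file.
    with bz2.open(appended, "wt") as f:
        f.write(header)
    for batch in batches:
        with bz2.open(appended, "at") as f:
            f.write(batch)

    # Write-once pattern: the whole day's data goes through one compressor.
    with bz2.open(single, "wt") as f:
        f.write(header + "".join(batches))

    appended_size = os.path.getsize(appended)
    single_size = os.path.getsize(single)

    # Both files decompress to the same text; the bz2 module reads
    # multi-stream files transparently.
    with bz2.open(appended, "rt") as f:
        appended_text = f.read()
    with bz2.open(single, "rt") as f:
        single_text = f.read()

print(f"appended: {appended_size} bytes, single: {single_size} bytes")
```

On this kind of repetitive data the appended file should come out noticeably larger than the write-once file, even though both hold identical content. Decompressing and recompressing in one pass, as the bzip2 command line tool does in the question, collapses the many small streams into one.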


  2. Not sure if this is what is happening in your case, but bzip2 supports different compression levels that trade off speed against size, and it's possible that the level pandas chooses differs from the default of the CLI tool. Here is a comparison using the bz2 library:

    # Setup assumed by this session: import bz2, numpy as np, pandas as pd
    In [118]: df = pd.DataFrame(np.random.randint(0, 100, [100000,5]))
    
    In [119]: len(df.to_csv(None))
    Out[119]: 2138880
    
    In [120]: len(bz2.compress(df.to_csv(None).encode('ascii'), compresslevel=1))
    Out[120]: 702709
    
    In [121]: len(bz2.compress(df.to_csv(None).encode('ascii'), compresslevel=9))
    Out[121]: 730415
    

    This result is a little odd, since level 9 is generally supposed to be slower but produce smaller output. The dataset I've generated here is pretty simplistic (random integers), though, so it may just be a degenerate case.
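
One way to check whether the level really differs between two files is to read it from the file itself: a bzip2 stream starts with the bytes `BZh` followed by a digit from 1 to 9 that records the level (block size) used. The helper below is a hypothetical illustration, not part of pandas or the bz2 module, and the sample data is made up.

```python
import bz2


def bzip2_level(data: bytes) -> int:
    """Return the compression level (1-9) recorded in a bzip2 stream header."""
    # A bzip2 stream begins with the magic bytes "BZh" and one level digit.
    if len(data) < 4 or data[:3] != b"BZh" or not (b"1" <= data[3:4] <= b"9"):
        raise ValueError("not a bzip2 stream")
    return int(data[3:4])


# Made-up sample data, compressed at an explicit level and at the module default.
sample = b"a,b,c\n1,2,3\n" * 1000
level_explicit = bzip2_level(bz2.compress(sample, compresslevel=1))
level_default = bzip2_level(bz2.compress(sample))

print(level_explicit, level_default)
```

For a file on disk, passing the first four bytes is enough, for example the result of `open(path, 'rb').read(4)`. If the file written by pandas and the file written by the CLI tool report the same digit, the compression level is unlikely to explain the size difference.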
