In-memory to_csv compression
Code Sample, a copy-pastable example if possible
# Attempt 1
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
# RuntimeWarning: compression has no effect when passing file-like object as input.
type(test)
# Out: str
# Attempt 2
from io import BytesIO
b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
# TypeError: a bytes-like object is required, not 'str'
Problem description
I am trying to gzip-compress a dataframe in memory (as opposed to writing it directly to a named file location). The use case is (I imagine) similar to the reason why to_csv now allows omitting the path in other cases to create an in-memory representation. Specifically, I need to save the compressed dataframe to a cloud location using a custom URI, so I'm keeping it in memory temporarily for that purpose.
Expected Output
I would expect the compression option to produce a compressed bytes object (similar to what the gzip library returns).
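For comparison, here is a minimal sketch of producing such a bytes object manually with the standard library's gzip module (a workaround, not the requested to_csv behavior); it reuses the df defined in the code sample above:

import gzip

# Serialize to CSV text first, then compress the encoded bytes in memory.
csv_bytes = df.to_csv().encode("utf-8")
compressed = gzip.compress(csv_bytes)
type(compressed)
# Out: bytes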
Thank you in advance for your help!
Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought its fix might cover this too, but it looks like it stopped just short of addressing my case.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Hey - thanks for the reply @gfyoung, and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (I can't be sure what the end user's local environment will look like).
My thought was something like this as a workaround:
For the 'gzip' compression, _get_handle() is not being called when a BytesIO() is passed. This causes it to fail at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py. If _get_handle() is called on the BytesIO(), then what happens is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails because GzipFile opens it as read-only. Setting the mode makes it work: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode)))
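As a standalone illustration (standard library only, outside pandas) of why the explicit mode matters:

import csv
import gzip
import io

buf = io.BytesIO()
# With only fileobj given, GzipFile falls back to mode 'rb' (a BytesIO has no
# .mode attribute to inherit), so writes fail; an explicit write mode works.
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
    csv.writer(text).writerow(["A", "B"])
    text.flush()
buf.getvalue()[:2]
# Out: b'\x1f\x8b' (the gzip magic number)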
The 'bz2' compression fix is the same. 'xz' will not compress a BytesIO() unless LZMACompressor is used (sketched below). 'zip' has the custom workflow referenced by dhimmel, which complicates it further.

There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.
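The LZMACompressor route for 'xz' mentioned above looks roughly like this (a sketch of the incremental stdlib API, not the pandas internals):

import io
import lzma

buf = io.BytesIO()
comp = lzma.LZMACompressor()
# Incremental compression: feed chunks of bytes, then flush the final block.
buf.write(comp.compress(b"A,B\n1,2\n"))
buf.write(comp.flush())
buf.getvalue()[:5]
# Out: b'\xfd7zXZ' (the xz magic number)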
In order to actually call _get_handle() on a BytesIO(), the elif hasattr(self.path_or_buf, 'write') branch in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.
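A hypothetical sketch of that dispatch distinction (the helper name here is illustrative, not actual pandas code):

import io

def needs_handle_wrapping(buf):
    # Illustrative only: text buffers such as StringIO can receive the CSV
    # string directly, while binary buffers such as BytesIO should go through
    # _get_handle() so the compression wrapper gets applied.
    return not isinstance(buf, io.TextIOBase)

needs_handle_wrapping(io.StringIO())
# Out: False (write CSV text directly)
needs_handle_wrapping(io.BytesIO())
# Out: True (route through _get_handle() / GzipFile)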
For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is because CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None (which it then treats as 'ascii').

This is the test code I was using:
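(The original snippet did not survive in this mirror; a minimal reconstruction of the kind of round trip being tested — my assumption, not the author's exact code — might be:)

import gzip
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

# Write a gzip-compressed CSV into an in-memory buffer...
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(df.to_csv(index=False).encode("utf-8"))

# ...then read it back to confirm the round trip.
buf.seek(0)
result = pd.read_csv(buf, compression="gzip")
assert result.equals(df)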