
In-memory to_csv compression

See original GitHub issue

Code Sample, a copy-pastable example if possible

# Attempt 1
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
# RuntimeWarning: compression has no effect when passing file-like object as input.
type(test)
# Out: str

# Attempt 2
from io import BytesIO

b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
# Out: TypeError: a bytes-like object is required, not 'str'

Problem description

I am trying to gzip-compress a dataframe in memory (as opposed to writing directly to a named file location). The use case is, I imagine, similar to the reason why to_csv now allows omitting the path in other cases to create an in-memory representation. Specifically, I need to save the compressed dataframe to a cloud location using a custom URI, and I'm temporarily keeping it in memory for that purpose.

Expected Output

I would expect the compression option to produce a compressed bytes object (similar to what the gzip library returns).
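The expected behaviour can be approximated by compressing the CSV text manually with the standard-library gzip module. A minimal sketch of that workaround (not pandas behaviour at the time of this report):

```python
import gzip
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# Render the CSV in memory as text, then gzip the encoded bytes.
csv_bytes = df.to_csv(index=False).encode("utf-8")
gz_bytes = gzip.compress(csv_bytes)

# Round-trip: decompressing recovers the original CSV bytes.
assert gzip.decompress(gz_bytes) == csv_bytes
```

The resulting gz_bytes object can then be handed to whatever upload API the cloud location expects.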

Thank you in advance for your help!

Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought it might also fix this, but it looks like it stopped just short of covering my case.
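For readers finding this issue later: more recent pandas releases do accept a binary buffer together with compression in to_csv, so the originally requested behaviour works directly. A quick check, assuming a reasonably modern pandas is installed (verify against your version's release notes):

```python
import gzip
from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

buf = BytesIO()
# On recent pandas this writes gzip-compressed CSV bytes into the buffer.
df.to_csv(buf, compression="gzip", index=False)

csv_text = gzip.decompress(buf.getvalue()).decode("utf-8")
```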

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 15
  • Comments: 11 (6 by maintainers)

Top GitHub Comments

ZaxR commented, Sep 6, 2018 (5 reactions)

Hey - thanks for the reply @gfyoung, and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (I can't be sure what the end user's local environment will look like).

My thought was something like this as a workaround:

import gzip
from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
b_buf = BytesIO()
with gzip.open(b_buf, 'wb') as f:
    f.write(df.to_string().encode())  # note: to_string() emits a fixed-width table, not CSV
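A close variant of the same workaround that writes actual CSV (rather than to_string()'s fixed-width table) into the compressed buffer:

```python
import gzip
from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

b_buf = BytesIO()
with gzip.open(b_buf, "wb") as f:
    # to_csv() with no path returns the CSV as a str; encode it to bytes.
    f.write(df.to_csv(index=False).encode("utf-8"))

# b_buf now holds gzip-compressed CSV, ready to upload.
compressed = b_buf.getvalue()
```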
silverdrake11 commented, Oct 13, 2018 (4 reactions)

For 'gzip' compression, _get_handle() is not called when a BytesIO() is passed. This causes the write to fail at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py.

If _get_handle() is called on BytesIO(), then what happens is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails because GzipFile opens the buffer as read-only by default. Setting the mode explicitly works: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode)))

The 'bz2' compression fix is the same. 'xz' will not compress a BytesIO() unless LZMACompressor is used. 'zip' has the custom workflow referenced by dhimmel, which complicates it further.

There is too much logic in _get_handle, and it is called in many places for both reading and writing. One idea is to split it into _get_read_handle and _get_write_handle. Alternatively, _get_handle_python2 and _get_handle_python3 could be an option.

In order to actually call _get_handle() on a BytesIO(), the elif hasattr(self.path_or_buf, 'write') branch in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.

For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is because CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None, which effectively means 'ascii'.

This is the test code I was using:

hello = BytesIO()
test = df.to_csv(hello, compression='gzip')
print(hello.getvalue())
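The read-only default that silverdrake11 describes is standard-library behaviour: GzipFile falls back to 'rb' when the fileobj has no discernible mode, so passing the mode explicitly is what makes the wrapped writer work. A sketch of the fixed call chain using only the standard library:

```python
import csv
import gzip
from io import BytesIO, TextIOWrapper

buf = BytesIO()
# BytesIO has no .mode attribute, so GzipFile would default to 'rb';
# an explicit mode="wb" makes the wrapped file writable.
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    with TextIOWrapper(gz, encoding="utf-8", newline="") as text:
        writer = csv.writer(text)
        writer.writerow(["A", "B", "C"])
        writer.writerow([1, 5, 9])

# Decompressing the buffer recovers the CSV header row.
header = gzip.decompress(buf.getvalue()).decode("utf-8").splitlines()[0]
```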
