In-memory to_csv compression
Code Sample, a copy-pastable example if possible
# Attempt 1
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
# RuntimeWarning: compression has no effect when passing file-like object as input.
type(test)
# Out: str
# Attempt 2
from io import BytesIO
b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
# TypeError: a bytes-like object is required, not 'str'
Problem description
I am trying to gzip-compress a dataframe in memory (as opposed to writing it directly to a named file location). The use case is (I imagine) similar to the reason why to_csv now allows omitting the path in other cases to create an in-memory representation. Specifically, I need to save the compressed dataframe to a cloud location using a custom URI, so I'm keeping it in memory temporarily for that purpose.
Expected Output
I would expect the compression option to produce a compressed bytes object (similar to what the gzip library returns).
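For comparison, here is a minimal sketch of producing such a bytes object manually with the standard library's gzip module (a workaround, not the requested to_csv behavior); it reuses the df defined in the code sample above:

import gzip

# Serialize to CSV text first, then compress the encoded bytes in memory.
csv_bytes = df.to_csv().encode("utf-8")
compressed = gzip.compress(csv_bytes)
type(compressed)
# Out: bytes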
Thank you in advance for your help!
Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought its fix might cover this too, but it looks like it stopped just short of addressing my case.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Top GitHub Comments
Hey - thanks for the reply @gfyoung, and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (I can't be sure what the end user's local environment will look like).
My thought was something like this as a workaround:
For the 'gzip' compression, _get_handle() is not being called when a BytesIO() is passed. This causes it to fail at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py. If _get_handle() is called on the BytesIO(), then what happens is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails because GzipFile opens it as read-only. Setting the mode makes it work: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode)))
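As a standalone illustration (standard library only, outside pandas) of why the explicit mode matters:

import csv
import gzip
import io

buf = io.BytesIO()
# With only fileobj given, GzipFile falls back to mode 'rb' (a BytesIO has no
# .mode attribute to inherit), so writes fail; an explicit write mode works.
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
    csv.writer(text).writerow(["A", "B"])
    text.flush()
buf.getvalue()[:2]
# Out: b'\x1f\x8b' (the gzip magic number)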
The 'bz2' compression fix is the same. 'xz' will not compress a BytesIO() unless LZMACompressor is used (sketched below). 'zip' has the custom workflow referenced by dhimmel, which complicates it further.

There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.
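The LZMACompressor route for 'xz' mentioned above looks roughly like this (a sketch of the incremental stdlib API, not the pandas internals):

import io
import lzma

buf = io.BytesIO()
comp = lzma.LZMACompressor()
# Incremental compression: feed chunks of bytes, then flush the final block.
buf.write(comp.compress(b"A,B\n1,2\n"))
buf.write(comp.flush())
buf.getvalue()[:5]
# Out: b'\xfd7zXZ' (the xz magic number)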
In order to actually call _get_handle() on a BytesIO(), the elif hasattr(self.path_or_buf, 'write') branch in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.
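A hypothetical sketch of that dispatch distinction (the helper name here is illustrative, not actual pandas code):

import io

def needs_handle_wrapping(buf):
    # Illustrative only: text buffers such as StringIO can receive the CSV
    # string directly, while binary buffers such as BytesIO should go through
    # _get_handle() so the compression wrapper gets applied.
    return not isinstance(buf, io.TextIOBase)

needs_handle_wrapping(io.StringIO())
# Out: False (write CSV text directly)
needs_handle_wrapping(io.BytesIO())
# Out: True (route through _get_handle() / GzipFile)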
For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is because CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None (which it then treats as 'ascii').

This is the test code I was using:
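(The original snippet did not survive in this mirror; a minimal reconstruction of the kind of round trip being tested — my assumption, not the author's exact code — might be:)

import gzip
import io

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

# Write a gzip-compressed CSV into an in-memory buffer...
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
    gz.write(df.to_csv(index=False).encode("utf-8"))

# ...then read it back to confirm the round trip.
buf.seek(0)
result = pd.read_csv(buf, compression="gzip")
assert result.equals(df)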