Error when writing a very large dataframe to CSV with gzip compression
Code Sample, a copy-pastable example if possible
df.to_csv('file.txt.gz', sep='\t', compression='gzip')
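The snippet above assumes df already exists. A synthetic frame matching the shape, dtype and index style reported by df.info() further down (about 2.6 GB of float64 data, so it needs several GB of free RAM) can serve as a hypothetical stand-in to reproduce the error:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the undisclosed data: same shape, dtype and
# index style as reported by df.info() below.
df = pd.DataFrame(
    np.random.rand(10319, 33707),
    index=['Sample%d' % (i + 1) for i in range(10319)])
# On macOS with pandas 0.23.0 this reportedly fails with OSError: [Errno 22].
df.to_csv('file.txt.gz', sep='\t', compression='gzip')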
Problem description
I receive this error while writing a very large dataframe to a file:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-28-48e45479ccfb> in <module>()
----> 1 df.to_csv('file.txt.gz', sep='\t', compression='gzip')
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
1743 doublequote=doublequote,
1744 escapechar=escapechar, decimal=decimal)
-> 1745 formatter.save()
1746
1747 if path_or_buf is None:
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
156 f.close()
157 with open(self.path_or_buf, 'r') as f:
--> 158 data = f.read()
159 f, handles = _get_handle(self.path_or_buf, self.mode,
160 encoding=encoding,
OSError: [Errno 22] Invalid argument
I cannot disclose the data, but running df.info() gives the following information:
<class 'pandas.core.frame.DataFrame'>
Index: 10319 entries, Sample1 to Sample10319
Columns: 33707 entries, A1BG to ZZZ3
dtypes: float64(33707)
memory usage: 2.6+ GB
Looking at the file on disk, the dataframe appears to have been dumped incompletely and without compression.
I am working with 16 GB of RAM on macOS 10.13.4 (17E202).
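Judging from the traceback, pandas 0.23.0 apparently writes the CSV uncompressed first, then reads the whole file back with a single f.read() and rewrites it through a compressed handle; on macOS, a read or write of more than roughly 2 GB in one call can fail with OSError: [Errno 22] Invalid argument, which would also explain the incomplete, uncompressed file left on disk. A rough sketch of the pattern the traceback suggests (a reconstruction, not the actual pandas source):
import gzip

# Reconstruction of the apparent write/read-back pattern in csvs.py;
# the function name is made up for illustration.
def to_csv_gzip(df, path):
    df.to_csv(path, sep='\t')          # 1. write the CSV uncompressed
    with open(path, 'r') as f:
        data = f.read()                # 2. read everything back in one call
                                       #    (this is where Errno 22 is raised)
    with gzip.open(path, 'wt') as f:   # 3. rewrite the data compressed
        f.write(data)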
Expected Output
The dataframe should be written completely to file.txt.gz with gzip compression, without raising an error.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comments (5 in total, 3 by maintainers):
I’m also having this problem with the current version and have to use a workaround.
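One common workaround for this bug (not necessarily the one referred to above) is to open the gzip file yourself and pass the handle to to_csv, so the data is compressed as it is written and never has to be read back; a minimal sketch:
import gzip

# Compress while writing instead of letting pandas compress afterwards.
# df is the large dataframe from the report above.
with gzip.open('file.txt.gz', 'wt') as f:
    df.to_csv(f, sep='\t')
Because gzip compresses the stream as pandas writes it, the uncompressed data never has to pass through a single multi-gigabyte read or write.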
I have the same issue with a very large file, even without compression.
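For the uncompressed case, one possible mitigation (an assumption, not taken from the comment) is to write the frame in smaller row batches so that no single call has to handle gigabytes at once, either via the chunksize argument or by appending slices manually:
# Hypothetical mitigation: let to_csv flush rows in smaller batches.
df.to_csv('file.txt', sep='\t', chunksize=1000)

# Equivalent manual version, appending one slice at a time.
step = 1000
for start in range(0, len(df), step):
    df.iloc[start:start + step].to_csv(
        'file.txt', sep='\t',
        mode='w' if start == 0 else 'a',
        header=(start == 0))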