Error when writing a very large dataframe to CSV with gzip compression
Code Sample, a copy-pastable example if possible
df.to_csv('file.txt.gz', sep='\t', compression='gzip')
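The snippet above assumes df already exists. A synthetic frame matching the shape, dtype and index style reported by df.info() further down (about 2.6 GB of float64 data, so it needs several GB of free RAM) can serve as a hypothetical stand-in to reproduce the error:
import numpy as np
import pandas as pd

# Hypothetical stand-in for the undisclosed data: same shape, dtype and
# index style as reported by df.info() below.
df = pd.DataFrame(
    np.random.rand(10319, 33707),
    index=['Sample%d' % (i + 1) for i in range(10319)])
# On macOS with pandas 0.23.0 this reportedly fails with OSError: [Errno 22].
df.to_csv('file.txt.gz', sep='\t', compression='gzip')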
Problem description
I receive this error while writing a very large dataframe to a file:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-28-48e45479ccfb> in <module>()
----> 1 df.to_csv('file.txt.gz', sep='\t', compression='gzip')
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
1743 doublequote=doublequote,
1744 escapechar=escapechar, decimal=decimal)
-> 1745 formatter.save()
1746
1747 if path_or_buf is None:
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
156 f.close()
157 with open(self.path_or_buf, 'r') as f:
--> 158 data = f.read()
159 f, handles = _get_handle(self.path_or_buf, self.mode,
160 encoding=encoding,
OSError: [Errno 22] Invalid argument
I cannot disclose the data, but running df.info() gives the following information:
<class 'pandas.core.frame.DataFrame'>
Index: 10319 entries, Sample1 to Sample10319
Columns: 33707 entries, A1BG to ZZZ3
dtypes: float64(33707)
memory usage: 2.6+ GB
Looking at the file on disk, the dataframe appears to have been dumped incompletely and without compression.
I am working with 16 GB of RAM on macOS 10.13.4 (17E202).
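Judging from the traceback, pandas 0.23.0 apparently writes the CSV uncompressed first, then reads the whole file back with a single f.read() and rewrites it through a compressed handle; on macOS, a read or write of more than roughly 2 GB in one call can fail with OSError: [Errno 22] Invalid argument, which would also explain the incomplete, uncompressed file left on disk. A rough sketch of the pattern the traceback suggests (a reconstruction, not the actual pandas source):
import gzip

# Reconstruction of the apparent write/read-back pattern in csvs.py;
# the function name is made up for illustration.
def to_csv_gzip(df, path):
    df.to_csv(path, sep='\t')          # 1. write the CSV uncompressed
    with open(path, 'r') as f:
        data = f.read()                # 2. read everything back in one call
                                       #    (this is where Errno 22 is raised)
    with gzip.open(path, 'wt') as f:   # 3. rewrite the data compressed
        f.write(data)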
Expected Output
The dataframe should be written completely to file.txt.gz with gzip compression, without raising an error.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
Comments (5 in total, 3 by maintainers):
I’m also having this problem with the current version and have to use a workaround.
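One common workaround for this bug (not necessarily the one referred to above) is to open the gzip file yourself and pass the handle to to_csv, so the data is compressed as it is written and never has to be read back; a minimal sketch:
import gzip

# Compress while writing instead of letting pandas compress afterwards.
# df is the large dataframe from the report above.
with gzip.open('file.txt.gz', 'wt') as f:
    df.to_csv(f, sep='\t')
Because gzip compresses the stream as pandas writes it, the uncompressed data never has to pass through a single multi-gigabyte read or write.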
I have the same issue with a very large file, even without compression.
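For the uncompressed case, one possible mitigation (an assumption, not taken from the comment) is to write the frame in smaller row batches so that no single call has to handle gigabytes at once, either via the chunksize argument or by appending slices manually:
# Hypothetical mitigation: let to_csv flush rows in smaller batches.
df.to_csv('file.txt', sep='\t', chunksize=1000)

# Equivalent manual version, appending one slice at a time.
step = 1000
for start in range(0, len(df), step):
    df.iloc[start:start + step].to_csv(
        'file.txt', sep='\t',
        mode='w' if start == 0 else 'a',
        header=(start == 0))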