
Error when writing very big dataframe to csv, with gzip compression

See original GitHub issue

Code Sample, a copy-pastable example if possible

df.to_csv('file.txt.gz', sep='\t', compression='gzip')

Problem description

I receive this error while writing a very big dataframe to file:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-28-48e45479ccfb> in <module>()
----> 1 df.to_csv('file.txt.gz', sep='\t', compression='gzip')

~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    156                 f.close()
    157                 with open(self.path_or_buf, 'r') as f:
--> 158                     data = f.read()
    159                 f, handles = _get_handle(self.path_or_buf, self.mode,
    160                                          encoding=encoding,

OSError: [Errno 22] Invalid argument

I cannot disclose the data, but running df.info() reports the following:

<class 'pandas.core.frame.DataFrame'>
Index: 10319 entries, Sample1 to Sample10319
Columns: 33707 entries, A1BG to ZZZ3
dtypes: float64(33707)
memory usage: 2.6+ GB
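
As a quick size check (my arithmetic, not part of the original report): 10319 rows × 33707 float64 columns × 8 bytes ≈ 2.78 × 10^9 bytes, about 2.6 GiB, which matches the figure above and is already past the 2^31-byte (≈ 2.15 GB) mark that turns out to matter here.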

Looking at the file on disk, the dataframe appears to have been written only partially, and without compression.

I am working with 16 GB of RAM on macOS 10.13.4 (17E202).
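
The traceback shows pandas writing the frame uncompressed and then reading the entire file back with a single f.read() before compressing it; with data this size, that read exceeds 2 GiB. A workaround that appears to sidestep this path entirely is to open the gzip handle yourself and let to_csv stream through it. A minimal sketch, reusing the filename from the example above:

import gzip

# Compress through the handle directly, so pandas never has to
# re-read the whole uncompressed file in one call.
with gzip.open('file.txt.gz', 'wt') as f:
    df.to_csv(f, sep='\t')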

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

6 reactions
hannahlindsley commented, Jul 6, 2018

I’m also having this problem with the current version; I’m having to use this workaround:

# Split the dataframe into 10,000-row chunks so that no single
# to_csv call writes an oversized buffer.
n = 10000
list_df = [data[i:i+n] for i in range(0, data.shape[0], n)]

# Write the first chunk with the header, then append the rest without it.
list_df[0].to_csv("data/iob.csv", index=False)

for l in list_df[1:]:
    l.to_csv("data/iob.csv", index=False, header=False, mode='a')
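
A variant of the same idea that streams the slices directly instead of materializing them in a second list (a sketch, assuming data is the same in-memory DataFrame as above):

n = 10000

# Header and write mode differ only for the first slice; every call
# hands the OS a small buffer.
for start in range(0, data.shape[0], n):
    data.iloc[start:start + n].to_csv(
        "data/iob.csv",
        index=False,
        header=(start == 0),
        mode='w' if start == 0 else 'a',
    )
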
2 reactions
VelizarVESSELINOV commented, Jun 25, 2018

I have the same issue with a very large file, without compression:

2018-06-25 12:44:27,378|root|64215|MainProcess|CRITICAL| Exception Information
2018-06-25 12:44:27,380|root|64215|MainProcess|CRITICAL| Type: <class 'OSError'>
2018-06-25 12:44:27,381|root|64215|MainProcess|CRITICAL| Value: [Errno 22] Invalid argument

File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
    formatter.save()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
    f.write(buf)
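
If both reports share a root cause, it looks like the long-standing macOS limitation where a single read or write of more than 2 GiB fails with Errno 22 (tracked on the CPython side as bpo-24658; the pandas CSV compression path also appears to have been reworked in later releases). If so, it can be reproduced without pandas at all. A minimal sketch, for an affected macOS/Python 3.6 combination:

# Assumption: the failure is the >2 GiB single-call I/O limit on macOS.
payload = b'\0' * (2**31 + 1)  # just over 2 GiB

with open('/tmp/big.bin', 'wb') as f:
    f.write(payload)  # OSError: [Errno 22] Invalid argument on affected systems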
