DataFrame to_csv line_terminator inconsistency when using compression

Code Sample, a copy-pastable example if possible

df.to_csv('uncompressed.csv')
df.to_csv('compressed-wrong-line-terminator.csv.gz')
df.to_csv('compressed-good-line-terminator.csv.gz', line_terminator='\n')

Problem description

The default line_terminator differs depending on whether compression is used (Windows, pandas 0.24.1).

After decompressing the gzip file written with the default line_terminator, the result clearly differs from the uncompressed output (compressed-wrong-line-terminator.csv vs. uncompressed.csv). Only with an explicit line_terminator='\n' is the decompressed file identical to the file written without compression (compressed-good-line-terminator.csv.gz, once decompressed, vs. uncompressed.csv).
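The comparison described above can be checked on raw bytes rather than by eye. The following is a minimal sketch using only the standard library; the helper names `count_line_endings` and `matches_uncompressed` are hypothetical (they are not part of pandas), and it assumes the files from the code sample have already been written.

```python
import gzip
from pathlib import Path


def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare LFs, and CRs that are not part of a CRLF pair."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }


def matches_uncompressed(plain_path, gz_path) -> bool:
    """Return True if the decompressed gzip file is byte-identical to the plain file."""
    plain = Path(plain_path).read_bytes()
    with gzip.open(gz_path, "rb") as handle:
        return handle.read() == plain
```

According to the report, `matches_uncompressed('uncompressed.csv', 'compressed-wrong-line-terminator.csv.gz')` would be expected to return False on the reporter's setup, while the same call with 'compressed-good-line-terminator.csv.gz' would return True. Running `count_line_endings` on the decompressed bytes shows which terminator was actually written.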

Note that passing line_terminator='\n' explicitly for an uncompressed file produces output that differs from the file written without any line_terminator argument. The user is therefore forced to specify line_terminator explicitly for compressed files only.

This behavior is problematic, especially with the latest pandas version, where compression is inferred from the file extension: one would expect the line terminator to be handled just as transparently.
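Until the defaults agree, the workaround described above (an explicit '\n' for compressed output only) can be driven by the same file-extension check. This is a rough sketch, not pandas behavior: `terminator_for` and `COMPRESSED_EXTENSIONS` are hypothetical names, the extension list is an assumption about which suffixes trigger compression inference, and whether '\n' is the right choice on platforms other than the reporter's Windows setup is untested here.

```python
import os

# Assumption: file extensions from which compression is inferred.
COMPRESSED_EXTENSIONS = (".gz", ".bz2", ".zip", ".xz")


def terminator_for(path):
    """Return '\\n' for paths that look compressed, else None (use the library default)."""
    ext = os.path.splitext(str(path))[1].lower()
    return "\n" if ext in COMPRESSED_EXTENSIONS else None


# Hypothetical usage with the pandas 0.24.1 keyword from the code sample:
#   df.to_csv(path, line_terminator=terminator_for(path))
```

Later pandas releases spell the keyword `lineterminator`, so the commented call would need adjusting there.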

Expected Output

As stated above, the file produced by line 2 of the code sample (after decompression) is expected to be identical to the file produced by line 1. Instead, only line 3 (after decompression) produces a file identical to the output of line 1.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
jointfull commented, Mar 11, 2019

👏

0 reactions
gfyoung commented, Mar 6, 2019

I am pretty certain that these two issues are tied to the same underlying issue, so let’s see what comes of #25048 and then return to this one if need be.
