Suggestion: changing default `float_format` in `DataFrame.to_csv()`

Current behavior

If I read a CSV file, do nothing with it, and save it again, I would expect Pandas to keep the format the CSV had before. But that is not the case.

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())

This gives the following output:

01/01/17 23:00,1.05148,1.0515299999999999,1.05148,1.0515299999999999,4
01/01/17 23:01,1.0515299999999999,1.0515299999999999,1.0515299999999999,1.0515299999999999,4
01/01/17 23:02,1.0517,1.05175,1.0517,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.0517,1.0517,1.0517,1.0517,4
01/01/17 23:11,1.0517299999999998,1.05174,1.0517299999999998,1.05174,4
01/01/17 23:13,1.0517299999999998,1.0517299999999998,1.0517299999999998,1.0517299999999998,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.0520399999999999,1.0523799999999999,1.0520399999999999,1.0523799999999999,4

What is wrong with that?

I would consider this to be unintuitive/undesirable behavior.

The written numbers have that representation because the original number cannot be represented exactly as a float. That is expected when working with floats. However, it means we are writing the last digit to the CSV even though we know it is not exact, because of float-precision limitations.
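A minimal sketch of that float-precision point, using only the standard library (the variable names are illustrative, not from the issue). `decimal.Decimal` exposes the exact value stored for a 64-bit float, and 17 significant digits are always enough to round-trip one, which is why such long representations can appear in output.

```python
from decimal import Decimal

# Parse the decimal text into a 64-bit float, as a CSV reader would.
x = float("1.05153")

# Decimal(x) shows the exact binary value that is actually stored.
exact = Decimal(x)
print(exact)
print(exact == Decimal("1.05153"))  # the stored value is not exactly 1.05153

# Two ways of writing the float back out as text.
shortest = repr(x)        # shortest string that round-trips (Python 3)
seventeen = "%.17g" % x   # always 17 significant digits

print(shortest, seventeen)
print(float(shortest) == x, float(seventeen) == x)
```

Both strings parse back to the same float; they differ only in how many digits they spend doing so.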

Proposed change

I think that last digit, which we know is not precise anyway, should be rounded when writing to a CSV file.

Maybe by changing the default of DataFrame.to_csv()'s float_format parameter from None to '%.16g'? (Or at least by making .to_csv() use '%.16g' when no float_format is specified.)

Note that I propose rounding to the float’s precision, which for a 64-bit float would mean that 1.0515299999999999 could be rounded to 1.05153, while 1.0515299999999992 could be rounded to 1.051529999999999, and 1.051529999999981 would not be rounded at all.

Example:

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
1.05153,1.05175
1.0515299999999991,1.051529999999981
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None, float_format='%.16g')
    print(Path(tmpfile.name).read_text())

Which gives:

1.05153,1.05175
1.051529999999999,1.051529999999981

Which also adds some errors, but keeps a cleaner output:

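| State  | Loaded             | Written            | Error              |
| ------ | ------------------ | ------------------ | ------------------ |
| Before | 1.05153            | 1.0515299999999999 | 0.0000000000000001 |
| After  | 1.05153            | 1.05153            | 0.0                |
| Before | 1.0515299999999991 | 1.0515299999999992 | 0.0000000000000001 |
| After  | 1.0515299999999991 | 1.051529999999999  | 0.0000000000000001 |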

Note that the errors are similar, but the “After” output is more consistent with the input (in all cases where the input float is not written out to the last, imprecise digit).
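The “After” rows can be checked without pandas. The following is a rough sketch under the assumption that plain `%`-formatting with '%.16g' behaves like the proposed float_format; `written_error` is a hypothetical helper, not a pandas function. It also shows the cost of the proposal: 16 significant digits do not always round-trip a 64-bit float, whereas 17 do.

```python
from decimal import Decimal


def written_error(loaded: str, float_format: str) -> tuple:
    """Parse `loaded` as a float, format it, and return (written, |written - loaded|)."""
    value = float(loaded)
    written = float_format % value
    # Compare the two decimal strings exactly, without going through floats again.
    return written, abs(Decimal(written) - Decimal(loaded))


for loaded in ("1.05153", "1.0515299999999991", "1.051529999999981"):
    written, error = written_error(loaded, "%.16g")
    print(loaded, written, error)

# The trade-off: 16 digits can map two different floats to the same text.
x = 0.1 + 0.2
sixteen = "%.16g" % x
print(sixteen, float(sixteen) == x)   # the parsed value is a different float
print(float("%.17g" % x) == x)        # 17 digits always recover the float
```

So the proposed format gives cleaner files for data that started out with fewer than 16 significant digits, at the price of no longer guaranteeing a bit-exact round trip for arbitrary floats.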

Also, I think that in most cases a CSV does not have floats written out to the last (imprecise) digit. See the precedents just below (other software that outputs CSVs without that last imprecise digit).

Precedents

Neither MATLAB nor R uses that last imprecise digit when converting to CSV (they round it).

Top GitHub Comments

IngvarLa commented, May 30, 2018 (2 reactions)

Not sure if this thread is active; anyway, here are my thoughts. I am not a regular pandas user, but I inherited some code that uses dataframes and the to_csv() method. With an update of our Linux OS, we also updated our Python modules, and I saw this change: in pandas 0.19.2 floating-point numbers were written as str(num), which has 12 digits of precision, while in pandas 0.22.0 they are written as repr(num), which has 17 digits of precision. There is a fair bit of noise in the last digit, enough that the last digit can vary between different hardware. str(num) is intended for human consumption, while repr(num) is the official representation, so it is reasonable that repr(num) is the default. Still, it would be nice if there were an option to write out the numbers with str(num) again. That makes it easier to compare output without having to use tolerances.
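The 12-digit output described in this comment can be approximated with an explicit format string. A minimal sketch, assuming '%.12g' (the format Python 2's str() used for floats; in Python 3, str() and repr() of a float give the same text). The helper name is made up for illustration, and the same string could presumably be passed as float_format to to_csv(), as the issue does with '%.16g'.

```python
def format_12_digits(value: float) -> str:
    """Format a float with 12 significant digits (hypothetical helper)."""
    return "%.12g" % value


a = 1.05153
b = 1.0515299999999999  # a value with noise in the last decimal digit

# Both values collapse to the same text at 12 significant digits,
# so the outputs can be compared as strings without a tolerance.
print(format_12_digits(a), format_12_digits(b))
print(format_12_digits(a) == format_12_digits(b))
```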

janosh commented, Aug 8, 2019 (1 reaction)

Saving a dataframe to CSV isn’t so much a computation as rather a logging operation, I think.
