Suggestion: changing default `float_format` in `DataFrame.to_csv()`
Current behavior
If I read a CSV file, do nothing with it, and save it again, I would expect Pandas to keep the format the CSV had before. But that is not the case.
from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile
import pandas
input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')
df = pandas.read_csv(input_csv, header=None)
with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())
This produces the following output:
01/01/17 23:00,1.05148,1.0515299999999999,1.05148,1.0515299999999999,4
01/01/17 23:01,1.0515299999999999,1.0515299999999999,1.0515299999999999,1.0515299999999999,4
01/01/17 23:02,1.0517,1.05175,1.0517,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.0517,1.0517,1.0517,1.0517,4
01/01/17 23:11,1.0517299999999998,1.05174,1.0517299999999998,1.05174,4
01/01/17 23:13,1.0517299999999998,1.0517299999999998,1.0517299999999998,1.0517299999999998,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.0520399999999999,1.0523799999999999,1.0520399999999999,1.0523799999999999,4
What is wrong with that?
I would consider this to be unintuitive/undesirable behavior.
The written numbers have that representation because the original number cannot be represented exactly as a float. That is expected when working with floats. However, it means we are writing the last digit to the CSV even though we know it is not exact because of float-precision limitations.
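To make the representation issue concrete, here is a small standard-library sketch (not part of the original report) that inspects the double stored for `1.05153`. It assumes Python 3, and the variable names are purely illustrative.

```python
from decimal import Decimal

x = float("1.05153")

# Exact decimal expansion of the binary double stored for 1.05153.
# It is close to, but not equal to, the decimal literal.
exact = Decimal(x)
print(exact)

# 17 significant digits always round-trip a 64-bit float,
# but they can expose noise in the last digit.
s17 = "%.17g" % x
print(s17, float(s17) == x)

# 16 significant digits hide that noise for this value.
s16 = "%.16g" % x
print(s16)  # 1.05153
```

The trade-off is that 16 significant digits are not always enough to recover the identical float, whereas 17 always are, so the cleaner text comes at the cost of a guaranteed round trip.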
Proposed change
I think that last digit, knowing it is not precise anyway, should be rounded when writing to a CSV file.
Maybe by changing the default of `DataFrame.to_csv()`'s `float_format` parameter from `None` to `'%.16g'`? (Or at least make `.to_csv()` use `'%.16g'` when no `float_format` is specified.)
Note that I propose rounding to the float’s precision, which for a 64-bit float would mean that `1.0515299999999999` would be rounded to `1.05153`, while `1.0515299999999992` would be rounded to `1.051529999999999`, and `1.051529999999981` would not be rounded at all.
Example:
from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile
import pandas
input_csv = StringIO('''
1.05153,1.05175
1.0515299999999991,1.051529999999981
''')
df = pandas.read_csv(input_csv, header=None)
with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None, float_format='%.16g')
    print(Path(tmpfile.name).read_text())
Which gives:
1.05153,1.05175
1.051529999999999,1.051529999999981
This also introduces some error, but gives a cleaner output:
State | Loaded | Written | Error |
---|---|---|---|
Before | 1.05153 | 1.0515299999999999 | 0.0000000000000001 |
After | 1.05153 | 1.05153 | 0.0 |
Before | 1.0515299999999991 | 1.0515299999999992 | 0.0000000000000001 |
After | 1.0515299999999991 | 1.051529999999999 | 0.0000000000000001 |
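As a sanity check on the table, the following sketch recomputes the "Error" column from the "Loaded" and "Written" strings. It uses `decimal.Decimal` so that the check does not itself suffer from float rounding. The row data is copied from the table above; the `rows` and `errors` names are illustrative only.

```python
from decimal import Decimal

# (state, loaded, written) exactly as they appear in the table.
rows = [
    ("Before", "1.05153", "1.0515299999999999"),
    ("After", "1.05153", "1.05153"),
    ("Before", "1.0515299999999991", "1.0515299999999992"),
    ("After", "1.0515299999999991", "1.051529999999999"),
]

# Absolute difference between the written text and the loaded text,
# computed in exact decimal arithmetic.
errors = [abs(Decimal(written) - Decimal(loaded)) for _, loaded, written in rows]

for (state, loaded, written), err in zip(rows, errors):
    print(f"{state:6} {loaded:20} {written:20} {err:.16f}")

# The "After" value in the last row is what '%.16g' yields for that input.
print("%.16g" % float("1.0515299999999991"))  # 1.051529999999999
```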
Note that the errors are similar, but the “After” output is more consistent with the input (in all cases where the float is not written out to the last, imprecise digit).
Also, I think that in most cases a CSV does not have floats represented to the last (imprecise) digit. See the precedents just below (other software that outputs CSVs without using that last imprecise digit).
Precedents
Neither MATLAB nor R uses that last imprecise digit when converting to CSV (they round it).
Issue Analytics
- State:
- Created 6 years ago
- Reactions:8
- Comments:24 (17 by maintainers)
Top GitHub Comments
Not sure if this thread is active, anyway here are my thoughts. I am not a regular pandas user, but inherited some code that uses dataframes and uses the to_csv() method. With an update of our Linux OS, we also update our python modules, and I saw this change: in pandas 0.19.2 floating point numbers were written as str(num), which has 12 digits precision, in pandas 0.22.0 they are written as repr(num) which has 17 digits precision. There is a fair bit of noise in the last digit, enough that when using different hardware the last digit can vary. The str(num) is intended for human consumption, while repr(num) is the official representation, so reasonable that repr(num) is default. Still, it would be nice if there was an option to write out the numbers with str(num) again. Makes it easier to compare output without having to use tolerances.
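For illustration, the 12-digit behavior this comment attributes to the old `str(num)` can be approximated with the `'%.12g'` format. The sketch below uses only the standard library and assumes Python 3, where `str` and `repr` of a float give the same text. The values `a` and `b` are made up for the example.

```python
# Two floats that differ only by noise in the last digits.
a = 0.1 + 0.2
b = 0.3

# Full-precision text differs, so a plain text comparison fails.
print(repr(a), repr(b))          # 0.30000000000000004 0.3

# With 12 significant digits both print identically.
print("%.12g" % a, "%.12g" % b)  # 0.3 0.3
```

Under that assumption, passing `float_format='%.12g'` to `to_csv()` would presumably give output that can be compared as text without tolerances, at the cost of discarding the trailing digits.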
Saving a dataframe to CSV isn’t so much a computation as rather a logging operation, I think.