Suggestion: changing default `float_format` in `DataFrame.to_csv()`

Current behavior

If I read a CSV file, do nothing with it, and save it again, I would expect Pandas to keep the format the CSV had before. But that is not the case.

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
01/01/17 23:00,1.05148,1.05153,1.05148,1.05153,4
01/01/17 23:01,1.05153,1.05153,1.05153,1.05153,4
01/01/17 23:02,1.05170,1.05175,1.05170,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.05170,1.05170,1.05170,1.05170,4
01/01/17 23:11,1.05173,1.05174,1.05173,1.05174,4
01/01/17 23:13,1.05173,1.05173,1.05173,1.05173,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.05204,1.05238,1.05204,1.05238,4
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None)
    print(Path(tmpfile.name).read_text())

Would give us this output:

01/01/17 23:00,1.05148,1.0515299999999999,1.05148,1.0515299999999999,4
01/01/17 23:01,1.0515299999999999,1.0515299999999999,1.0515299999999999,1.0515299999999999,4
01/01/17 23:02,1.0517,1.05175,1.0517,1.05175,4
01/01/17 23:03,1.05174,1.05175,1.05174,1.05175,4
01/01/17 23:08,1.0517,1.0517,1.0517,1.0517,4
01/01/17 23:11,1.0517299999999998,1.05174,1.0517299999999998,1.05174,4
01/01/17 23:13,1.0517299999999998,1.0517299999999998,1.0517299999999998,1.0517299999999998,4
01/01/17 23:14,1.05174,1.05174,1.05174,1.05174,4
01/01/17 23:16,1.0520399999999999,1.0523799999999999,1.0520399999999999,1.0523799999999999,4

What is wrong with that?

I would consider this to be unintuitive/undesirable behavior.

The written numbers have that representation because the original number cannot be represented exactly as a float. That is expected when working with floats. However, it means we are writing the last digit to the CSV even though we know it is not exact due to float-precision limitations.
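A quick way to see this, sketched with Python's standard `decimal` module rather than pandas (the variable names are illustrative, not from the issue): converting a float to `Decimal` exposes the exact binary value it stores, which differs slightly from the decimal text that was read.

```python
from decimal import Decimal

typed = "1.05153"        # the text found in the CSV
stored = float(typed)    # nearest 64-bit float to that text

# Decimal(float) converts the stored binary value exactly, digit for digit.
exact = Decimal(stored)
print(exact)             # a long expansion close to, but not equal to, 1.05153

# The stored value is not exactly the decimal that was typed.
print(exact == Decimal(typed))             # False

# 17 significant digits are always enough to recover the same float.
print(float("%.17g" % stored) == stored)   # True
```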

Proposed change

I think that last digit, which we know is not precise anyway, should be rounded when writing to a CSV file.

Maybe by changing the default of DataFrame.to_csv()'s float_format parameter from None to '%.16g'? (Or at least by making .to_csv() use '%.16g' when no float_format is specified.)

Note that I propose rounding to the float’s precision, which for a 64-bit float would mean that 1.0515299999999999 would be rounded to 1.05153, 1.0515299999999992 would be rounded to 1.051529999999999, and 1.051529999999981 would not be rounded at all.
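The three cases can be checked without pandas, using plain `%` formatting with the same `'%.16g'` specifier. This is only a sketch; `round_to_16` is a hypothetical helper name introduced for illustration.

```python
def round_to_16(value: float) -> str:
    """Format a float with 16 significant digits, as '%.16g' does."""
    return "%.16g" % value


# The noisy 17th digit is rounded away and trailing zeros are dropped.
print(round_to_16(1.0515299999999999))  # 1.05153

# 17 significant digits are cut down to 16.
print(round_to_16(1.0515299999999992))  # 1.051529999999999

# Already 16 significant digits, so nothing changes.
print(round_to_16(1.051529999999981))   # 1.051529999999981
```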

Example:

from io import StringIO
from pathlib import Path
from tempfile import NamedTemporaryFile

import pandas

input_csv = StringIO('''
1.05153,1.05175
1.0515299999999991,1.051529999999981
''')

df = pandas.read_csv(input_csv, header=None)

with NamedTemporaryFile() as tmpfile:
    df.to_csv(tmpfile.name, index=False, header=None, float_format='%.16g')
    print(Path(tmpfile.name).read_text())

Which gives:

1.05153,1.05175
1.051529999999999,1.051529999999981

Which also adds some errors, but keeps a cleaner output:

State	Loaded	Written	Error
Before	1.05153	1.0515299999999999	0.0000000000000001
After	1.05153	1.05153	0.0
Before	1.0515299999999991	1.0515299999999992	0.0000000000000001
After	1.0515299999999991	1.051529999999999	0.0000000000000001

Note that the errors are similar, but the “After” output seems more consistent with the input (in all cases where the float is not written out to the last, imprecise digit).

Also, I think that in most cases a CSV does not have floats represented to the last (imprecise) digit. See the precedents just below (other software that writes CSVs without that last imprecise digit).
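The trade-off behind "adds some errors" can be made concrete with a different, well-known value (`0.1 + 0.2`, chosen here for illustration and not taken from the issue). With 16 significant digits the text is cleaner but no longer parses back to the same float, while 17 digits always round-trip.

```python
value = 0.1 + 0.2        # stored as 0.30000000000000004

short = "%.16g" % value  # 16 significant digits
full = "%.17g" % value   # 17 significant digits

# Cleaner text, but reading it back yields a different float.
print(short, float(short) == value)   # 0.3 False

# Noisier text, but reading it back yields exactly the same float.
print(full, float(full) == value)     # 0.30000000000000004 True
```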

Precedents

Neither MATLAB nor R uses that last imprecise digit when converting to CSV (they round it).

Issue Analytics

State:
Created 6 years ago
Reactions: 8
Comments: 24 (17 by maintainers)

Top GitHub Comments

2 reactions

IngvarLa commented, May 30, 2018

Not sure if this thread is active; anyway, here are my thoughts. I am not a regular pandas user, but I inherited some code that uses dataframes and the to_csv() method. With an update of our Linux OS we also updated our Python modules, and I saw this change: in pandas 0.19.2 floating-point numbers were written as str(num), which has 12 digits of precision, while in pandas 0.22.0 they are written as repr(num), which has 17 digits of precision. There is a fair bit of noise in the last digit, enough that it can vary between different hardware. str(num) is intended for human consumption, while repr(num) is the official representation, so it is reasonable that repr(num) is the default. Still, it would be nice if there was an option to write out the numbers with str(num) again. It makes it easier to compare output without having to use tolerances.
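For reference, a small sketch of the difference this comment describes. It assumes Python 3, where `str()` and `repr()` of a float return the same shortest round-tripping text, so the 12-digit form has to be requested explicitly with `'%.12g'`. The value `0.1 + 0.2` is illustrative only.

```python
value = 0.1 + 0.2

# On Python 3 both calls give the same, full-precision text.
print(repr(value))       # 0.30000000000000004
print(str(value))        # 0.30000000000000004

# Twelve significant digits hide the noise in the trailing digits.
print("%.12g" % value)   # 0.3
```

Passing `float_format='%.12g'` to `DataFrame.to_csv()` applies the same format string to float columns, which may be enough to compare outputs without tolerances.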

1 reaction

janosh commented, Aug 8, 2019

Saving a dataframe to CSV isn’t so much a computation as rather a logging operation, I think.