Serializing a DataFrame with floats via to_csv casts integer-valued floats to integers
import pandas as pd
df = pd.DataFrame({
'col_1': [1.0, 2.0, 3.0],
'col_2': [21.0, 31.0, 32.0],
'col_3': [101.0, 102.0, 103.0]
})
df.to_csv("test.csv", float_format="%.17g")
df
# col_1 col_2 col_3
# 0 1.0 21.0 101.0
# 1 2.0 31.0 102.0
# 2 3.0 32.0 103.0
df.dtypes
# col_1 float64
# col_2 float64
# col_3 float64
# dtype: object
The serialized CSV:
,col_1,col_2,col_3
0,1,21,101
1,2,31,102
2,3,32,103
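Reading the file back then silently changes the dtypes. A minimal round-trip check (assuming the test.csv written above is on disk):
import pandas as pd
df2 = pd.read_csv("test.csv", index_col=0)
df2.dtypes
# col_1 int64
# col_2 int64
# col_3 int64
# dtype: object
Writing with float_format="%.17e" instead keeps a decimal point (and exponent) in every value, so read_csv infers float64 again.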
https://docs.python.org/3/library/string.html#format-specification-mini-language
According to the Python docs, the presentation type 'g' formats floats as follows:
… In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it, unless the ‘#’ option is used. …
Any particular reason why the scientific notation ('e') is not used?
Scientific notation. For a given precision p, formats the number in scientific notation with the letter ‘e’ separating the coefficient from the exponent. The coefficient has one digit before and p digits after the decimal point, for a total of p + 1 significant digits. With no precision given, uses a precision of 6 digits after the decimal point for float, and shows all coefficient digits for Decimal. If no digits follow the decimal point, the decimal point is also removed unless the # option is used.
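A concrete comparison of the two presentation types, in plain Python:
"%.17g" % 1.0   # '1'  -- 'g' drops the decimal point and trailing zeros
"%#.17g" % 1.0  # '1.0000000000000000'  -- the '#' option keeps them
"%.17e" % 1.0   # '1.00000000000000000e+00'  -- 'e' always keeps the point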
Top GitHub Comments
I think that if isclose is consistent for floats and ints, then we shouldn't have the (c.2) problem. It doesn't matter that the "integer floats" got converted to int; they would be compared in the same way as if they were float.
I don't find having atol and rtol be floats that weird; it seems very practical: you can check that 10**30 + 1 and 10**30 are within a 1e-30 relative tolerance or a 1.0 absolute tolerance without problems.
I agree with most of your points, they are all nice contributions! My main concerns here are backward compatibility/consistency, and avoiding the noise of modifying all datasets.
I’m unsure.
(a) Someone saves pd.DataFrame({'col_1': [1.0, 2.0, 3.0]}); those are meant to be floats.
(b) Someone loads it again. Once this PR is merged, they come back as integers.
(c) Since they are integers, they'll be handled with integer comparison. (c.1) If you decide to use np.isclose, then integers will be converted back to float. I find this path weird, but perhaps it is guaranteed that no bugs will arise from it. But then again, having atol and rtol be floats is weird for integer comparison. Or (c.2) if you decide to handle integers in a different way, then your floats are now saved as integers and integer comparison applies, which (for me) would be unexpected.
I'm really unsure if I'm missing something here. I tried to put the path "on paper" to check it myself, but isn't that what would happen?
Perhaps (c.1) is the best path to take (and is what was being attempted in the first place). If all agree on that, I won't oppose - I just wanted to be sure edge cases would be handled correctly, such as if I have numbers that are not representable as integers, or the other way around, or if I have tolerances that may truncate up vs truncate down when comparing to integers, etc. One such edge case is sketched below.
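To make that concrete: np.isclose casts integer inputs to float64, so integers above 2**53 that are not exactly representable can compare equal even with zero tolerance (a small sketch of the (c.1) path, assuming NumPy's current casting behaviour):
import numpy as np

a = np.array([2**53 + 1], dtype=np.int64)
b = np.array([2**53], dtype=np.int64)
# The internal cast to float64 rounds 2**53 + 1 down to 2**53,
# so the two values compare equal even with zero tolerances.
np.isclose(a, b, rtol=0, atol=0)  # array([ True])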