Serializing a DataFrame with floats via to_csv casts integer-valued floats to integers
import pandas as pd
df = pd.DataFrame({
'col_1': [1.0, 2.0, 3.0],
'col_2': [21.0, 31.0, 32.0],
'col_3': [101.0, 102.0, 103.0]
})
df.to_csv("test.csv", float_format="%.17g")
df
# col_1 col_2 col_3
# 0 1.0 21.0 101.0
# 1 2.0 31.0 102.0
# 2 3.0 32.0 103.0
df.dtypes
# col_1 float64
# col_2 float64
# col_3 float64
# dtype: object
The serialized CSV:
,col_1,col_2,col_3
0,1,21,101
1,2,31,102
2,3,32,103
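Reading the file back then silently changes the dtypes. A minimal round-trip check (assuming the test.csv written above is on disk):
import pandas as pd
df2 = pd.read_csv("test.csv", index_col=0)
df2.dtypes
# col_1 int64
# col_2 int64
# col_3 int64
# dtype: object
Writing with float_format="%.17e" instead keeps a decimal point (and exponent) in every value, so read_csv infers float64 again.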
https://docs.python.org/3/library/string.html#format-specification-mini-language
According to the Python docs, the presentation type 'g' formats floats as follows:
… In both cases insignificant trailing zeros are removed from the significand, and the decimal point is also removed if there are no remaining digits following it, unless the ‘#’ option is used. …
Any particular reason why the scientific notation ('e') is not used?
Scientific notation. For a given precision p, formats the number in scientific notation with the letter ‘e’ separating the coefficient from the exponent. The coefficient has one digit before and p digits after the decimal point, for a total of p + 1 significant digits. With no precision given, uses a precision of 6 digits after the decimal point for float, and shows all coefficient digits for Decimal. If no digits follow the decimal point, the decimal point is also removed unless the # option is used.
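A concrete comparison of the two presentation types, in plain Python:
"%.17g" % 1.0   # '1'  -- 'g' drops the decimal point and trailing zeros
"%#.17g" % 1.0  # '1.0000000000000000'  -- the '#' option keeps them
"%.17e" % 1.0   # '1.00000000000000000e+00'  -- 'e' always keeps the point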
Top GitHub Comments
I think that if isclose is consistent for floats and ints, then we shouldn't have the (c.2) problem. It doesn't matter that the "integer floats" got converted to int; they would be compared in the same way as if they were float.
I don't find having atol and rtol be floats that weird; it seems very practical: you can check that 10**30 + 1 and 10**30 are within a 1e-30 relative tolerance or a 1.0 absolute tolerance without problems.
I agree with most of your points, they are all nice contributions! My main concerns here are backward compatibility/consistency, and avoiding the noise of modifying all datasets.
I’m unsure.
(a) Someone saves pd.DataFrame({'col_1': [1.0, 2.0, 3.0]}); those are meant to be floats.
(b) Someone loads it again. Once this PR is merged, they come back as integers.
(c) Since they are integers, they'll be handled with integer comparison. (c.1) If you decide to use np.isclose, then integers will be converted back to float. I find this path weird, but perhaps it is guaranteed that no bugs will arise from it. But then again, having atol and rtol be floats is weird for integer comparison. Or (c.2) if you decide to handle integers in a different way, then your floats are now saved as integers and integer comparison applies, which (for me) would be unexpected.
I'm really unsure if I'm missing something here. I tried to put the path "on paper" to check it myself, but isn't that what would happen?
Perhaps (c.1) is the best path to take (and is what was being attempted in the first place). If all agree on that, I won't oppose - I just wanted to be sure edge cases would be handled correctly, such as if I have numbers that are not representable as integers, or the other way around, or if I have tolerances that may truncate up vs truncate down when comparing to integers, etc. One such edge case is sketched below.
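To make that concrete: np.isclose casts integer inputs to float64, so integers above 2**53 that are not exactly representable can compare equal even with zero tolerance (a small sketch of the (c.1) path, assuming NumPy's current casting behaviour):
import numpy as np

a = np.array([2**53 + 1], dtype=np.int64)
b = np.array([2**53], dtype=np.int64)
# The internal cast to float64 rounds 2**53 + 1 down to 2**53,
# so the two values compare equal even with zero tolerances.
np.isclose(a, b, rtol=0, atol=0)  # array([ True])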