BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float'
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
```python
import pandas as pd
import numpy as np

# Series (int input): result is 608202560.0
pd.to_numeric(pd.Series([608202549]), downcast='float')

# Series (float input): result is 608202560.0
pd.to_numeric(pd.Series([608202549.0]), downcast='float')

# Scalar: result is 608202549 (the issue does not occur)
pd.to_numeric(608202549, downcast='float')

# List: result is array([608202560, 608202560])
pd.to_numeric([608202548, 608202549], downcast='float').astype(int)

# Range: first mismatch appears at 16777217
arr = np.arange(0, 6e8, dtype='int64')
a = pd.Series(arr)
b = pd.to_numeric(a, downcast='float').astype(int)
np.where(a != b)[0]
```
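For context on why the threshold is 16,777,217: it equals 2**24 + 1, the first integer that a float32's 24-bit significand cannot represent exactly. A minimal numpy check (not part of the original report) illustrates this:

```python
import numpy as np

# 2**24 is exactly representable in float32; 2**24 + 1 is not.
print(np.float32(16_777_216) == 16_777_216)  # True
print(np.float32(16_777_217) == 16_777_217)  # False: rounds to 16777216.0

# At the scale of 608202549, adjacent float32 values are 64 apart,
# so the input snaps to the nearest multiple of 64.
print(int(np.float32(608_202_549)))  # 608202560
```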
Issue Description
When to_numeric() is used with downcast='float' on a pandas Series or list containing values >= 16,777,217, the resulting float does not match the input number:
```python
>>> pd.to_numeric(pd.Series([608202549]), downcast='float')
0    608202560.0
>>> pd.to_numeric(pd.Series([608202549.0]), downcast='float')
0    608202560.0
```
This behavior does not occur when the value is passed as a scalar, but it does appear when a list is passed. Comparing values across a range shows the issue starts at 16,777,217 (2**24 + 1) and continues onward. This seems rooted in 32-bit floating point precision, discussed in a Stack Overflow thread as well as in another pandas issue on float precision in rank, #18271.
Edit: I have updated this issue to make clear that this affects both integers and floats of 16,777,217 and above.
Expected Behavior
to_numeric() should yield the same effective number in float form as the original value(s).
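One way to express this expectation is a lossless round-trip guard. The sketch below is illustrative only (the helper name and round-trip check are not pandas API): downcast only when every value survives the cast to float32 exactly.

```python
import numpy as np
import pandas as pd

def downcast_if_lossless(s: pd.Series) -> pd.Series:
    """Illustrative helper: return a float32 Series only when the
    cast is exact; otherwise keep float64."""
    narrow = s.astype(np.float32)
    if np.array_equal(narrow.astype(np.float64).to_numpy(),
                      s.astype(np.float64).to_numpy()):
        return narrow
    return s.astype(np.float64)

downcast_if_lossless(pd.Series([1.0, 2.5])).dtype   # float32: exact
downcast_if_lossless(pd.Series([608202549])).dtype  # float64: lossy cast avoided
```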
Installed Versions
pd.show_versions()
```
INSTALLED VERSIONS
------------------
commit           : 73c68257545b5f8530b7044f56647bd2db92e2ba
python           : 3.9.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.6.0
Version          : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.3
numpy            : 1.21.2
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.0.1
setuptools       : 54.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
```
Issue Analytics
- Created 2 years ago
- Comments: 9 (9 by maintainers)
Got it on the atol; I was having a brain freeze and interpreting it as rtol. I see your point.
Got it. To be clear, this changes behavior for people who are (legitimately) trying to store, say, 3.14e12 as a 32-bit float, though I do see how this eliminates a footgun for other (likely more numerous) users.
In your example, you didn't set the rtol parameter to 0.0 as the PR does. The default value is 1e-05, which at the scale of the example (6082025490) equates to +/- 60820; the PR only considers the absolute difference, so in your example the default relative tolerance is what absorbs the difference.

Indeed, this is what the PR is doing: preventing the downcast because these values are not downcastable within the expected precision tolerance.
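A small demonstration of the tolerance arithmetic above may help (a sketch using the comment's numbers; np.isclose defaults are rtol=1e-05, atol=1e-08):

```python
import numpy as np

x = 6_082_025_490
# The default rtol=1e-05 allows a band of roughly +/- 60820 at this scale:
print(np.isclose(x, x + 60_000))            # True
# With rtol=0.0 (as in the PR, per the discussion) only the tiny default
# atol remains, so the same pair is no longer considered "close":
print(np.isclose(x, x + 60_000, rtol=0.0))  # False
```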
I think this is what the PR is proposing, since the function has the same restriction on the int side: if the int does not fit, the next size up is used. Similarly, if the float does not fit, the next size up should be used.
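The int-side behavior referenced here can be checked directly; a quick sketch (behavior as observed on pandas 1.3):

```python
import pandas as pd

# Values that fit are downcast to the smallest sufficient dtype...
print(pd.to_numeric(pd.Series([1, 2]), downcast='integer').dtype)   # int8
# ...while a value that overflows int32 keeps the next size up:
print(pd.to_numeric(pd.Series([2**31]), downcast='integer').dtype)  # int64
```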