BUG: to_numeric incorrectly converting values >= 16,777,217 with downcast='float'
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
```python
import pandas as pd
import numpy as np

# Series (int input): result is 608202560.0
pd.to_numeric(pd.Series([608202549]), downcast='float')

# Series (float input): result is 608202560.0
pd.to_numeric(pd.Series([608202549.0]), downcast='float')

# Scalar: result is 608202549 (the issue does not occur)
pd.to_numeric(608202549, downcast='float')

# List: result is array([608202560, 608202560])
pd.to_numeric([608202548, 608202549], downcast='float').astype(int)

# Range: first mismatch appears at 16777217
arr = np.arange(0, 6e8, dtype='int64')
a = pd.Series(arr)
b = pd.to_numeric(a, downcast='float').astype(int)
np.where(a != b)[0]
```
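For context on why the threshold is 16,777,217: it equals 2**24 + 1, the first integer that a float32's 24-bit significand cannot represent exactly. A minimal numpy check (not part of the original report) illustrates this:

```python
import numpy as np

# 2**24 is exactly representable in float32; 2**24 + 1 is not.
print(np.float32(16_777_216) == 16_777_216)  # True
print(np.float32(16_777_217) == 16_777_217)  # False: rounds to 16777216.0

# At the scale of 608202549, adjacent float32 values are 64 apart,
# so the input snaps to the nearest multiple of 64.
print(int(np.float32(608_202_549)))  # 608202560
```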
Issue Description
When to_numeric() is used with downcast='float' on a pandas Series or list containing values >= 16,777,217, the resulting float does not match the input number:
```python
>>> pd.to_numeric(pd.Series([608202549]), downcast='float')
0    608202560.0
>>> pd.to_numeric(pd.Series([608202549.0]), downcast='float')
0    608202560.0
```
This behavior does not occur when the value is passed as a scalar, but it does appear when a list is passed. Comparing values across a range shows the issue starts at 16,777,217 (2**24 + 1) and continues onward. This seems rooted in 32-bit floating point precision, discussed in a Stack Overflow thread as well as in another pandas issue on float precision in rank, #18271.
Edit: I have updated this issue to make clear that this affects both integers and floats of 16,777,217 and above.
Expected Behavior
to_numeric() should yield the same effective number in float form as the original value(s).
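One way to express this expectation is a lossless round-trip guard. The sketch below is illustrative only (the helper name and round-trip check are not pandas API): downcast only when every value survives the cast to float32 exactly.

```python
import numpy as np
import pandas as pd

def downcast_if_lossless(s: pd.Series) -> pd.Series:
    """Illustrative helper: return a float32 Series only when the
    cast is exact; otherwise keep float64."""
    narrow = s.astype(np.float32)
    if np.array_equal(narrow.astype(np.float64).to_numpy(),
                      s.astype(np.float64).to_numpy()):
        return narrow
    return s.astype(np.float64)

downcast_if_lossless(pd.Series([1.0, 2.5])).dtype   # float32: exact
downcast_if_lossless(pd.Series([608202549])).dtype  # float64: lossy cast avoided
```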
Installed Versions
pd.show_versions()
```
INSTALLED VERSIONS
------------------
commit           : 73c68257545b5f8530b7044f56647bd2db92e2ba
python           : 3.9.4.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.6.0
Version          : Darwin Kernel Version 20.6.0: Wed Jun 23 00:26:27 PDT 2021; root:xnu-7195.141.2~5/RELEASE_ARM64_T8101
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.3.3
numpy            : 1.21.2
pytz             : 2021.1
dateutil         : 2.8.2
pip              : 21.0.1
setuptools       : 54.2.0
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
```
Issue Analytics
- Created 2 years ago
- Comments: 9 (9 by maintainers)
Got it on the atol; I was having a brain freeze and interpreting it as rtol. I see your point.
Got it. To be clear, this changes behavior for people who are (legitimately) trying to store, say, 3.14e12 as a 32-bit float, though I do see how this eliminates a footgun for other (likely more numerous) users.
In your example, you didn't set the rtol parameter to 0.0 as the PR does. The default value is 1e-05, which at the scale of the example (6082025490) equates to +/- 60820; the PR only considers the absolute difference, so in your example the default relative tolerance is what absorbs the difference.

Indeed, this is what the PR is doing: preventing the downcast because these values are not downcastable within the expected precision tolerance.
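A small demonstration of the tolerance arithmetic above may help (a sketch using the comment's numbers; np.isclose defaults are rtol=1e-05, atol=1e-08):

```python
import numpy as np

x = 6_082_025_490
# The default rtol=1e-05 allows a band of roughly +/- 60820 at this scale:
print(np.isclose(x, x + 60_000))            # True
# With rtol=0.0 (as in the PR, per the discussion) only the tiny default
# atol remains, so the same pair is no longer considered "close":
print(np.isclose(x, x + 60_000, rtol=0.0))  # False
```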
I think this is what the PR is proposing, since the function has the same restriction on the int side: if the int does not fit, the next size up is used. Similarly, if the float does not fit, the next size up should be used.
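The int-side behavior referenced here can be checked directly; a quick sketch (behavior as observed on pandas 1.3):

```python
import pandas as pd

# Values that fit are downcast to the smallest sufficient dtype...
print(pd.to_numeric(pd.Series([1, 2]), downcast='integer').dtype)   # int8
# ...while a value that overflows int32 keeps the next size up:
print(pd.to_numeric(pd.Series([2**31]), downcast='integer').dtype)  # int64
```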