BUG: aggregation of np.float16/np.float32 is wrong for big dataset
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd; print(pd.__version__)
import numpy as np; print(np.__version__)
N = 70_000_000
df = pd.DataFrame({'A': np.random.normal(4,1,N).astype(np.float32)})
print(np.mean(df['A'].values)) # Returns 4.0000944 <-- Correct
print(np.mean(df['A'])) # Returns 1.917656660079956 <-- Wrong!
print(df['A'].mean()) # Returns 1.917656660079956 <-- written like this, it looks like a pandas-related bug
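As a quick sanity check (my addition, not part of the original report), forcing a float64 accumulator gives the expected result, which points at float32 accumulation rather than at the data itself:

print(df['A'].astype(np.float64).mean())        # should be ~4.0
print(df['A'].values.mean(dtype=np.float64))    # should be ~4.0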
Issue Description
Hi,
It seems that when using float32, pandas messes up the mean() and var() functions after roughly 34 million rows. I initially suspected rounding errors, but it seems to be something far more fundamental than that.
Please note that this bug:
- is especially nasty since it does not produce a warning or raise an exception, yet returns a statistic that is completely wrong. The consequences for data pipelines and companies can be very large.
- mathematically, it seems that all elements after a certain index (sometimes 2**24, 2**25, ...) are treated as 0 for np.float32 (or NaN for other dtypes). This happens at least for np.mean() and np.var(), but probably for other functions as well (see the quick consistency check after this list).
- may, in fact, be related to NumPy (or another library) rather than pandas.
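A rough consistency check on the numbers above (my own back-of-the-envelope reasoning, not from the original report): the wrong mean is what you would get if the float32 running sum saturated near 2**27, i.e. if only roughly the first 2**25 values (each ~4) contributed:

print(1.917656660079956 * 70_000_000)   # ~134235966, close to 2**27 = 134217728
print(2**25 * 4 / 70_000_000)           # ~1.9174, close to the wrong mean reported above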
In terms of datatype, I managed to reproduce the bug for np.float32 and np.float16 (a short sketch of the underlying precision limit follows this list):
- float64: works OK at least up to 2**28
- float32: OK up to 1.99 * 2**23, starts bugging at 2**24 (considers the last elements as 0)
- float16: OK up to 1.99 * 2**15, starts bugging at 2**16 (considers the last elements as NaN)
- np.int8, np.int16, np.int32, np.int64: work OK at least up to 2**28
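The thresholds above line up with the size of each type's significand. The following small sketch is my illustration, not part of the original report; it does not reproduce the pandas bug itself, only the rounding behaviour that drives it:

import numpy as np

# float32 has a 24-bit significand: adding 1 to 2**24 rounds straight back down
print(np.float32(2**24) + np.float32(1) == np.float32(2**24))   # True

# float16 has an 11-bit significand: the same happens already at 2**11 = 2048
print(np.float16(2**11) + np.float16(1) == np.float16(2**11))   # True

# Consequence: a running sum kept in the input dtype stops growing once it
# reaches that magnitude; here 5000 float16 ones accumulate to only 2048
acc = np.float16(0)
for _ in range(5000):
    acc = acc + np.float16(1)
print(acc)                                                       # 2048.0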
Expected Behavior
In the above example, we should have np.mean(df['A']) returning something around 4.0.
Installed Versions
INSTALLED VERSIONS
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-18-cloud-amd64
Version : #1 SMP Debian 4.19.208-1 (2021-09-29)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.6
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.2.0
Cython : 0.29.30
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.10.0
fastparquet : 0.8.1
gcsfs : 2021.10.0
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.17.4
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Top GitHub Comments
I think this is caused by https://github.com/pydata/bottleneck/issues/379, and this change would mitigate the issue (not fix it)? https://github.com/pydata/bottleneck/pull/407, or a similar change to other methods.
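A practical mitigation on the pandas side (my suggestion, not stated in the comments above) is to switch off the bottleneck-accelerated reductions, so pandas falls back to its NumPy-based implementation, which uses pairwise summation and keeps the error small in this example:

import pandas as pd

# Tell pandas not to route reductions such as mean/sum through bottleneck.
pd.set_option('compute.use_bottleneck', False)

# With this option off, df['A'].mean() in the example above should return ~4.0.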
And it looks like this is a duplicate of #42878.
This is shorter (without pandas):
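The original snippet is not preserved here; what follows is a minimal sketch of what such a pandas-free reproduction could look like (my reconstruction, with `xs` chosen as an assumption: a float32 array of ones just longer than 2**24):

import numpy as np
import bottleneck

xs = np.ones(2**24 + 10, dtype=np.float32)   # hypothetical input, not the commenter's original

print(np.sum(xs))             # 16777226.0 -- NumPy's pairwise summation stays accurate here
print(bottleneck.nansum(xs))  # ~16777216.0 -- bottleneck's running float32 sum saturates at 2**24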
Edit: Yet it could be called a pandas bug. If bottleneck doesn’t want to fix this, pandas can’t use it outside of the area where it works well. Most telling is

bottleneck.nansum(xs)  # 16777216.0

because np.float32(np.float32(16777216.0) + 1.) == 16777216.