BUG: aggregation of np.float16/np.float32 is wrong for big dataset

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd; print(pd.__version__)
import numpy as np; print(np.__version__)

N = 70_000_000
df = pd.DataFrame({'A': np.random.normal(4,1,N).astype(np.float32)})

print(np.mean(df['A'].values)) # Return 4.0000944 <-- Correct
print(np.mean(df['A'])) # Return 1.917656660079956 <-- Wrong !
print(df['A'].mean()) # Return 1.917656660079956 <-- written like this, it looks like a pandas-related bug

Issue Description

Hi,

It seems that when using float32, pandas messes up the mean() or var() functions after about 34 million rows. I suspected rounding errors at first, but it seems to be something far more fundamental.

Please note that this bug:

  • is especially nasty because it produces no warning and raises no exception, yet gives a completely wrong statistic. The consequences for data pipelines and companies can be serious.
  • mathematically, seems to treat all elements after a certain index (sometimes 2**24, 2**25 …) as 0 for np.float32 (or as NaN for other dtypes).
  • happens at least for np.mean() and np.var(), but probably for other functions as well.
  • may in fact be related to NumPy (or another library) and not pandas.

In terms of dtype, I managed to reproduce the bug for np.float32 and np.float16:

  • float64: works OK at least up to 2**28
  • float32: OK up to 1.99 * 2**23, starts failing at 2**24 (treats the last elements as 0)
  • float16: OK up to 1.99 * 2**15, starts failing at 2**16 (treats the last elements as NaN)
  • np.int8, np.int16, np.int32, np.int64: work OK at least up to 2**28
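The 2**24 threshold reported above lines up with the precision of float32 itself. The following is a minimal sketch using only NumPy (it is not part of the original report) showing that at 2**24 neighbouring float32 values are 2 apart, so adding 1.0 is rounded away. It also prints the largest finite float16 value, which sits just under 2**16; whether that is the cause of the float16 behaviour described above is an assumption, not something the report confirms.

```python
import numpy as np

# float32 stores 23 explicit significand bits (24 with the implicit leading bit).
f32 = np.finfo(np.float32)
print(f32.nmant)                        # 23

big = np.float32(2**24)                 # 16777216.0
gap = np.spacing(big)                   # distance to the next representable float32
print(gap)                              # 2.0
print(big + np.float32(1.0) == big)     # True: the +1.0 is rounded away

# Largest finite float16 value, just under 2**16 = 65536.
f16_max = float(np.finfo(np.float16).max)
print(f16_max)                          # 65504.0
```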

Expected Behavior

In the above example, np.mean(df['A']) should return something around 4.0.

Installed Versions

INSTALLED VERSIONS

commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.0-18-cloud-amd64
Version : #1 SMP Debian 4.19.208-1 (2021-09-29)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.6
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.2.0
Cython : 0.29.30
pytest : 7.1.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.2
IPython : 7.28.0
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.10.0
fastparquet : 0.8.1
gcsfs : 2021.10.0
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.17.4
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.25
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 11 (4 by maintainers)

Top GitHub Comments

1 reaction
bluss commented, Jun 16, 2022

I think this is caused by https://github.com/pydata/bottleneck/issues/379 , and this change would mitigate (not fix) the issue: https://github.com/pydata/bottleneck/pull/407 , or a similar change to other methods.

It also looks like this is a duplicate of #42878.
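Until a fix is available upstream, two user-side mitigations are plausible. The sketch below is not from the thread: it uses the documented pandas option `compute.use_bottleneck` and a plain upcast to float64, on a small hypothetical Series `s`. Whether either one avoids the wrong result on a given pandas/bottleneck combination has not been verified here and should be checked against the reproducer above.

```python
import numpy as np
import pandas as pd

# Small hypothetical float32 column, only to show the calls.
s = pd.Series(np.ones(1_000, dtype=np.float32))

# Option 1: temporarily ask pandas not to delegate reductions to bottleneck.
with pd.option_context("compute.use_bottleneck", False):
    mean_without_bottleneck = s.mean()

# Option 2: upcast so the reduction runs on float64 data
# (costs twice the memory of the float32 column).
mean_as_float64 = s.astype(np.float64).mean()

print(mean_without_bottleneck, mean_as_float64)
```

The first option changes which code path computes the reduction; the second keeps the code path but widens the data.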

1 reaction
bluss commented, Jun 15, 2022

This is shorter

import numpy as np
import bottleneck

# np.repeat needs an integer count, so cast the element count to int
N = int(1.5 * (2**24))
xs = np.repeat(1., N).astype(np.float32)
np.mean(xs)  # 1.0
bottleneck.nanmean(xs)  # 0.6666666865348816

(without pandas)

Edit: Yet it could be called a pandas bug. If bottleneck doesn’t want to fix this, pandas can’t use it outside of the area where it works well. Most telling is bottleneck.nansum(xs) # 16777216.0, because np.float32(np.float32(16777216.0) + 1.) == 16777216.
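The stalled sum described in this comment can be imitated without bottleneck. The sketch below is an illustration only, not bottleneck's actual implementation: `running_sum` is a hypothetical helper that adds values one at a time into a single accumulator, starting from 2**24 as if that many ones had already been added. With a float32 accumulator the total never moves; with a float64 accumulator it does. The last lines use the `dtype` argument of NumPy's `mean`, which selects a wider accumulator for a float32 array.

```python
import numpy as np

def running_sum(values, start, dtype):
    """Add values one by one into a single accumulator of the given dtype."""
    acc = dtype(start)
    for v in values:
        acc = dtype(acc + dtype(v))
    return acc

ones = [1.0] * 1000
start = 2**24  # pretend 2**24 ones have already been summed

stuck = running_sum(ones, start, np.float32)
exact = running_sum(ones, start, np.float64)
print(stuck)  # 16777216.0: every +1.0 was rounded away
print(exact)  # 16778216.0

# Asking NumPy for a float64 accumulator on float32 data.
xs = np.ones(1000, dtype=np.float32)
mean64 = xs.mean(dtype=np.float64)
print(mean64)  # 1.0
```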
