BUG: np.mean(pd.Series) != np.mean(pd.Series.values)

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)

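assert isinstance(np.mean(a), float)  # pandas path returns a Python float
assert isinstance(np.mean(a.values), np.float32)  # numpy path keeps float32
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4  # relative difference between the two means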

Problem description

  1. pd.DataFrame.mean/pd.Series.mean/np.mean(pd.Series) returns a Python float instead of a NumPy float. Since np.mean(pd.Series.values) does return a NumPy float, I'm assuming for now that this should be fixed in pandas.
  2. If dtype==np.float32, calling mean on a pandas object gives a significantly different result from calling mean on the underlying NumPy ndarray.

Expected Output

The output of np.mean(a) should be the same as np.mean(a.values).

Additional tests

# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64

h = np.mean(a)
assert isinstance(h, float)
assert h == b

c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as expected

print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6

print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6

d = a.sum() / len(a) 
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32

e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division

# these methods are equivalent
assert d==e

# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c

# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64

g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality

print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9; note this is actually a relative error, despite the label

# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b

Output


errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06

relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14

errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 14 (13 by maintainers)

Top GitHub Comments

krachyon commented, Aug 6, 2021 (3 reactions)

Just to add my two cents here, as I've hit the issue from astropy: I really feel that opportunistically choosing an implementation with different behavior based on which library I happen to have installed is a pretty unpleasant trap, especially when bottleneck can even fall back to numpy under certain conditions (see https://github.com/astropy/astropy/issues/11492 for details). The choice is totally hidden from users, and most likely they won't even think about looking for an issue like this before having grown a bunch of grey hair due to inexplicably inconsistent results. Judging by the issues linked here, that has now happened to at least three people independently. The principle of least surprise for the user here would definitely be to match the behavior of numpy.

To be clear, I would see this mostly as an issue with bottleneck, as it advertises itself as a drop-in replacement for numpy routines without explicit and obvious mention of this discrepancy. So IMHO using bottleneck should be an explicit opt-in for users who really need that last bit of performance and actively decide to sacrifice accuracy for it.
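For readers who want to opt out explicitly, pandas exposes the `compute.use_bottleneck` option. The sketch below assumes, as the comments on this issue suggest, that the discrepancy comes from pandas dispatching to bottleneck; the seed and variable names are illustrative and not taken from the original report.

```python
import numpy as np
import pandas as pd

# Illustrative data in the spirit of the original report (the seed is an assumption).
rng = np.random.default_rng(0)
a = pd.Series(rng.normal(scale=0.1, size=1_000_000).astype(np.float32)) ** 2

# Temporarily disable bottleneck so pandas falls back to its numpy-based reductions.
with pd.option_context("compute.use_bottleneck", False):
    pandas_mean = a.mean()

numpy_mean = a.to_numpy().mean()

# Relative difference between the two means, computed in float64.
rel_diff = abs(1 - float(pandas_mean) / float(numpy_mean))
print(f"relative difference with bottleneck disabled: {rel_diff:.3e}")
```

To change the behavior for a whole session rather than one block, `pd.set_option("compute.use_bottleneck", False)` sets the same option globally. Casting to float64 before reducing, as in `a.astype(np.float64).mean()` from the report, is another way to sidestep float32 accumulation.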

sebasv commented, Aug 5, 2021 (1 reaction)

Thank you for linking. I wasn't aware of those issues, and I am a bit surprised they got closed without a pandas-side solution. There, the question was raised of what pandas should do about a problem in a third-party library. Shouldn't the answer be "do not use that library"? Their routines produce arbitrarily large errors, and I don't see how that can be defended. If I choose to use f32 precision on values around 1, I expect answers to be precise up to ~1e-6, and so, I imagine, do most other pandas users.
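To illustrate how accumulation order alone can produce errors of this size in float32, here is a minimal pure-Python sketch. It assumes the discrepancy stems from summing sequentially into a single float32 accumulator, compared with a pairwise (divide-and-conquer) sum; neither function is the actual bottleneck or numpy implementation. The helper names, the block size, and the constant test data are all hypothetical.

```python
import math
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def naive_sum_f32(values):
    """Sequential sum with a single float32 accumulator."""
    total = 0.0
    for v in values:
        total = f32(total + v)  # round to float32 after every addition
    return total


def pairwise_sum_f32(values, lo=0, hi=None, block=128):
    """Pairwise sum in float32: split the range in half and add the partial sums."""
    if hi is None:
        hi = len(values)
    if hi - lo <= block:
        # Small ranges are summed sequentially (block size is an assumption).
        total = 0.0
        for i in range(lo, hi):
            total = f32(total + values[i])
        return total
    mid = (lo + hi) // 2
    left = pairwise_sum_f32(values, lo, mid, block)
    right = pairwise_sum_f32(values, mid, hi, block)
    return f32(left + right)


# Constant data keeps the example deterministic; 0.01 matches the mean in the report.
values = [f32(0.01)] * 200_000
exact = math.fsum(values)  # correctly rounded float64 reference

naive_err = abs(1 - naive_sum_f32(values) / exact)
pairwise_err = abs(1 - pairwise_sum_f32(values) / exact)
print(f"naive float32 sum, relative error:    {naive_err:.3e}")
print(f"pairwise float32 sum, relative error: {pairwise_err:.3e}")
```

Once the running total is much larger than each new term, every sequential addition loses low-order bits of that term, and the loss compounds with the number of elements. Pairwise summation keeps the operands of each addition closer in magnitude, so its error grows far more slowly.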


