BUG: np.mean(pd.Series) != np.mean(pd.Series.values)
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import pandas as pd
import numpy as np
a = pd.Series(np.random.normal(scale=0.1, size=(1_000_000,)).astype(np.float32)).pow(2)
assert isinstance(np.mean(a), float)
assert isinstance(np.mean(a.values), np.float32)
assert abs(1 - np.mean(a)/np.mean(a.values)) > 4e-4
Problem description
pd.DataFrame.mean / pd.Series.mean / np.mean(pd.Series) outputs a Python float instead of a numpy float. Since np.mean(pd.Series.values) does return an np float, I'm assuming for now that this should be fixed in pandas. More importantly, if dtype==np.float32, then calling mean on a pandas object gives a significantly different result vs calling mean on the underlying numpy ndarray.
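One plausible mechanism for a gap of this size is the order in which float32 values are accumulated. The sketch below is only an illustration under that assumption, not a trace of what pandas or bottleneck actually execute: it compares a strictly sequential float32 running sum (obtained from np.cumsum) with np.mean, which uses pairwise summation on contiguous arrays. The seed and variable names are chosen here for reproducibility and are not part of the original report.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility
x = rng.normal(scale=0.1, size=1_000_000).astype(np.float32) ** 2

# float64 reference value to measure both float32 results against
reference = x.astype(np.float64).mean()

# Sequential accumulation carried out entirely in float32:
# np.cumsum adds one element at a time, so its last entry is the naive running sum.
naive_mean = np.cumsum(x, dtype=np.float32)[-1] / np.float32(x.size)

# np.mean on a contiguous float32 array uses pairwise summation internally.
pairwise_mean = x.mean()

naive_rel_err = abs(1.0 - float(naive_mean) / reference)
pairwise_rel_err = abs(1.0 - float(pairwise_mean) / reference)

print(f"naive float32 accumulation, relative error: {naive_rel_err:.3e}")
print(f"pairwise float32 summation, relative error: {pairwise_rel_err:.3e}")
```

If the sequential sum shows a relative error far above the pairwise one, that would be consistent with the discrepancy reported here coming from the summation strategy rather than from the float32 data itself.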
Expected Output
The output of np.mean(a) should be the same as np.mean(a.values).
additional tests
# both b and c ~1e-2
b = a.mean() # the pandas impl of mean
assert isinstance(b, float) # PYTHON float, not numpy float? Ergo implicit f64
h = np.mean(a)
assert isinstance(h, float)
assert h == b
c = a.values.mean() # the numpy impl of mean
assert isinstance(c, np.float32) # as expected
print('\nerrors between pandas mean and numpy mean')
print(f'relative error: {abs(1-b/c):.3e}') # ~ 5e-4
print(f'absolute error: {abs(b -c):.3e}') # ~ 5e-6
print(f'relative error after casting: {abs(1-np.float32(b)/c):.3e}') # ~ 5e-4
print(f'absolute error after casting: {abs(np.float32(b) -c):.3e}') # ~ 5e-6
d = a.sum() / len(a)
assert isinstance(d, np.float64) # expected, because division. Note `sum` returns an np.float32
e = a.values.sum() / len(a)
assert isinstance(e, np.float64) # expected, because division
# these methods are equivalent
assert d==e
# and up to f32 precision equal to the numpy impl
assert d.astype(np.float32) == c
# the cherry on the cake
f = a.astype(np.float64).mean()
assert isinstance(f, float) # still not ideal, should be np.float64
g = a.astype(np.float64).values.mean()
print('\nrelative error between pandas f64 mean and numpy f64 mean')
print(f'relative error numpy f64/pandas f64: {abs(1-g/f):.3e}') # ~ 1e-14 -- 1e-16, not bad but I would have expected equality
print('\nerrors between pandas f64 mean and numpy/pandas f32 mean')
print(f'relative error pandas f32/pandas f64: {abs(1-b/f):.3e}') # ~ 5e-4
print(f'absolute error numpy f32/pandas f64: {abs(1-c/f):.3e}') # ~ 1e-7 -- 1e-9
# finally...
h = np.mean(a)
assert isinstance(h, float)
assert h == b
output
errors between pandas mean and numpy mean
relative error: 5.210e-04
absolute error: 5.204e-06
relative error after casting: 5.210e-04
absolute error after casting: 5.204e-06
relative error between pandas f64 mean and numpy f64 mean
relative error numpy f64/pandas f64: 1.066e-14
errors between pandas f64 mean and numpy/pandas f32 mean
relative error pandas f32/pandas f64: 5.214e-04
absolute error numpy f32/pandas f64: 2.399e-07
Output of pd.show_versions()
INSTALLED VERSIONS
commit : c7f7443c1bad8262358114d5e88cd9c8a308e8aa
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-80-generic
Version : #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 52.0.0.post20210125
Cython : 0.29.23
pytest : 6.2.3
hypothesis : None
sphinx : 4.0.1
blosc : None
feather : None
xlsxwriter : 1.3.8
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.0
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 0.9.0
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.51.2
Issue Analytics
- Created 2 years ago
- Comments:14 (13 by maintainers)
Just to add my two cents here, as I've hit the issue from astropy: I really feel that opportunistically choosing an implementation with different behavior based on which library I happen to have installed is a pretty unpleasant trap, especially when bottleneck can even fall back to numpy under certain conditions (see https://github.com/astropy/astropy/issues/11492 for details). The choice is totally hidden from users, and most likely they won't even think about looking for an issue like this before having grown a bunch of grey hair over inexplicably inconsistent results. Judging by the issues linked here, that has happened to at least three people independently now. The principle of least surprise for the user here would definitely be to match the behavior of numpy.
To be clear, I would see this mostly as an issue with bottleneck, as it advertises itself as a drop-in replacement for numpy routines without explicit and obvious mention of this discrepancy. So IMHO using bottleneck should be an explicit opt-in for users who really need that last bit of performance and actively decide to sacrifice accuracy for it.
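For anyone who wants to make that choice explicit today, pandas exposes the compute.use_bottleneck option. The following is a minimal sketch, assuming that turning the option off is enough to route Series.mean through the numpy-based path. Whether bottleneck is used for mean at all depends on the installed pandas and bottleneck versions, so treat the comparison as a check to run locally rather than a guaranteed outcome.

```python
import numpy as np
import pandas as pd

# Explicitly opt out of bottleneck-accelerated reductions.
# The option can be set whether or not bottleneck is installed.
pd.set_option("compute.use_bottleneck", False)

rng = np.random.default_rng(0)  # fixed seed, an assumption made for reproducibility
a = pd.Series(rng.normal(scale=0.1, size=1_000_000).astype(np.float32)).pow(2)

pandas_mean = a.mean()        # pandas reduction with bottleneck disabled
numpy_mean = a.values.mean()  # numpy reduction on the underlying ndarray

rel_diff = abs(1.0 - float(pandas_mean) / float(numpy_mean))
print(f"relative difference pandas vs numpy: {rel_diff:.3e}")
```

The same option can be applied temporarily with pd.option_context("compute.use_bottleneck", False) when only a single computation needs the numpy behavior.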
Thank you for linking. I wasn't aware of those issues, and I am a bit surprised they got closed without a pandas-side solution. The question raised there was what pandas should do about a problem in a third-party library. Shouldn't the answer be "do not use that library"? Their routines produce arbitrarily large errors, and I don't see how that can be defended. If I choose to use f32 precision on values around 1, I expect answers to be precise up to ~1e-6, and so, I imagine, do most other pandas users.
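The ~1e-6 expectation can be related to the float32 machine epsilon, and numpy also allows requesting a float64 accumulator for cases where float32 storage is wanted but extra headroom in the reduction is not a problem. This is a small sketch with an artificial constant array (the value 0.01 and the array size are arbitrary choices made here, not taken from the report):

```python
import numpy as np

# Spacing between 1.0 and the next representable float32 (2**-23, about 1.19e-7).
eps32 = np.finfo(np.float32).eps
print(f"float32 machine epsilon: {eps32:.3e}")

# Data stays in float32, but the reduction accumulates in float64.
x = np.full(1_000_000, 0.01, dtype=np.float32)
mean64 = np.mean(x, dtype=np.float64)

# Every element equals float32(0.01), so the mean should reproduce that value.
expected = float(np.float32(0.01))
print(f"deviation from expected mean: {abs(mean64 - expected):.3e}")
```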