sum() after groupby returns different value compared to regular sum()
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
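import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(253, 2) * 254, columns=['a', 'b'])
df['type'] = 'test'
# Sum of a * b computed through a groupby on a single-valued column
sum_mult_v1 = df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].sum().iloc[0]
# Same sum computed directly on the columns, without a groupby
sum_mult_v2 = (df['a'] * df['b']).sum()
print(sum_mult_v1)
print(sum_mult_v2)
print(sum_mult_v1 == sum_mult_v2)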
Problem description
The output of the code is:
4010049.3807103755
4010049.3807103736
False
For some reason, the summation of values after groupby is different from the same operation done without a groupby. I understand that there is no point in grouping by a column that only has one value, but I wonder if there is something off with the summation function after groupby?
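One plausible explanation, offered as an assumption rather than something stated in the report, is that the two code paths add the same values in a different order. Floating-point addition is not associative, so different accumulation orders can round differently in the last few digits. The sketch below uses only NumPy and the standard library on the same random data: a plain left-to-right loop, NumPy's own reduction (which its documentation describes as often using pairwise summation), and `math.fsum` as a correctly rounded reference. It does not reproduce pandas' grouped sum; it only illustrates the effect.

```python
import math
import numpy as np

np.random.seed(0)
data = np.random.rand(253, 2) * 254
products = data[:, 0] * data[:, 1]

# Plain left-to-right accumulation, one element at a time.
sequential = 0.0
for value in products:
    sequential += value

# NumPy's reduction; may group the additions differently (pairwise summation).
pairwise = products.sum()

# math.fsum tracks partial sums and returns a correctly rounded total.
reference = math.fsum(products)

print(sequential)
print(pairwise)
print(reference)
# Any differences show up only in the last few significant digits.
print(abs(sequential - pairwise), abs(pairwise - reference))
```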
Expected Output
The expected output is the same number from both computations:
4010049.3807103736
4010049.3807103736
True
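Since the two reported values differ only around the sixteenth significant digit, a tolerance-based comparison is usually the safer check than `==` for floating-point sums. A minimal sketch using the two numbers printed above and the standard library's `math.isclose`; the tolerance `1e-12` is an arbitrary choice for illustration:

```python
import math

sum_mult_v1 = 4010049.3807103755  # groupby result reported above
sum_mult_v2 = 4010049.3807103736  # plain sum reported above

# Exact equality fails because the values differ by a few units in the last place.
print(sum_mult_v1 == sum_mult_v2)

# A relative tolerance treats them as the same number.
print(math.isclose(sum_mult_v1, sum_mult_v2, rel_tol=1e-12))

# Size of the absolute difference.
print(abs(sum_mult_v1 - sum_mult_v2))
```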
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : 0.49.1
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 2
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Actually not, as pandas is “too smart” here: when it sees np.sum, it will still use the optimized Cython grouped sum instead. But if you put the np.sum in a lambda, then pandas won’t optimize it and actually calls np.sum, and then you see that the result is the same as summing the full columns.

Oh I see, this is awesome! Thank you for the insights, I will close the issue!
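The lambda workaround described in the first comment can be sketched roughly as follows. This is an illustration under assumptions, not the maintainer's original snippet: the names `sum_builtin`, `sum_lambda`, and `sum_plain` are made up here, and whether the built-in grouped sum differs in the last digits may depend on the pandas version.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(253, 2) * 254, columns=["a", "b"])
df["type"] = "test"
df["mult"] = df["a"] * df["b"]

grouped = df.groupby("type")["mult"]

# Built-in grouped sum: handled by pandas' own optimized implementation.
sum_builtin = grouped.sum().iloc[0]

# Wrapping np.sum in a lambda, as suggested in the comment, so that
# pandas calls the function on each group instead of substituting its own sum.
sum_lambda = grouped.agg(lambda s: np.sum(s)).iloc[0]

# Reference: summing the whole column without a groupby.
sum_plain = df["mult"].sum()

print(sum_builtin)
print(sum_lambda)
print(sum_plain)
```

If the lambda result matches the plain sum while the built-in grouped sum does not, that is consistent with the explanation given in the comment.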