sum() after groupby returns different value compared to regular sum()

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

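import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(253, 2) * 254, columns=['a', 'b'])
df['type'] = 'test'  # a single group, so both sums cover the same rows

# Sum of a * b computed through groupby, then directly on the full column
sum_mult_v1 = df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].sum()[0]
sum_mult_v2 = (df['a'] * df['b']).sum()
print(sum_mult_v1)
print(sum_mult_v2)
print(sum_mult_v1 == sum_mult_v2)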

Problem description

The output of the code is:

4010049.3807103755
4010049.3807103736
False

For some reason, the summation of values after groupby is different from the same operation done without a groupby. I understand that there is no point in grouping by a column that only has one value, but I wonder if there is something off with the summation function after groupby?
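A difference confined to the last few digits is typical of floating-point sums computed in a different order, because floating-point addition is not associative. The snippet below is a minimal illustration of that general effect using plain Python floats. It does not reproduce pandas internals, and the variable names are made up for this example.

```python
import math

# Floating-point addition is not associative:
# grouping the same three numbers differently changes the rounding.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False

# A tolerance-based comparison treats the two results as equal.
print(math.isclose(left_to_right, right_to_left))  # True
```

Two summation routines that visit the same values in a different order can therefore disagree in the last digits without either one being wrong.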

Expected Output

The expected output is the same number from both computations:

4010049.3807103736
4010049.3807103736
True

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader : None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : 0.49.1

Top GitHub Comments

jorisvandenbossche commented, May 24, 2020

> I assume in this case it would use the np.sum, and the values should be the same,

Actually not, as pandas is “too smart” here: when it sees np.sum, it still uses the optimized Cython grouped sum instead.

But if you wrap np.sum in a lambda, pandas won’t optimize it and will actually call np.sum. The result then matches the sum over the full column:

In [2]: df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].sum()[0] 
Out[2]: 4010049.3807103755

In [3]: (df['a'] * df['b']).sum() 
Out[3]: 4010049.3807103736

In [4]: df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].agg(lambda x: np.sum(x))[0]
Out[4]: 4010049.3807103736
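Since the three results agree to roughly 15 significant digits, a tolerance-based comparison is usually more appropriate than `==` for sums like these. The sketch below reuses the data from the original report and compares the results with `np.isclose`. The variable names are chosen for this example, `.iloc[0]` replaces the positional `[0]` lookup, and the exact digits printed may vary across pandas and NumPy versions.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(253, 2) * 254, columns=['a', 'b'])
df['type'] = 'test'

mult = df.assign(mult=lambda x: x.a * x.b)

# Built-in grouped sum (pandas' own optimized code path)
grouped_sum = mult.groupby('type')['mult'].sum().iloc[0]
# np.sum wrapped in a lambda, so pandas calls it as given
lambda_sum = mult.groupby('type')['mult'].agg(lambda x: np.sum(x)).iloc[0]
# Plain sum over the whole column, no groupby
plain_sum = mult['mult'].sum()

# Compare with a tolerance instead of exact equality
print(np.isclose(grouped_sum, plain_sum))
print(np.isclose(lambda_sum, plain_sum))
```

`np.isclose` uses a default relative tolerance of 1e-05, far looser than the difference reported in this thread, so pass a tighter `rtol` if stricter agreement is needed.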

Denisolt commented, May 24, 2020

Oh I see, this is awesome! Thank you for the insights, I will close the issue!