sum() after groupby returns different value compared to regular sum()

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

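import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(253, 2) * 254, columns=['a', 'b'])
df['type'] = 'test'  # a single group, so both sums cover the same rows

# Sum of a * b computed through groupby
sum_mult_v1 = df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].sum().iloc[0]
# Sum of a * b computed directly on the column
sum_mult_v2 = (df['a'] * df['b']).sum()
print(sum_mult_v1)
print(sum_mult_v2)
print(sum_mult_v1 == sum_mult_v2)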

Problem description

The output of the code is:

4010049.3807103755
4010049.3807103736
False

For some reason, the summation of values after groupby is different from the same operation done without a groupby. I understand that there is no point in grouping by a column that only has one value, but I wonder if there is something off with the summation function after groupby?

Expected Output

The expected output is two identical numbers:

4010049.3807103736
4010049.3807103736
True
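The two sums above differ only in the last few digits, which is the usual signature of floating-point rounding and not of missing or extra rows. As a minimal sketch (the variable names are illustrative, and the literals are copied from the output reported in this issue), a tolerance-based comparison treats the two results as equal where `==` does not:

```python
import math

# The two sums reported in this issue, copied as literals.
groupby_sum = 4010049.3807103755
plain_sum = 4010049.3807103736

# Exact comparison is sensitive to the last bits of the result.
print(groupby_sum == plain_sum)

# math.isclose compares with a relative tolerance (default rel_tol=1e-09).
print(math.isclose(groupby_sum, plain_sum))

# Relative size of the discrepancy.
print(abs(groupby_sum - plain_sum) / plain_sum)
```

For whole arrays or columns, `numpy.isclose` and `numpy.allclose` offer the same kind of comparison.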

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.17
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : 1.2.8
numba : 0.49.1

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
jorisvandenbossche commented, May 24, 2020

“I assume in this case it would use the np.sum, and the values should be the same”

Actually not, as pandas is “too smart” here, and when it sees np.sum, it will still use the optimized cython grouped-sum instead.

But if you put the np.sum in a lambda, then pandas won’t optimize it and will actually call np.sum, and then you see that the result is the same as summing the full column:

In [2]: df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].sum()[0] 
Out[2]: 4010049.3807103755

In [3]: (df['a'] * df['b']).sum() 
Out[3]: 4010049.3807103736

In [4]: df.assign(mult=(lambda x: x.a * x.b)).groupby('type')['mult'].agg(lambda x: np.sum(x))[0]
Out[4]: 4010049.3807103736

0 reactions
Denisolt commented, May 24, 2020

Oh I see, this is awesome! Thank you for the insights, I will close the issue!
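The thread says that the two code paths sum the same values with different implementations, but it does not say how each one accumulates. The sketch below is an illustration only, under stated assumptions: `sequential_sum` and `pairwise_sum` are hypothetical helpers written for this example, not pandas or NumPy internals, and the data is generated with Python's `random` module instead of the DataFrame from the issue. It shows that floating-point addition is not associative, so adding the same numbers in a different order can change the last digits of the total. (NumPy documents that `np.sum` often uses pairwise summation; the strategy used by the cython grouped sum is assumed here, not confirmed by the thread.)

```python
import math
import random


def sequential_sum(values):
    """Add the values left to right into a single accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values):
    """Split the list in halves, sum each half recursively, then add the halves."""
    n = len(values)
    if n <= 8:
        return sequential_sum(values)
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Floating-point addition is not associative, even for three numbers.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Hypothetical data shaped like the issue: 253 products of two values in [0, 254).
random.seed(0)
products = [random.random() * 254 * random.random() * 254 for _ in range(253)]

seq = sequential_sum(products)
pair = pairwise_sum(products)
ref = math.fsum(products)  # correctly rounded reference sum

print(seq, pair, ref)
# The strategies may disagree in the last digits but agree within a tight tolerance.
print(math.isclose(seq, pair, rel_tol=1e-12))
```

Whether the two strategies differ for a given input depends on the data; the point is only that neither result is wrong, and that both stay within rounding error of the reference sum.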
