REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Adapted from “Automatic exclusion of nuisance columns” in the User Guide “Group by” docs:

from decimal import Decimal

import pandas as pd


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

print('\ndf_dec.groupby(["id"])[["dec_column"]].sum()')
print('According to docs this should sum correctly')
print('It works for pandas 1.2.x but generates an empty dataframe for 1.3.x')
print(df_dec.groupby(["id"])[["dec_column"]].sum())

print('\ndf_dec.groupby(["id"])[["int_column", "dec_column"]].sum()')
print('This drops `dec_column` as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"])[["int_column", "dec_column"]].sum())

print('\ndf_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})')
print('This aggregates everything correctly as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"}))

Problem description

The User Guide “Group by” docs provide a code example that shows when nuisance columns will be excluded from aggregation. According to these docs, the case:

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce a valid aggregation, but for pandas >= 1.3.0 it results in an empty dataframe.
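
One way to check which behavior an installed pandas exhibits is to compare the requested columns with the columns that survive the aggregation. The following is a minimal sketch rather than an official diagnostic; the helper name `dropped_columns` is hypothetical and not part of pandas.

```python
from decimal import Decimal

import pandas as pd


def dropped_columns(frame, by, columns):
    """Return the requested columns that are missing from the groupby sum.

    Hypothetical helper for illustration only; it is not a pandas API.
    """
    result = frame.groupby(by)[columns].sum()
    return [col for col in columns if col not in result.columns]


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

# An empty list means every requested column was aggregated.
missing = dropped_columns(df_dec, ["id"], ["dec_column"])
print(pd.__version__, "dropped:", missing)
```

Going by the report above, an affected version (pandas >= 1.3.0) would list `dec_column` as dropped, while 1.2.x would print an empty list.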

The impact of the regression can even be seen in the published docs. If we look at an archive.org 2021-02-25 snapshot of the “Automatic exclusion of nuisance columns” section, we can see that the example produces correct output (see Out[170] in the code example): https://web.archive.org/web/20210225195813/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

By contrast, the 2021-08-24 snapshot displays an empty dataframe for the Out[170] example: https://web.archive.org/web/20210824151314/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

However, note that the docs in both cases indicate that the example should produce a correct aggregation.

Expected Output

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce the result:

   dec_column
id           
1        0.75
2        0.55

For pandas 1.2.x it does so as expected. For pandas >= 1.3.0 it instead produces the incorrect result:

Empty DataFrame
Columns: []
Index: [1, 2]
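
Until the regression is fixed, the dict-based `agg` call from the code sample may serve as a workaround, since the report states that it aggregates correctly on both 1.2.x and 1.3.x. A minimal sketch, assuming that behavior also holds on your installed version:

```python
from decimal import Decimal

import pandas as pd

df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

# Name the column and the aggregation explicitly instead of relying on
# column selection followed by .sum(); this is the form the report
# describes as working on both 1.2.x and 1.3.x.
result = df_dec.groupby(["id"]).agg({"dec_column": "sum"})
print(result)
```

The values remain `Decimal` objects, so the column keeps its object dtype and no precision is lost to floating-point conversion.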

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

simonjayhawkins commented, Sep 3, 2021

I’ll leave this open til the docs are fixed.

simonjayhawkins commented, Sep 3, 2021

However I think highlighting the discrepancy in the User Guide “Group by” docs may be new, and should be resolved as part of that work?

Thanks for spotting that.

The docs are automatically generated so once the regression is fixed, the doc issue should also be fixed.

