REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

Adapted from “Automatic exclusion of nuisance columns” in the User Guide “Group by” docs:

from decimal import Decimal

import pandas as pd


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

print('\ndf_dec.groupby(["id"])[["dec_column"]].sum()')
print('According to docs this should sum correctly')
print('It works for pandas 1.2.x but generates an empty dataframe for 1.3.x')
print(df_dec.groupby(["id"])[["dec_column"]].sum())

print('\ndf_dec.groupby(["id"])[["int_column", "dec_column"]].sum()')
print('This drops `dec_column` as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"])[["int_column", "dec_column"]].sum())

print('\ndf_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})')
print('This aggregates everything correctly as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"}))

Problem description

The User Guide “Group by” docs provide a code example showing when nuisance columns are excluded from aggregation. According to those docs, the case

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce a valid aggregation, but for pandas >= 1.3.0 it results in an empty dataframe.

The impact of the regression can even be seen in the published docs. If we look at an archive.org 2021-02-25 snapshot of the “Automatic exclusion of nuisance columns” section, we can see that the example produces correct output (see Out[170] in the code example): https://web.archive.org/web/20210225195813/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

By contrast, the 2021-08-24 snapshot displays an empty dataframe for the Out[170] example: https://web.archive.org/web/20210824151314/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

However, note that the docs in both cases indicate that the example should produce a correct aggregation.
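
The code sample above reports that the dict form of agg aggregates dec_column correctly on both 1.2.x and 1.3.x. A minimal sketch of an interim workaround built on that observation follows; the helper name sum_decimal_column and its parameters are hypothetical, introduced only for illustration, and are not part of pandas or of the original report.

```python
from decimal import Decimal

import pandas as pd

# Same data as in the report's code sample.
df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)


def sum_decimal_column(df, key="id", column="dec_column"):
    """Hypothetical helper: sum one column per group via the dict form of agg.

    Assumption: this relies on the report's observation that
    agg({...: "sum"}) is unaffected by the 1.3 regression.
    """
    return df.groupby([key]).agg({column: "sum"})


workaround = sum_decimal_column(df_dec)
print(workaround)
```

This keeps the column explicit in the aggregation spec instead of relying on column selection followed by .sum(), which is the path the report identifies as broken.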

Expected Output

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce the result:

   dec_column
id           
1        0.75
2        0.55

For pandas 1.2.x it does so as expected. For pandas >= 1.3.0 it instead produces the incorrect output:

Empty DataFrame
Columns: []
Index: [1, 2]
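
To check which of the two outputs an installed pandas produces, the comparison can be reduced to whether dec_column survives the aggregation. This is a rough sketch, not an official pandas test; the function name selection_sum_keeps_column is hypothetical, and the result depends on the pandas version in use.

```python
from decimal import Decimal

import pandas as pd


def selection_sum_keeps_column(df, key="id", column="dec_column"):
    """Hypothetical check: True if [[column]].sum() keeps the column.

    Per the report, this is expected to be True on pandas 1.2.x and
    False on the affected 1.3.x releases (empty DataFrame).
    """
    result = df.groupby([key])[[column]].sum()
    return column in result.columns


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

keeps_column = selection_sum_keeps_column(df_dec)
print(pd.__version__, "keeps dec_column:", keeps_column)
```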

Issue Analytics

Created 2 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction

simonjayhawkins commented, Sep 3, 2021

I’ll leave this open til the docs are fixed.

1 reaction

simonjayhawkins commented, Sep 3, 2021

However I think highlighting the discrepancy in the User Guide “Group by” docs may be new, and should be resolved as part of that work?

Thanks for spotting that.

The docs are automatically generated so once the regression is fixed, the doc issue should also be fixed.
