Dropping nuisance columns in groupby is a nuisance


Code Sample, a copy-pastable example if possible

import pandas as pd
from decimal import Decimal
df = pd.DataFrame({'id': [1], 'x': [1], 'y': [Decimal(1)]})
df.groupby('id')[['x', 'y']].sum()

#     x
# id   
# 1   1

Problem description

I unknowingly ran into the automatic dropping of nuisance columns when running the above code. While I see how this can be a useful feature, it’s a nuisance that it happens silently and that I can’t disable it. When a groupby is done on explicitly selected columns, groupby(...)[COLS], I feel it should not drop any columns and should let whatever errors occur be raised. I also think a warning could be added, and/or an option to disable the feature.

Expected Output

#     x  y
# id   
# 1   1  1
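
Until the library raises or warns on its own, one way to catch a silent drop is to compare the result's columns against the ones that were requested. The following is only a minimal sketch: `require_columns` is a hypothetical helper, not part of pandas, and `numeric_only=True` is passed purely to reproduce the drop on pandas versions that no longer drop columns by default.

```python
import pandas as pd
from decimal import Decimal


def require_columns(result, requested):
    """Hypothetical helper: raise if an aggregation result lost a requested column."""
    missing = [col for col in requested if col not in result.columns]
    if missing:
        raise KeyError(f"columns dropped during aggregation: {missing}")
    return result


df = pd.DataFrame({'id': [1], 'x': [1], 'y': [Decimal(1)]})
cols = ['x', 'y']

# numeric_only=True is used here only to reproduce the drop:
# the object-dtype Decimal column 'y' is excluded from the result.
dropped = df.groupby('id')[cols].sum(numeric_only=True)

try:
    require_columns(dropped, cols)
    drop_detected = False
except KeyError:
    drop_detected = True
```

Wrapping the aggregation this way turns the silent drop into an explicit error at the call site, which is the behavior the report asks for when columns are selected explicitly.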

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.10.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.28.1
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
jchia commented, Oct 31, 2018

This issue forms a nice pair with #17382. When your mean aggregation involves a timedelta column, the timedelta column silently disappears. This behavior is surprising to users unaware of the limitations of timedelta.

0 reactions
ghost711 commented, Dec 5, 2021

Correct me if I’m wrong, but I thought the OP’s concern was that columns shouldn’t be dropped if they’re explicitly specified.

Otherwise, it seems to me that the automatic dropping of nuisance columns is something that most people would want by default, with extra typing required to turn it off, not to turn it on.

At an IPython prompt, for instance, where I used to be able to just type df.sum(), we now have to type df.sum(numeric_only=True) (long enough to make me question whether I really want the answer badly enough to type it).

Regardless, almost no one will ever want their string columns summed for instance, so it seems like the default should be to drop them, with the rare person that actually wants that behavior able to specify numeric_only=False.

This seems to me to be just like how NaNs are silently ignored when summing or similar, without throwing errors or warnings.
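
For reference, the two behaviors compared in this comment can be sketched side by side. This is an illustrative example on made-up data, not output from the issue: `numeric_only=True` opts in to dropping non-numeric columns, `select_dtypes` does the same selection explicitly, and `skipna` is the separate switch that controls whether NaNs are ignored.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one numeric and one string column.
df = pd.DataFrame({'n': [1, 2], 's': ['a', 'b']})

# Opt in to dropping non-numeric columns at call time.
numeric_sum = df.sum(numeric_only=True)

# Equivalent explicit form: select the numeric columns first, then reduce.
selected_sum = df.select_dtypes(include='number').sum()

# NaN handling is a separate switch: skipna defaults to True.
values = pd.Series([1.0, np.nan, 2.0])
skipped = values.sum()             # NaN is ignored
strict = values.sum(skipna=False)  # NaN propagates to the result
```

The difference the original report points at is that `skipna` is visible and can be turned off, whereas the column drop in pandas 0.22 happened with no warning and no opt-out.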


