Dropping nuisance columns in groupby is a nuisance
See original GitHub issueCode Sample, a copy-pastable example if possible
import pandas as pd
from decimal import Decimal
df = pd.DataFrame({'id': [1], 'x': [1], 'y': [Decimal(1)]})
df.groupby('id')[['x', 'y']].sum()
# x
# id
# 1 1
Problem description
I unknowingly encountered the feature described here when running the above code. While I see how this can be a useful feature, it’s a nuisance not knowing that it happened and that I can’t disable it. I feel that in the case of doing a groupby on explicitly selected columns groupby(...)[COLS]
, it should not drop any columns and let whatever errors that occur raise. I also think that a warning could be added and/or an option to disable the feature.
Expected Output
# x y
# id
# 1 1 1
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.6.2.final.0 python-bits: 64 OS: Linux OS-release: 3.10.0-327.10.1.el7.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: None LOCALE: None.None pandas: 0.22.0 pytest: 3.2.1 pip: 9.0.1 setuptools: 36.4.0 Cython: 0.28.1 numpy: 1.13.1 scipy: None pyarrow: None xarray: None IPython: 6.1.0 sphinx: None patsy: None dateutil: 2.6.1 pytz: 2017.2 blosc: None bottleneck: None tables: None numexpr: None feather: None matplotlib: None openpyxl: None xlrd: None xlwt: None xlsxwriter: None lxml: None bs4: None html5lib: 0.9999999 sqlalchemy: 1.1.13 pymysql: None psycopg2: 2.7.1 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: None pandas_gbq: None pandas_datareader: None
Issue Analytics
- State:
- Created 5 years ago
- Reactions:4
- Comments:6 (2 by maintainers)
Top GitHub Comments
This issue forms a nice pair with #17382. When your mean aggregation involves a timedelta column, the timedelta column silently disappears. This behavior is surprising to users unaware of the limitations of timedelta.
Correct me if I’m wrong, but I thought the OP’s concern was that columns shouldn’t be dropped if they’re explicitly specified.
Otherwise, it seems to me that the automatic dropping of nuisance columns is something that most people would want by default, with extra typing required to turn it off, not to turn it on.
At an iPython prompt for instance, where I used to be able to just type
df.sum()
, we now have to typedf.sum(numeric_only=True)
(long enough to make me question if I really want the answer bad enough to type it).Regardless, almost no one will ever want their string columns summed for instance, so it seems like the default should be to drop them, with the rare person that actually wants that behavior able to specify
numeric_only=False
.This seems to me to be just like how NaNs are silently ignored when summing or similar, without throwing errors or warnings.