groupby with categorical type returns all combinations
Code Sample, a copy-pastable example if possible
import pandas as pd
df = pd.DataFrame({'a': ['x','x','y'], 'b': [0,1,0], 'c': [7,8,9]})
print(df.groupby(['a','b']).mean().reset_index())
df['a'] = df['a'].astype('category')
print(df.groupby(['a','b']).mean().reset_index())
Returns two different results:
a b c
0 x 0 7
1 x 1 8
2 y 0 9
a b c
0 x 0 7.0
1 x 1 8.0
2 y 0 9.0
3 y 1 NaN
Problem description
Performing a groupby with a categorical type returns all combinations of the groupby columns, including ones that never occur in the data. This is a problem in my actual application, as it results in a massive dataframe that is mostly filled with NaNs. I would also prefer not to move off of the category dtype, since it provides necessary memory savings.
Expected Output
a b c
0 x 0 7
1 x 1 8
2 y 0 9
a b c
0 x 0 7
1 x 1 8
2 y 0 9
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Reactions: 5
- Comments: 17 (7 by maintainers)
Top GitHub Comments
Oh, the observed parameter in .groupby! Awesome, thanks!
I would prefer it to default to True and also work with any grouper.
This is exactly what observed does.
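For anyone landing here, a minimal sketch of what the comments above refer to, reusing the frame from the code sample. It assumes a pandas version that has the observed keyword on groupby (as far as I know it arrived in 0.23, so the 0.20.3 install reported above would not have it). The variable names only_seen and all_combos are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# observed=True: keep only the (a, b) pairs that actually occur in the data,
# while 'a' stays a memory-efficient category column.
only_seen = df.groupby(['a', 'b'], observed=True).mean().reset_index()

# observed=False: request the behaviour reported in this issue, where
# unobserved category combinations also get a row.
all_combos = df.groupby(['a', 'b'], observed=False).mean().reset_index()

print(only_seen)
print(all_combos)
```

Passing observed explicitly, rather than relying on the default, keeps the result stable if the default differs between pandas versions.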