groupby with categorical type returns all combinations

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': ['x','x','y'], 'b': [0,1,0], 'c': [7,8,9]})
print(df.groupby(['a','b']).mean().reset_index())
df['a'] = df['a'].astype('category')
print(df.groupby(['a','b']).mean().reset_index())

Returns two different results:

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b    c
0  x  0  7.0
1  x  1  8.0
2  y  0  9.0
3  y  1  NaN

Problem description

Performing a groupby on a categorical column returns every combination of the groupby column values, including combinations that never occur in the data. This is a problem in my actual application, as it results in a massive dataframe that is mostly filled with NaNs. I would also prefer not to move off the category dtype, since it provides necessary memory savings.

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

State:
Created 6 years ago
Reactions: 5
Comments: 17 (7 by maintainers)

Top GitHub Comments

18 reactions

mattharrison commented, Oct 23, 2019

Oh, the observed parameter in .groupby! Awesome, thanks!

I would prefer it to default to True and also work with any grouper.

15 reactions

jreback commented, Oct 23, 2019

this is exactly what observed does
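
As a minimal sketch of what the comments above refer to, here is the `observed` keyword of `groupby` applied to the sample frame from the issue. It assumes a pandas version that supports `observed`; as far as I know it was added in 0.23, so the 0.20.3 install reported above would not have it. The variable names `observed_only` and `all_combos` are illustrative only.

```python
import pandas as pd

# Same sample data as in the issue, with 'a' converted to a categorical.
df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
df['a'] = df['a'].astype('category')

# observed=True keeps only the (a, b) combinations that occur in the data.
observed_only = df.groupby(['a', 'b'], observed=True).mean().reset_index()
print(observed_only)

# observed=False expands the result to all category combinations,
# which is the behavior the issue reports.
all_combos = df.groupby(['a', 'b'], observed=False).mean().reset_index()
print(all_combos)
```

With `observed=True` the column `a` stays categorical, so the memory savings mentioned in the problem description should be preserved while the unobserved `(y, 1)` row is no longer generated.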
