groupby with categorical type returns all combinations

Code Sample, a copy-pastable example if possible

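import pandas as pd

df = pd.DataFrame({'a': ['x', 'x', 'y'], 'b': [0, 1, 0], 'c': [7, 8, 9]})
# Group on plain object-dtype 'a': only the observed (a, b) pairs are returned.
print(df.groupby(['a', 'b']).mean().reset_index())
# Convert 'a' to category dtype and repeat the same groupby.
df['a'] = df['a'].astype('category')
print(df.groupby(['a', 'b']).mean().reset_index())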

Returns two different results:

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b    c
0  x  0  7.0
1  x  1  8.0
2  y  0  9.0
3  y  1  NaN

Problem description

Performing a groupby on a categorical column returns every combination of the groupby columns, including combinations that never occur in the data. This is a problem in my actual application because it produces a massive DataFrame that is mostly filled with NaNs. I would also prefer not to move away from the category dtype, since it provides necessary memory savings.

Expected Output

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

   a  b  c
0  x  0  7
1  x  1  8
2  y  0  9

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 33.1.1
Cython: None
numpy: 1.13.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 5
  • Comments: 17 (7 by maintainers)

Top GitHub Comments

mattharrison commented, Oct 23, 2019 (18 reactions)

Oh, the observed parameter in .groupby! Awesome, thanks!

I would prefer it to default to True and also work with any grouper.

jreback commented, Oct 23, 2019 (15 reactions)

this is exactly what observed does
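
For readers landing here, a minimal sketch of what the two comments above refer to, applied to the example from the issue. It assumes a pandas version newer than the 0.20.3 reported above, since the observed keyword of groupby was only added in 0.23. Both values of observed are passed explicitly because the default is not the same in every pandas release. The variable names observed_only and all_combos are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "b": [0, 1, 0], "c": [7, 8, 9]})
df["a"] = df["a"].astype("category")

# observed=True keeps only the (a, b) pairs that actually occur in the data.
observed_only = df.groupby(["a", "b"], observed=True)["c"].mean().reset_index()

# observed=False crosses every category of 'a' with every value of 'b',
# so the unobserved pair (y, 1) shows up with NaN.
all_combos = df.groupby(["a", "b"], observed=False)["c"].mean().reset_index()

print(observed_only)
print(all_combos)
```

With observed=True the category dtype on 'a' is kept, so the memory savings mentioned in the problem description are not lost, and the result contains one row per combination present in the data.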
