Series groupby does not include zero or nan counts for all categorical labels, unlike DataFrame groupby
Steps to reproduce
In [1]: import pandas

In [2]: df = pandas.DataFrame({'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
   ...:                        'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
   ...:                        'treatment': pandas.Categorical(['T', 'C', 'T', 'C'])})
In [3]: df.groupby(['treatment', 'type']).count()
Out[3]:
                voltage
treatment type
C         AAA       1.0
          B         NaN
          C         1.0
T         AAA       1.0
          B         1.0
          C         NaN
In [4]: df.groupby(['treatment', 'type'])['voltage'].count()
Out[4]:
treatment  type
C          AAA     1
           C       1
T          AAA     1
           B       1
Name: voltage, dtype: int64
Problem description
When performing a groupby on categorical columns, categories with empty groups should still be present in the output. That is, the multi-index of the object returned by count() should contain the Cartesian product of the labels of the first categorical column ("treatment" in the example above) and the second categorical column ("type") by which the grouping was performed.
The behavior in cell [3] above is correct. But in cell [4], after obtaining a pandas.core.groupby.SeriesGroupBy object, the series returned by the count() method does not have entries for all levels of the "type" categorical.
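As a sketch (not part of the original report), the index both cells would be expected to carry can be built explicitly with pandas.MultiIndex.from_product; its length is 6:

import pandas

df = pandas.DataFrame({'type': pandas.Categorical(['AAA', 'AAA', 'B', 'C']),
                       'voltage': pandas.Series([1.5, 1.5, 1.5, 1.5]),
                       'treatment': pandas.Categorical(['T', 'C', 'T', 'C'])})

# Cartesian product of the category levels: 2 treatments x 3 types = 6 rows,
# regardless of which combinations actually occur in the data.
expected_index = pandas.MultiIndex.from_product(
    [df['treatment'].cat.categories, df['type'].cat.categories],
    names=['treatment', 'type'])
print(len(expected_index))  # 6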
Expected Output
The output from cell [4] should be equivalent to the output below: length 6, including entries for the index values (C, B) and (T, C).
In [5]: df.groupby(['treatment', 'type']).count().squeeze()
Out[5]:
treatment  type
C          AAA    1.0
           B      NaN
           C      1.0
T          AAA    1.0
           B      1.0
           C      NaN
Name: voltage, dtype: float64
Workaround
Perform the column access after calling count():
In [7]: df.groupby(['treatment', 'type']).count()['voltage']
Out[7]:
treatment  type
C          AAA    1.0
           B      NaN
           C      1.0
T          AAA    1.0
           B      1.0
           C      NaN
Name: voltage, dtype: float64
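Another possible workaround, sketched here under the assumption that df is the frame from the reproduction above, is to reindex the Series result against the full product of category levels so the missing combinations come back as NaN:

full_index = pandas.MultiIndex.from_product(
    [df['treatment'].cat.categories, df['type'].cat.categories],
    names=['treatment', 'type'])

# Missing (treatment, type) combinations reappear as NaN after the reindex.
counts = df.groupby(['treatment', 'type'])['voltage'].count().reindex(full_index)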
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: 0.8.2
IPython: 5.1.0
sphinx: 1.4.8
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.7
blosc: 1.5.0
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.3
html5lib: 0.999
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.3
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.43.0
pandas_datareader: None
Top GitHub Comments
I don’t think we are talking about the same thing.
A reasonable test to block this default change would have been any test that fails because of the explosion of dimensions when observed=False. Such a test would have to run and try to produce an array too large to compute, so if it were runnable with observed=False it would have been an invalid test.
Now that the new default is in, there is nothing left to block and this kind of test has no value in the current state. That is probably the right state anyway, since everything must be explicit in high-arity cases.
Below is the example above in the two cases, for reference.
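A minimal sketch of the two cases, using the df from the reproduction at the top of the issue and assuming pandas >= 0.23 where the observed keyword is available:

# observed=False: full Cartesian product of the category levels.
print(len(df.groupby(['treatment', 'type'], observed=False)['voltage'].count()))  # 6

# observed=True: only the combinations actually present in the data.
print(len(df.groupby(['treatment', 'type'], observed=True)['voltage'].count()))   # 4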
@jreback is there a way of doing that without killing the test framework, though? I don't think it is really a test-worthy case. I simply mean that if you have 20k rows indexed by three columns with arity 10k x 10k x 10k, you get a cube ravelled to 1e12 rows with the default settings, whereas setting observed=True gives fewer than 20k rows.
The new default is fine; it is probably best that folks learn to turn off the Cartesian expansion, but it could hit people when they upgrade old code.
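A hypothetical sketch of the scale concern (column names and sizes are illustrative only): with three 10k-level categorical keys on 20k rows, observed=True keeps the result at or below the row count, while the full Cartesian product would have 1e12 entries.

import numpy
import pandas

n_rows, n_cats = 20000, 10000
rng = numpy.random.RandomState(0)
big = pandas.DataFrame({
    key: pandas.Categorical(rng.randint(n_cats, size=n_rows),
                            categories=list(range(n_cats)))
    for key in ['a', 'b', 'c']})
big['x'] = 1.0

# Only the combinations that actually occur are produced: at most 20000 rows.
small = big.groupby(['a', 'b', 'c'], observed=True)['x'].count()
print(len(small))       # <= 20000
print(n_cats ** 3)      # 1000000000000 rows in the full Cartesian product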