Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

groupby aggregation on ordered Categorial with 'observed=True' breaks order

See original GitHub issue

Code Sample:

import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
            dict(cat = pd.Series([3, 1, 2, 1, 3, 2], 
                                 dtype=pd.CategoricalDtype(
                                              categories=[1, 2, 3, 4], 
                                              ordered=True)
                                ), 
                 val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
            )
)

Including unobserved categories gives correct groups:

# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')

cat
1    1.0
2    2.0
3    3.0
4    0.0
Name: val, dtype: float64

Excluding unobserved categories changes the order, groups are wrong:

# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')

cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

Problem description:

The sample code shows that grouping with an ordered factor does not respect the factor’s order when 'observed=True' is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical’s order. Thus, the result is wrong.

Related issues: #25167 There, the Categorical was unordered, and the sort=True parameter did not work as expected in combination with observed=True. In my case, sort makes no difference:

df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

both give the same, wrong result as shown above.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.24.2 pytest: 4.3.0 pip: 19.0.3 setuptools: 40.8.0 Cython: 0.29.6 numpy: 1.16.2 scipy: 1.2.1 pyarrow: None xarray: 0.11.3 IPython: 7.1.1 sphinx: 1.8.5 patsy: 0.5.1 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: 1.2.1 tables: 3.4.4 numexpr: 2.6.9 feather: None matplotlib: 3.0.3 openpyxl: 2.6.1 xlrd: 1.2.0 xlwt: 1.3.0 xlsxwriter: 1.1.5 lxml.etree: 4.3.2 bs4: 4.7.1 html5lib: 1.0.1 sqlalchemy: 1.3.1 pymysql: None psycopg2: 2.7.6.1 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: 0.2.1 pandas_gbq: None pandas_datareader: None gcsfs: None

Issue Analytics

State:
Created 4 years ago
Reactions:2
Comments:7 (7 by maintainers)

Top GitHub Comments

1reaction

kpflugshauptcommented, Mar 26, 2019

Right, I will try. Paid work keeps intruding, though…

0reactions

WillAydcommented, Mar 26, 2019

If you could push a PR I can take a look on the review side. Be sure to check out the contributing guide if you have trouble and you an ask specific development questions on Gitter:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

(Besides, as mentioned, Paid Work)

Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!

Top Results From Across the Web

Group by: split-apply-combine — pandas 1.5.2 documentation

To ensure consistent ordering, the keys (and so output columns) will always be sorted for Python 3.5. Named aggregation is also valid for...

Pandas .groupby(), Lambda Function, & Pivot Table Tutorial

The .groupby() function allows us to group records into buckets by categorical values, such as carrier, origin, and destination in this dataset. Since...

Weird behaviour with groupby on ordered categorical columns

So you're saying the orderer Categorical variable gets lost and is treated as a string when the Multiindex is created? Sounds like a...

Data manipulation in R - Stats and R

See the main functions to manipulate data in R such as how to subset a data frame, create a new variable, recode categorical...

Binning Data with Pandas qcut and cut

What library are you using the plot the values? I know some of the python libraries will respect categorical ordering but all might...