groupby aggregation on ordered Categorical with 'observed=True' breaks order

Code Sample:

import pandas as pd

# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
    dict(
        cat=pd.Series(
            [3, 1, 2, 1, 3, 2],
            dtype=pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True),
        ),
        val=pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0]),
    )
)

Including unobserved categories gives correct groups:

# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
cat
1    1.0
2    2.0
3    3.0
4    0.0
Name: val, dtype: float64

Excluding unobserved categories changes the order of the labels, so the groups are wrong:

# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
cat
3    1.0
1    2.0
2    3.0
Name: val, dtype: float64

Problem description:

The sample code shows that grouping with an ordered factor does not respect the factor’s order when 'observed=True' is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical’s order. Thus, the result is wrong.
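One way to confirm that the labels, not the values, are out of place is to compare the groupby output with sums computed without groupby. The following is a minimal sketch, not part of the original report; the helper names reference_sums and grouped_sums are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Same data as in the report: ordered categorical, category 4 never observed.
df = pd.DataFrame(
    {
        "cat": pd.Series(
            [3, 1, 2, 1, 3, 2],
            dtype=pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True),
        ),
        "val": [1.5, 0.5, 1.0, 0.5, 1.5, 1.0],
    }
)


def reference_sums(frame):
    """Hypothetical helper: sum 'val' per 'cat' label in plain Python."""
    totals = defaultdict(float)
    for label, value in zip(frame["cat"].tolist(), frame["val"].tolist()):
        totals[label] += value
    return dict(totals)


def grouped_sums(frame, observed):
    """Hypothetical helper: {label: sum} as reported by groupby."""
    result = frame.groupby("cat", observed=observed)["val"].sum()
    return dict(zip(result.index.tolist(), result.tolist()))


expected = reference_sums(df)
with_unobserved = grouped_sums(df, observed=False)
only_observed = grouped_sums(df, observed=True)

# Each observed label should carry the same sum in all three computations.
for label in sorted(expected):
    print(label, expected[label], with_unobserved[label], only_observed[label])
```

On an affected version, the last column would be expected to differ from the first two for at least one label; on a version where the behavior is fixed, all three columns should agree.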

Related issue: #25167. There, the Categorical was unordered, and the sort=True parameter did not work as expected in combination with observed=True. In my case, sort makes no difference:

df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')

both give the same, wrong result as shown above.
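A possible workaround, which is not proposed in the thread and is only a sketch, relies on the observation above that observed=False yields correct groups: aggregate with unobserved categories included, then drop the labels that never occur in the data. The variable names full, present and trimmed are illustrative.

```python
import pandas as pd

# Same data as in the report: ordered categorical, category 4 never observed.
df = pd.DataFrame(
    {
        "cat": pd.Series(
            [3, 1, 2, 1, 3, 2],
            dtype=pd.CategoricalDtype(categories=[1, 2, 3, 4], ordered=True),
        ),
        "val": [1.5, 0.5, 1.0, 0.5, 1.5, 1.0],
    }
)

# Aggregate with unobserved categories included (reported as correct above).
full = df.groupby("cat", observed=False)["val"].sum()

# Labels that actually occur in the data.
present = set(df["cat"].dropna().tolist())

# Keep only the observed labels; the row order of 'full' is left untouched.
mask = [label in present for label in full.index.tolist()]
trimmed = full.loc[mask]

print(trimmed)
```

Note that the index of trimmed still carries all four categories in its dtype; only the row for the unobserved label is removed.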

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.0
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: 0.11.3
IPython: 7.1.1
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.2.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
kpflugshaupt commented, Mar 26, 2019

Right, I will try. Paid work keeps intruding, though…

0 reactions
WillAyd commented, Mar 26, 2019

If you could push a PR, I can take a look on the review side. Be sure to check out the contributing guide if you have trouble, and you can ask specific development questions on Gitter:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html

(Besides, as mentioned, Paid Work)

Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!
