groupby aggregation on ordered Categorial with 'observed=True' breaks order
See original GitHub issueCode Sample:
import pandas as pd
# Create a DataFrame with an ordered categorical column, one category not present
df = pd.DataFrame(
dict(cat = pd.Series([3, 1, 2, 1, 3, 2],
dtype=pd.CategoricalDtype(
categories=[1, 2, 3, 4],
ordered=True)
),
val = pd.Series([1.5, 0.5, 1.0, 0.5, 1.5, 1.0])
)
)
Including unobserved categories gives correct groups:
# Sum 'val' grouped by 'cat', including unobserved categories
df.groupby('cat', observed=False)['val'].agg('sum')
cat
1 1.0
2 2.0
3 3.0
4 0.0
Name: val, dtype: float64
Excluding unobserved categories changes the order, groups are wrong:
# Sum 'val' grouped by 'cat', excluding unobserved categories
df.groupby('cat', observed=True)['val'].agg('sum')
cat
3 1.0
1 2.0
2 3.0
Name: val, dtype: float64
Problem description:
The sample code shows that grouping with an ordered factor does not respect the factor’s order when 'observed=True'
is set. Instead, group labels are in order of first occurrence in the Categorical, as if it were unordered. The aggregation results, however, are in the Categorical’s order.
Thus, the result is wrong.
Related issues: #25167
There, the Categorical was unordered, and the sort=True
parameter did not work as expected in combination with observed=True
. In my case, sort
makes no difference:
df.groupby('cat', observed=True, sort=True)['val'].agg('sum')
df.groupby('cat', observed=True, sort=False)['val'].agg('sum')
both give the same, wrong result as shown above.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None python: 3.7.2.final.0 python-bits: 64 OS: Darwin OS-release: 18.2.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8
pandas: 0.24.2 pytest: 4.3.0 pip: 19.0.3 setuptools: 40.8.0 Cython: 0.29.6 numpy: 1.16.2 scipy: 1.2.1 pyarrow: None xarray: 0.11.3 IPython: 7.1.1 sphinx: 1.8.5 patsy: 0.5.1 dateutil: 2.8.0 pytz: 2018.9 blosc: None bottleneck: 1.2.1 tables: 3.4.4 numexpr: 2.6.9 feather: None matplotlib: 3.0.3 openpyxl: 2.6.1 xlrd: 1.2.0 xlwt: 1.3.0 xlsxwriter: 1.1.5 lxml.etree: 4.3.2 bs4: 4.7.1 html5lib: 1.0.1 sqlalchemy: 1.3.1 pymysql: None psycopg2: 2.7.6.1 (dt dec pq3 ext lo64) jinja2: 2.10 s3fs: None fastparquet: 0.2.1 pandas_gbq: None pandas_datareader: None gcsfs: None
Issue Analytics
- State:
- Created 4 years ago
- Reactions:2
- Comments:7 (7 by maintainers)
Right, I will try. Paid work keeps intruding, though…
If you could push a PR I can take a look on the review side. Be sure to check out the contributing guide if you have trouble and you an ask specific development questions on Gitter:
https://pandas.pydata.org/pandas-docs/stable/development/contributing.html
Pandas is for all practical purposes maintained entirely by volunteers - any help you can add to that is certainly welcome!