groupby-apply differences between categorical and non-categorical
What happened: A groupby-apply operation on a partitioned dataframe processes (empty) duplicated indices if grouped over a categorical.
What you expected to happen: No duplication, which is the behavior with a non-categorical.
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
data_cat = {'name': pd.Categorical(['A', 'B', 'B', 'A'], categories=['A', 'B'])}
meta_cat = dd.utils.make_meta((None, str), index=pd.CategoricalIndex(categories=['A', 'B'], name='name'))
data_str = {'name': ['A', 'B', 'B', 'A']}
meta_str = dd.utils.make_meta((None, str), index=pd.Index([''], name='name'))
def agg(frame):
    return 'bar' if len(frame.index) else 'empty'
def groupby_apply(data, meta):
    df = pd.DataFrame(data)
    ddf = dd.from_pandas(df, npartitions=3)
    result = ddf.groupby('name').apply(agg, meta=meta).compute()
    print(result)
print('string')
groupby_apply(data_str, meta_str)
print('\ncategorical')
groupby_apply(data_cat, meta_cat)
Output:
string
name
A bar
B bar
dtype: object
categorical
name
A bar
B empty
A empty
B bar
dtype: object
Anything else we need to know?:
Presumably this happens because the groupby on each partition expands the full set of categories, which could be avoided with the observed keyword. I see #6854 is making some related changes, though in this case I expected observed=True to be the desired behavior. @jsignell may have some thoughts, having recently worked on this section of code.
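To illustrate the expansion I am guessing at, here is a minimal pandas-only sketch. It assumes a single partition that happens to contain only 'A' rows; the partition contents and variable names are made up for illustration and are not taken from dask internals.

```python
import pandas as pd

# Hypothetical single partition: only 'A' is present, but the
# categorical dtype still declares both 'A' and 'B'.
part = pd.DataFrame(
    {"name": pd.Categorical(["A", "A"], categories=["A", "B"])}
)

# observed=False keeps every declared category, including the empty 'B' group.
sizes_all = part.groupby("name", observed=False).size()

# observed=True keeps only the categories that actually occur in the data.
sizes_observed = part.groupby("name", observed=True).size()

print(sizes_all)
print(sizes_observed)
```

If each partition behaves like the observed=False case, every partition would contribute an entry for every category, which would match the duplicated 'empty' rows in the output above.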
Environment:
- Dask version: 2.30.0+65.g3c64a880
- Python version: 3.8.5
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): pip
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (6 by maintainers)
Gotcha. I think having observed in apply comment was more wishful thinking. I see the logic/consistency in how things are done, even if it seemed a little surprising to me at first.
Thanks for the fix!
From what I can tell, one key does not necessarily correspond to one partition. It looks like dask handles this internally to ensure that partition size doesn’t blow up. When using aggregations the output is often in one partition, but you can use split_out or split_every to control that.
There is no observed in apply so your cases simplify down to the first and the last.