groupby-apply differences between categorical and non-categorical

What happened: A groupby-apply operation on a partitioned dataframe processes (empty) duplicated indices if grouped over a categorical.

What you expected to happen: No duplication, which is the behavior with a non-categorical.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

data_cat = {'name': pd.Categorical(['A', 'B', 'B', 'A'], categories=['A', 'B'])}
meta_cat = dd.utils.make_meta((None, str), index=pd.CategoricalIndex(categories=['A', 'B'], name='name'))

data_str = {'name': ['A', 'B', 'B', 'A']}
meta_str = dd.utils.make_meta((None, str), index=pd.Index([''], name='name'))

def agg(frame):
    return 'bar' if len(frame.index) else 'empty'

def groupby_apply(data, meta):
    df = pd.DataFrame(data)
    ddf = dd.from_pandas(df, npartitions=3)
    result = ddf.groupby('name').apply(agg, meta=meta).compute()
    print(result)

print('string')
groupby_apply(data_str, meta_str)
print('\ncategorical')
groupby_apply(data_cat, meta_cat)

Output:

string
name
A    bar
B    bar
dtype: object

categorical
name
A      bar
B    empty
A    empty
B      bar
dtype: object

Anything else we need to know?:

Presumably this happens because each groupby partition expands the full set of categories, which could be avoided with the observed keyword. I see #6854 is making some related changes, though in this case I would expect observed=True to be the desired behavior. @jsignell may have some thoughts, having recently worked on this section of the code.
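To illustrate what the observed keyword changes, here is a minimal pandas-only sketch. It does not reproduce dask's internals; it assumes a single hypothetical partition that happens to contain only the key 'A', and the column name x is made up for the example. With observed=False the unobserved category 'B' still appears in the result, and with observed=True it does not.

```python
import pandas as pd

# Hypothetical single partition: only 'A' is present, but the
# categorical dtype still declares both 'A' and 'B'.
partition = pd.DataFrame({
    "name": pd.Categorical(["A", "A"], categories=["A", "B"]),
    "x": [1, 2],
})

# observed=False keeps every declared category, including the empty 'B' group.
sizes_all = partition.groupby("name", observed=False).size()

# observed=True keeps only the categories that actually occur in the data.
sizes_observed = partition.groupby("name", observed=True).size()

print(sizes_all)
print(sizes_observed)
```

If each partition behaves like the observed=False case, concatenating the per-partition results would repeat index labels, which is consistent with the duplicated 'empty' rows in the output above.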

Environment:

Dask version: 2.30.0+65.g3c64a880
Python version: 3.8.5
Operating System: Ubuntu 20.04
Install method (conda, pip, source): pip

Top GitHub Comments

chrisroat commented, Nov 24, 2020

Gotcha. I think my comment about having observed in apply was more wishful thinking. I see the logic and consistency in how things are done, even if it seemed a little surprising to me at first.

Thanks for the fix!

jsignell commented, Nov 24, 2020

If I do a groupby, is a partition always equal to one key? What if that subset of data does not fit into memory?

From what I can tell, one key does not necessarily correspond to one partition. It looks like dask handles this internally to ensure that partition size doesn't blow up. When using aggregations, the output is often in one partition, but you can use split_out or split_every to control that.

I want to understand the possible outcomes and how they are accessible. Is this correct?

There is no observed in apply so your cases simplify down to the first and the last.
