groupby-apply differences between categorical and non-categorical

What happened: A groupby-apply operation on a partitioned dataframe processes (empty) duplicated indices if grouped over a categorical.

What you expected to happen: No duplication, which is the behavior with a non-categorical.

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd

# Categorical grouping column, with matching meta for the expected output
data_cat = {'name': pd.Categorical(['A', 'B', 'B', 'A'], categories=['A', 'B'])}
meta_cat = dd.utils.make_meta((None, str), index=pd.CategoricalIndex(categories=['A', 'B'], name='name'))

# Same values as plain strings, for comparison
data_str = {'name': ['A', 'B', 'B', 'A']}
meta_str = dd.utils.make_meta((None, str), index=pd.Index([''], name='name'))

def agg(frame):
    # Flag groups that arrive with no rows
    return 'bar' if len(frame.index) else 'empty'

def groupby_apply(data, meta):
    df = pd.DataFrame(data)
    # Split the four rows across several partitions
    ddf = dd.from_pandas(df, npartitions=3)
    result = ddf.groupby('name').apply(agg, meta=meta).compute()
    print(result)

print('string')
groupby_apply(data_str, meta_str)
print('\ncategorical')
groupby_apply(data_cat, meta_cat)

Output:

string
name
A    bar
B    bar
dtype: object

categorical
name
A      bar
B    empty
A    empty
B      bar
dtype: object

Anything else we need to know?:

Presumably this happens because each groupby partition expands the full categorical, which could be avoided with the observed keyword. I see #6854 is making some related changes, though in this case I expect observed=True is what is desired. @jsignell may have some thoughts, having recently worked on this section of code.
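To illustrate the expansion described above in plain pandas (not Dask's internals), here is a minimal sketch. The single "partition" frame is a made-up stand-in for one chunk of the shuffled data, and size() is used in place of apply only to keep the output simple.

```python
import pandas as pd

# Hypothetical single partition that happens to contain only the "A" rows,
# while the categorical dtype still declares both categories.
partition = pd.DataFrame(
    {"name": pd.Categorical(["A", "A"], categories=["A", "B"])}
)

# observed=False expands the result to every declared category,
# so "B" appears with a size of 0 even though it has no rows here.
with_unobserved = partition.groupby("name", observed=False).size()

# observed=True keeps only the categories actually present in this frame.
observed_only = partition.groupby("name", observed=True).size()

print(with_unobserved)
print(observed_only)
```

If every partition reports all categories this way, concatenating the per-partition results would repeat each key, which matches the duplicated index in the output above.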

Environment:

  • Dask version: 2.30.0+65.g3c64a880
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
chrisroat commented, Nov 24, 2020

Gotcha. I think my comment about having observed in apply was more wishful thinking. I see the logic/consistency in how things are done, even if it seemed a little surprising to me at first.

Thanks for the fix!

0 reactions
jsignell commented, Nov 24, 2020
  1. If I do a groupby, is a partition always equal to one key? What if that subset of data does not fit into memory?

From what I can tell, one key does not necessarily correspond to one partition. It looks like Dask handles this internally to ensure that partition size doesn’t blow up. When using aggregations, the output is often in one partition, but you can use split_out or split_every to control that.

  1. I want to understand the possible outcomes and how they are accessible. Is this correct?

There is no observed in apply so your cases simplify down to the first and the last.


