dask.dataframe.groupby with tuple keys fails

To reproduce:

In [1]: import numpy as np; np.__version__
Out[1]: '1.13.3'

In [2]: import pandas as pd; pd.__version__
Out[2]: '0.21.1'

In [3]: import dask.dataframe; dask.__version__
Out[3]: '0.16.0'

In [4]: df = pd.DataFrame(np.random.choice((0,1), (10, 3)), columns=list('abc'))

In [5]: df.groupby('b').apply(len)
Out[5]: 
b
0    3
1    7
dtype: int64

In [6]: df.groupby(('b', 'c')).apply(len)
Out[6]: 
b  c
0  1    3
1  0    4
   1    3
dtype: int64

In [7]: ddf = dask.dataframe.from_pandas(df, npartitions=2)

In [8]: ddf.groupby('b').apply(len, meta=int).compute()
Out[8]: 
b
1    7
0    3
dtype: int64

In [9]: ddf.groupby(('b', 'c')).apply(len, meta=int).compute()
...
ValueError: Wrong number of items passed 0, placement implies 5

Top GitHub Comments

jcrist commented, Jan 5, 2018

Looks like we enforce that this is a list, not a tuple. The following works for me:

In [15]: ddf.groupby(['b', 'c']).apply(len, meta=int).compute()
Out[15]:
b  c
0  1    2
1  0    4
0  0    3
1  1    1
dtype: int64

This is probably a simple thing to fix, if you’d like to submit a PR 😃.
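Until such a fix lands, the list-instead-of-tuple workaround can be wrapped in a small helper. This is only a minimal sketch: `as_key_list` is a hypothetical name, not part of pandas or Dask, and it assumes a tuple is meant as several column names rather than a single tuple-valued column label.

```python
def as_key_list(by):
    """Return groupby keys in a form that does not rely on tuple handling.

    A tuple of column names becomes a list; anything else (a single
    column name, an existing list) is returned unchanged.
    Hypothetical helper, not part of pandas or Dask.
    """
    if isinstance(by, tuple):
        return list(by)
    return by


print(as_key_list(('b', 'c')))  # ['b', 'c']
print(as_key_list('b'))         # b
```

With the `ddf` from the reproduction, the failing call would then be written as `ddf.groupby(as_key_list(('b', 'c'))).apply(len, meta=int).compute()`, which is equivalent to passing `['b', 'c']` directly.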

TomAugspurger commented, Sep 11, 2020

The issue in https://github.com/dask/dask/issues/3047#issuecomment-355595521 is still present (I actually get a SystemError now, fun). But I think given pandas’ difficulties with handling keys as tuples I’m comfortable ignoring it now.

The original issue of specifying a list of keys as tuples currently (correctly) raises in both pandas and Dask.