dask.dataframe.groupby with tuple keys fails

To reproduce:

In [1]: import numpy as np; np.__version__
Out[1]: '1.13.3'

In [2]: import pandas as pd; pd.__version__
Out[2]: '0.21.1'

In [3]: import dask.dataframe; dask.__version__
Out[3]: '0.16.0'

In [4]: df = pd.DataFrame(np.random.choice((0,1), (10, 3)), columns=list('abc'))

In [5]: df.groupby('b').apply(len)
Out[5]: 
b
0    3
1    7
dtype: int64

In [6]: df.groupby(('b', 'c')).apply(len)
Out[6]: 
b  c
0  1    3
1  0    4
   1    3
dtype: int64

In [7]: ddf = dask.dataframe.from_pandas(df, npartitions=2)

In [8]: ddf.groupby('b').apply(len, meta=int).compute()
Out[8]: 
b
1    7
0    3
dtype: int64

In [9]: ddf.groupby(('b', 'c')).apply(len, meta=int).compute()
...
ValueError: Wrong number of items passed 0, placement implies 5

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

jcrist commented, Jan 5, 2018

Looks like we enforce that this is a list, not a tuple. The following works for me:

In [15]: ddf.groupby(['b', 'c']).apply(len, meta=int).compute()
Out[15]:
b  c
0  1    2
1  0    4
0  0    3
1  1    1
dtype: int64

This is probably a simple thing to fix, if you’d like to submit a PR 😃.
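
Until such a fix lands, one way to apply the workaround above is to normalize the grouping keys before they reach Dask. The following is only a minimal sketch: `as_key_list` is a hypothetical helper name, not part of Dask or pandas, and it assumes that a tuple is meant as several column names rather than as one tuple-valued column label.

```python
def as_key_list(by):
    """Return `by` as a list if it is a tuple of keys; otherwise return it unchanged.

    Hypothetical helper, not part of Dask or pandas. It assumes a tuple means
    "several column names", not a single column whose label is a tuple.
    """
    if isinstance(by, tuple):
        return list(by)
    return by


# A tuple of column names becomes the list form that worked in the comment above.
keys = as_key_list(("b", "c"))
print(keys)
```

With the `ddf` from the reproduction, the call would then read `ddf.groupby(as_key_list(('b', 'c'))).apply(len, meta=int).compute()`. If a frame really has a column labelled with a tuple, this conversion changes the meaning of the key, so it is only appropriate when the tuple was intended as a list of columns.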

TomAugspurger commented, Sep 11, 2020

The issue in https://github.com/dask/dask/issues/3047#issuecomment-355595521 is still present (I actually get a SystemError now, fun). But I think given pandas’ difficulties with handling keys as tuples I’m comfortable ignoring it now.

The original issue of specifying a list of keys as tuples currently (correctly) raises in both pandas and Dask.
