dask.dataframe.groupby with tuple keys fails
To reproduce:
In [1]: import numpy as np; np.__version__
Out[1]: '1.13.3'
In [2]: import pandas as pd; pd.__version__
Out[2]: '0.21.1'
In [3]: import dask.dataframe; dask.__version__
Out[3]: '0.16.0'
In [4]: df = pd.DataFrame(np.random.choice((0,1), (10, 3)), columns=list('abc'))
In [5]: df.groupby('b').apply(len)
Out[5]:
b
0 3
1 7
dtype: int64
In [6]: df.groupby(('b', 'c')).apply(len)
Out[6]:
b  c
0  1    3
1  0    4
   1    3
dtype: int64
In [7]: ddf = dask.dataframe.from_pandas(df, npartitions=2)
In [8]: ddf.groupby('b').apply(len, meta=int).compute()
Out[8]:
b
1 7
0 3
dtype: int64
In [9]: ddf.groupby(('b', 'c')).apply(len, meta=int).compute()
...
ValueError: Wrong number of items passed 0, placement implies 5
Issue Analytics
- Created 6 years ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
Looks like we enforce that this is a list, not a tuple. The following works for me:
This is probably a simple thing to fix, if you’d like to submit a PR 😃.
The issue in https://github.com/dask/dask/issues/3047#issuecomment-355595521 is still present (I actually get a SystemError now, fun). But I think given pandas’ difficulties with handling keys as tuples I’m comfortable ignoring it now.
The original issue, passing the group keys as a tuple rather than a list, currently (correctly) raises in both pandas and Dask.
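For reference, a small sketch of the pandas side of that behaviour, assuming pandas 1.0 or later, where a tuple passed to `groupby` is treated as a single column label. The frame and the variable names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [0, 0, 1, 1], "c": [1, 1, 0, 1]})

# A list names several key columns.
by_list = df.groupby(["b", "c"]).size()
print(by_list)

# A tuple is looked up as one column label, and no column is named ("b", "c").
try:
    df.groupby(("b", "c")).size()
    tuple_outcome = "grouped"
except KeyError as exc:
    tuple_outcome = f"KeyError: {exc}"
print(tuple_outcome)
```

On older pandas releases such as the 0.21.1 used in the report, the tuple was still interpreted as a list of keys, so the `except` branch would not be reached there.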