dask.bag.random.sample returns dask.bag.core.Item instead of dask.bag
What happened:
Say I have a bag A. What I want to do is pair a bag B, which has fewer elements than A, with a sampled version of A. For example:
A = [1, 2, 3, 4, 5, 6, 7]
B = ['a', 'b', 'c', 'd']
I want C = [('a', 3), ('b', 6), ('c', 5), ('d', 1)]
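For reference, the intended pairing can be sketched in plain Python without Dask. This is only an illustration of the desired result, not the Dask solution: the helper name pair_with_sample and the seed parameter are made up for this example, and it uses the standard library's random.sample (drawing without replacement, as in the example where no value of A repeats). The with-replacement counterpart would be random.choices.

```python
import random


def pair_with_sample(small, large, seed=None):
    """Pair each element of `small` with a distinct element drawn from `large`.

    Hypothetical helper for illustration only; it is not part of Dask.
    """
    rng = random.Random(seed)
    # random.sample draws without replacement, so no element of `large` repeats.
    sampled = rng.sample(large, k=len(small))
    # zip pairs the i-th element of `small` with the i-th sampled element.
    return list(zip(small, sampled))


A = [1, 2, 3, 4, 5, 6, 7]
B = ["a", "b", "c", "d"]
C = pair_with_sample(B, A, seed=0)
print(C)
```

The output has one tuple per element of B, in B's order, each carrying a different element of A. The goal in the issue is the same pairing, but computed lazily on bags.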
Looking at the documentation on dask.bag.random.sample, it looks like I can achieve this by:
import dask.bag as db
from dask.bag import random as db_random
from dask.distributed import Client
...
client = Client(n_workers=24, threads_per_worker=4)
...
A = db_random.choices(A, k=B.count())
C = db.zip(B, A).compute()
However, it turned out that db.zip raised an AttributeError:
File ".../lib/python3.7/site-packages/dask/bag/core.py", line 1913, in bag_zip
assert all(bag.npartitions == npartitions for bag in bags)
File ".../lib/python3.7/site-packages/dask/bag/core.py", line 1913, in <genexpr>
assert all(bag.npartitions == npartitions for bag in bags)
AttributeError: 'Item' object has no attribute 'npartitions'
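The traceback shows an AttributeError rather than an AssertionError because the attribute lookup fails inside the generator expression before the assert can evaluate its result. The sketch below reproduces that behavior with stand-in classes. FakeBag, FakeItem, and check_same_partitions are hypothetical names for this example, not Dask's classes, and taking npartitions from the first argument is an assumption about how the check is set up.

```python
class FakeBag:
    """Hypothetical stand-in for a bag: it exposes npartitions."""

    def __init__(self, npartitions):
        self.npartitions = npartitions


class FakeItem:
    """Hypothetical stand-in for a single lazy value: it has no npartitions."""


def check_same_partitions(*bags):
    # Assumption: the reference partition count comes from the first argument.
    npartitions = bags[0].npartitions
    # Same shape as the assertion in the traceback above.
    assert all(bag.npartitions == npartitions for bag in bags)


try:
    check_same_partitions(FakeBag(4), FakeItem())
    error_name = None
    message = ""
except AttributeError as exc:
    # Reading .npartitions on the item fails before the assert sees a result.
    error_name = type(exc).__name__
    message = str(exc)

print(error_name, message)
```

So the failure is a symptom of the sampling call returning a single item where a bag was expected, not of mismatched partition counts.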
What you expected to happen:
dask.bag.random.sample would return a bag instead of an Item.
Minimal Complete Verifiable Example:
Refer to the example above.
Anything else we need to know?:
Although I’m currently testing this on a single workstation, I eventually want the same code to scale to multiple nodes.
Please let me know if there is a better way to achieve the same thing.
Environment:
- Dask version: 2.30.0
- Python version: 3.7.8
- Operating System: Ubuntu 18.04.4 LTS
- Install method (conda, pip, source): conda
Yep! Closing
I think it’ll be fixed by https://github.com/dask/dask/pull/7027 and I think that’s a reasonable thing to do per-partition. I’ll just add a test and get someone who knows more about bags to review.