dask.bag.random.sample returns dask.bag.core.Item instead of dask.bag

What happened:

Say I have a bag A. I want to pair a bag B, which has fewer elements than A, with a sampled version of A. For example, given A = [1, 2, 3, 4, 5, 6, 7] and B = ['a', 'b', 'c', 'd'], I want C = [('a', 3), ('b', 6), ('c', 5), ('d', 1)].
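
As a point of reference, the intended pairing can be sketched in plain Python on small in-memory lists. This only illustrates the desired result and is not Dask code; the fixed seed and the use of random.sample (sampling without replacement, as in the example values above) are assumptions made for the sketch.

```python
import random

A = [1, 2, 3, 4, 5, 6, 7]
B = ["a", "b", "c", "d"]

# Seeded generator so the sketch is repeatable (the seed itself is arbitrary).
rng = random.Random(0)

# Draw len(B) distinct elements of A, then pair them with B in order.
sampled = rng.sample(A, k=len(B))
C = list(zip(B, sampled))

print(C)
```

Each element of B ends up paired with one randomly chosen element of A, which is the shape of C described above.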

According to the documentation for dask.bag.random.sample, I should be able to achieve this as follows (the snippet uses the closely related dask.bag.random.choices):

import dask.bag as db
from dask.bag import random as db_random
from dask.distributed import Client
...
client = Client(n_workers=24, threads_per_worker=4)
...
A = db_random.choices(A, k=B.count())
C = db.zip(B, A).compute() 

However, db.zip raised an AttributeError:

  File ".../lib/python3.7/site-packages/dask/bag/core.py", line 1913, in bag_zip
    assert all(bag.npartitions == npartitions for bag in bags)
  File ".../lib/python3.7/site-packages/dask/bag/core.py", line 1913, in <genexpr>
    assert all(bag.npartitions == npartitions for bag in bags)
AttributeError: 'Item' object has no attribute 'npartitions'
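
The failing attribute lookup can be reproduced without any sampling. A minimal sketch, assuming only that reductions such as Bag.count() return a dask.bag.core.Item; the bag contents and variable names are illustrative.

```python
import dask.bag as db
from dask.bag.core import Item

bag = db.from_sequence([1, 2, 3, 4, 5, 6, 7], npartitions=2)

# A reduction such as count() yields a lazy Item (a single value), not a Bag.
n_items = bag.count()

# A Bag exposes npartitions; an Item does not.
print(type(bag).__name__, hasattr(bag, "npartitions"))
print(type(n_items).__name__, hasattr(n_items, "npartitions"))
```

As the traceback shows, db.zip reads npartitions from every argument, so passing an Item where a Bag is expected fails in exactly this way.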

What you expected to happen:

I expected dask.bag.random.sample to return a Bag instead of an Item.

Minimal Complete Verifiable Example:

Refer to the example above.

Anything else we need to know?:

Although I’m currently testing this on a single workstation, I eventually want the same code to scale to multiple nodes.

Please let me know if there is a better way to achieve this.
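
One possible workaround, offered as a sketch rather than a confirmed recommendation: materialize the sample, which is only as large as B, and wrap it in a new bag with the same partition layout as B before zipping. It assumes both bags are built with db.from_sequence so that their partition sizes line up (db.zip pairs elements partition by partition), and it uses the synchronous scheduler only to keep the example self-contained.

```python
import dask.bag as db
from dask.bag import random as db_random

A = db.from_sequence([1, 2, 3, 4, 5, 6, 7], npartitions=2)
B = db.from_sequence(["a", "b", "c", "d"], npartitions=2)

# k must be a plain int, so compute the count first.
n = B.count().compute(scheduler="synchronous")

# Computing the sample gives an ordinary Python list of n elements.
sampled = list(db_random.choices(A, k=n).compute(scheduler="synchronous"))

# Rebuild a bag whose partitioning matches B, then zip the two bags.
sampled_bag = db.from_sequence(sampled, npartitions=B.npartitions)
C = db.zip(B, sampled_bag).compute(scheduler="synchronous")

print(C)
```

Note that choices samples with replacement, so the same element of A may be paired more than once. If B does not come from db.from_sequence, its partition sizes may differ from those of the rebuilt bag, and the pairing would need to be checked.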

Environment:

  • Dask version: 2.30.0
  • Python version: 3.7.8
  • Operating System: Ubuntu 18.04.4 LTS
  • Install method (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
jsignell commented, Aug 12, 2021

Yep! Closing

1 reaction
jsignell commented, Jan 4, 2021

I think it’ll be fixed by https://github.com/dask/dask/pull/7027 and I think that’s a reasonable thing to do per-partition. I’ll just add a test and get someone who knows more about bags to review.
