Random sampling of k elements from a dask bag

See original GitHub issue

It seems like there is no way, from the API, to randomly sample k elements from a dask bag.

#1332 addresses a similar issue. It gives each element a random probability of being chosen; however, you are not guaranteed to end up with a fixed number of elements.
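To illustrate why a per-element probability does not pin down the sample size, here is a minimal pure-Python sketch. It does not call dask; `bernoulli_sample` is a hypothetical stand-in for probability-based sampling, and the seed and trial count are arbitrary choices.

```python
import random


def bernoulli_sample(items, prob, rng):
    """Keep each item independently with probability `prob` (hypothetical helper)."""
    return [x for x in items if rng.random() < prob]


items = [1, 2, 3, 4, 5, 6]
rng = random.Random(0)  # arbitrary seed, only for repeatability

# Repeat the sampling many times and record how many elements survive each time.
sizes = [len(bernoulli_sample(items, 0.5, rng)) for _ in range(1000)]
distinct_sizes = sorted(set(sizes))
print(distinct_sizes)  # several different sizes show up, not a single fixed k
```

With `prob = 0.5` the expected size is 3, but any individual run can return anywhere from 0 to 6 elements.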

Currently the way I would do it is:

import dask.bag as db
import random

data = db.from_sequence([1, 2, 3, 4, 5, 6])
# random.sample needs a concrete sequence, so the bag is materialized first
random.sample(data.compute(), 3)  # e.g. [2, 4, 5]

What I think a sample method from dask.bag should do:

import dask.bag as db

data = db.from_sequence([1, 2, 3, 4, 5, 6])
data.sample(3)  # proposed API, would return e.g. [2, 3, 6]

If this looks like a desirable feature for the API, I could try working on it during the next few weeks.
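One way such a method could work over partitioned data, sketched in plain Python under stated assumptions: tag every element with a random key, keep the k smallest keys within each partition, then keep the k smallest keys overall. This is not the implementation that dask adopted; `keyed_smallest` and `sample_k` are hypothetical names, and the partitions are ordinary lists standing in for bag partitions.

```python
import heapq
import random


def keyed_smallest(partition, k, rng):
    """Tag each element with a random key and keep the k smallest keys."""
    keyed = ((rng.random(), x) for x in partition)
    return heapq.nsmallest(k, keyed, key=lambda pair: pair[0])


def sample_k(partitions, k, seed=None):
    """Draw up to k elements uniformly without replacement across partitions."""
    rng = random.Random(seed)
    # Per-partition step: only k candidates per partition need to be kept.
    per_partition = [keyed_smallest(p, k, rng) for p in partitions]
    # Merge step: the k globally smallest keys form the final sample.
    candidates = (pair for part in per_partition for pair in part)
    merged = heapq.nsmallest(k, candidates, key=lambda pair: pair[0])
    return [x for _, x in merged]


partitions = [[1, 2], [3, 4], [5, 6]]
picked = sample_k(partitions, 3, seed=42)
print(picked)
```

Because the keys are independent and identically distributed, the k smallest keys select a uniform sample of size k, and no partition ever has to ship more than k elements to the merge step.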

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 1
  • Comments: 21 (18 by maintainers)

Top GitHub Comments

1 reaction
eracle commented, Apr 29, 2020

OK, I'll take this issue. I am already working on it. I will soon publish a PR along with an explanation of the proposed solution.

0 reactions
jcrist commented, Jun 2, 2020

Fixed by #6208.
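The Dask documentation excerpts quoted on this page describe the resulting function as `dask.bag.random.sample(population, k)`, which returns a new bag of k unique elements. A minimal usage sketch, assuming a dask release that ships `dask.bag.random`; the `npartitions=2` setting and the alias `bag_random` are illustrative choices, not part of the issue:

```python
import dask.bag as db
from dask.bag import random as bag_random  # alias chosen for this example

# Two partitions of three elements each (illustrative choice).
data = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=2)

# sample() builds a lazy bag; compute() materializes the chosen elements.
picked = bag_random.sample(data, 3).compute()
print(picked)  # three distinct elements; the exact values vary per run
```

Unlike the workaround of calling `random.sample` on `data.compute()`, the sampling here is expressed on the bag itself, and only the result is materialized by `compute()`.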


Top Results From Across the Web

dask.bag.random.sample - Dask documentation
Chooses k unique random elements from a bag. Returns a new bag containing elements from the population while leaving the original population unchanged....
Can you randomly sample k values from a Dask series?
I want to randomly sample k values without replacement from a Dask series, and I don't want to compute the length of the...
Dask - How to handle large dataframes in python using ...
Introduction to Dask Bags; How to use Dask Bag for various operations? ... This k denotes that the first k elements should be...
API — Dask 2.23.0 documentation
Create Dask Dataframe from a Dask Bag. Bag.to_delayed (self[, optimize_graph]) ... random.sample (population, k). Chooses k unique random elements from a bag.
Comprehensive Dask Cheat Sheet for Beginners - Medium
Dask Bag implements operations like map, filter, and groupby on collections of ... Topk — returns the K largest elements in the collection....
