Random sampling of k elements from a dask bag
It seems there is no way, from the API, to randomly sample k elements from a dask bag.
#1332 addresses a similar issue. It gives each element a random probability of being chosen, but you are not guaranteed a fixed number of elements at the end.
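To make the difference concrete, here is a minimal sketch in plain Python (not dask code) that contrasts per-element probability sampling with fixed-size sampling. The helper name `bernoulli_sample` is made up for illustration, and an ordinary list stands in for the bag.

```python
import random

def bernoulli_sample(items, prob, rng):
    """Keep each item independently with probability `prob` (size varies)."""
    return [x for x in items if rng.random() < prob]

rng = random.Random(0)
items = [1, 2, 3, 4, 5, 6]

# Repeating the probabilistic sample gives results of different lengths.
sizes = {len(bernoulli_sample(items, 0.5, rng)) for _ in range(200)}

# A fixed-size sample always contains exactly k elements.
fixed = rng.sample(items, 3)

print(sorted(sizes))
print(len(fixed))
```

The first approach only controls the expected sample size, which is why it does not cover the use case described here.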
Currently the way I would do it is:
import dask.bag as db
import random
data = db.from_sequence([1, 2, 3, 4, 5, 6])
# random.sample needs an in-memory sequence, so materialize the bag first
random.sample(data.compute(), 3)  # e.g. [2, 4, 5]
What I think a sample method from dask.bag should do:
import dask.bag as db
data = db.from_sequence([1, 2, 3, 4, 5, 6])
data.sample(3)  # proposed API, e.g. [2, 3, 6]
If this looks like a desirable feature for the API, I could try working on it over the next few weeks.
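One way such a fixed-size sample could be drawn without first materializing the whole bag is reservoir sampling on each partition, followed by a merge weighted by partition sizes. The following is only a sketch in plain Python under that assumption, not the implementation from any dask PR: `reservoir_sample` and `merge_samples` are hypothetical names, and plain lists stand in for bag partitions.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return (sample, n): up to k items chosen uniformly in one pass,
    together with the number of items seen."""
    sample = []
    n = 0
    for item in iterable:
        n += 1
        if len(sample) < k:
            sample.append(item)
        else:
            j = rng.randrange(n)  # uniform in [0, n)
            if j < k:
                sample[j] = item
    return sample, n

def merge_samples(parts, k, rng=random):
    """Combine per-partition (sample, n) pairs into one uniform sample of size k."""
    remaining = [n for _, n in parts]
    total = sum(remaining)
    if k > total:
        raise ValueError("sample larger than population")
    # Decide how many items each partition contributes, drawing partitions
    # in proportion to the number of not-yet-chosen items they hold.
    take = [0] * len(parts)
    for _ in range(k):
        r = rng.randrange(total)
        for i, count in enumerate(remaining):
            if r < count:
                take[i] += 1
                remaining[i] -= 1
                total -= 1
                break
            r -= count
    result = []
    for (sample, _), t in zip(parts, take):
        result.extend(rng.sample(sample, t))
    return result

partitions = [[1, 2], [3, 4], [5, 6]]  # stand-ins for bag partitions
parts = [reservoir_sample(p, 3) for p in partitions]
picked = merge_samples(parts, 3)
print(picked)
```

Each partition is scanned once and only keeps at most k items, so the merge step works on small per-partition summaries rather than the full data.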
Issue Analytics
- State:
- Created 4 years ago
- Reactions:1
- Comments:21 (18 by maintainers)
OK, I'll take this issue. I'm already working on it and will soon publish a PR along with an explanation of the proposed solution.
Fixed by #6208.
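Later Dask releases document a `dask.bag.random.sample(population, k)` function that chooses k unique random elements from a bag and returns them as a new bag. Assuming that function is available in your installed version, usage might look like the sketch below. The `scheduler="synchronous"` argument is only there to keep the example single-process.

```python
import dask.bag as db
from dask.bag import random as bag_random

data = db.from_sequence([1, 2, 3, 4, 5, 6], npartitions=2)

# sample() is lazy and returns a bag; compute() materializes the result
picked = list(bag_random.sample(data, 3).compute(scheduler="synchronous"))
print(picked)
```

The result always has exactly 3 distinct elements from the original bag, which is the behavior requested in this issue.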