No support for stratified split in dask_ml.model_selection.train_test_split

The scikit-learn implementation of train/test splitting (sklearn.model_selection.train_test_split) supports stratified splitting, which preserves the class-label proportions in both splits, through the stratify argument. This is especially useful when a dataset has high class imbalance. It would be really helpful to have this feature in dask_ml.model_selection.train_test_split as well.
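
For reference, here is a minimal sketch of the scikit-learn behaviour being requested. The data is synthetic and the variable names are illustrative only. It shows the existing sklearn API, not a dask_ml implementation or workaround.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (an assumption for illustration):
# 90 samples of class 0 and 10 samples of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y asks scikit-learn to keep the class proportions of y
# in both the train and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Per-class sample counts in each split.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("train:", train_counts, "test:", test_counts)
```

With stratify=y, both splits should keep the original 9:1 class ratio. Without it, a small test set drawn from imbalanced data can end up with very few or no minority-class samples.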

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 2
  • Comments: 20 (13 by maintainers)

Top GitHub Comments

1 reaction
chauhankaranraj commented, Jul 31, 2022

Hey folks, sorry but I haven’t had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you’d like anything from me in doing so.

0 reactions
kennylids commented, Jul 28, 2022

I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates?

Top Results From Across the Web

dask_ml.model_selection.train_test_split
Split arrays into random train and test matrices. ... blockwise : bool, default True. Whether to shuffle data only within blocks (True), or allow data...

Cannot operate on Dask array with unknown chunk sizes - ...
I have a text classification dataset where I used dask parquet to save disk space, but run into the problem now when I...

Is it possible to have stratified train-test split of a set based ...
Consider a dataframe that contains two columns, text and label . I can very easily create a stratified train-test split using sklearn. model_...

Model Training - Dask
Dask has a specific module called dask_ml that replicates the features of scikit-learn accelerated with parallelization. We will use that feature and split...

scikit-learn user guide
In the multilabel case however, splits are still not stratified. ... Fixed a bug where sklearn.model_selection.train_test_split raised an ...
