add support for groups in train_test_split

train_test_split has support for options ‘stratify’ and ‘shuffle’ but not ‘groups’.

I’m interested in adding support for ‘groups’ to train_test_split (so that all samples from the same group will be in either train or test but not both). In 0.18.1 there is support for the option ‘stratify’ and in master there is recently added support for ‘shuffle’. If others think it might be useful I’d like to make a PR.

The rules start to get a little complicated once the options ‘stratify’, ‘shuffle’, and ‘groups’ interact. It makes sense to raise an error when groups is not None and shuffle is False (there is similar logic for stratify and shuffle in master). It also makes sense to raise a ValueError if groups is not None and stratify is not None, since there are classes GroupShuffleSplit and StratifiedShuffleSplit but no StratifiedGroupShuffleSplit.

The rules look something like this:

stratify    shuffle    groups      behavior
None        True       None        use ShuffleSplit
None        False      None        no shuffling, just splits on n_train
not None    True       None        use StratifiedShuffleSplit
not None    False      None        raise ValueError
None        True       not None    use GroupShuffleSplit (proposed)
None        False      not None    raise ValueError (proposed)
not None    True       not None    raise ValueError (proposed)
not None    False      not None    raise ValueError (proposed)
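
The rules in the table above could be sketched as a small dispatch function. This is only an illustration of the proposed logic, not scikit-learn code: the name choose_splitter and the "sequential" label for the unshuffled case are assumptions made for this example, and the function returns the splitter's name as a string rather than building the splitter.

```python
def choose_splitter(stratify=None, shuffle=True, groups=None):
    """Return the name of the splitter the proposed rules would pick.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    if not shuffle:
        # Without shuffling, neither stratify nor groups can be honored.
        if stratify is not None or groups is not None:
            raise ValueError("stratify and groups both require shuffle=True")
        return "sequential"  # no shuffling, just split on n_train
    if stratify is not None and groups is not None:
        # No splitter class combines stratification with groups.
        raise ValueError("stratify and groups cannot be used together")
    if groups is not None:
        return "GroupShuffleSplit"
    if stratify is not None:
        return "StratifiedShuffleSplit"
    return "ShuffleSplit"
```

Each of the eight rows maps to one branch: the four rows with shuffle=False are handled first, and the remaining four are distinguished by which of stratify and groups is set.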

Another possibility is for train_test_split to be explicitly passed a cross-validator class (rather than figuring it out), but that might be adding more burden on the caller, considering this is a convenience function.

If this is easier to discuss in the form of a PR, I’d be happy to submit one. And if I’m missing a simpler solution to this, I’d be happy to learn that.

thanks, Dennis

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Reactions: 27
  • Comments: 40 (15 by maintainers)

Top GitHub Comments

12 reactions
amueller commented, Jun 21, 2017

Hm I’m not sure we want to go overboard with this helper. You can do

train_inds, test_inds = GroupShuffleSplit().split(X, y, groups).next()
X_train, X_test, y_train, y_test = X[train_inds], X[test_inds], y[train_inds], y[test_inds]

11 reactions
MaxPowerWasTaken commented, Aug 10, 2017

In case anyone else is using Python3 and got AttributeError: 'generator' object has no attribute 'next' from the code:

train_inds, test_inds = GroupShuffleSplit().split(X, groups=groups).next()

…the following works in Python3:

train_inds, test_inds = next(GroupShuffleSplit().split(X, groups=groups))
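
For reference, the Python 3 form can be put together into a small end-to-end sketch. The toy arrays, the test_size of 0.25, and random_state=0 are assumptions chosen for illustration; only GroupShuffleSplit and its split method come from the thread.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 8 samples in 4 groups (values are illustrative only).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

# For GroupShuffleSplit, test_size is a fraction of the groups, not of the samples.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)

# split() returns a generator, so take its first (and only) split with next().
train_inds, test_inds = next(splitter.split(X, y, groups=groups))

X_train, X_test = X[train_inds], X[test_inds]
y_train, y_test = y[train_inds], y[test_inds]

# Check that no group ends up on both sides of the split.
shared = set(groups[train_inds]) & set(groups[test_inds])
print("groups shared between train and test:", shared)
```

Passing n_splits=1 avoids setting up more splits than the single one consumed by next().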
