add three-fold split method train/val/test
Asking how to do a three-fold split is the top sklearn question on Stack Overflow: https://stackoverflow.com/questions/tagged/scikit-learn?sort=frequent&pageSize=50
We have discussed this before, but I think this is a good reason to add it. The other option would be to document more explicitly doing
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)
if that’s the idiomatic way to do that. The “issue” with that way is that it’s harder to figure out the ratios.
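To make the ratio problem concrete, here is a minimal sketch, assuming a 60/20/20 target and toy data (both are illustrative and not from the issue). The second call only sees the train+val portion, so its `test_size` has to be rescaled to 0.2 / 0.8 = 0.25 rather than 0.2:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

test_frac = 0.2  # desired share of the full dataset
val_frac = 0.2   # desired share of the full dataset

# First split: hold out the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=test_frac, random_state=0
)

# Second split: only (1 - test_frac) of the data is left, so the
# validation share must be expressed relative to that remainder.
val_rel = val_frac / (1 - test_frac)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=val_rel, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `test_size=0.2` to both calls would instead give roughly a 64/16/20 split, which is the kind of surprise users run into.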
If we want to add a three-fold split method, there are three options:
- add another parameter to train_test_split (not sure I like this)
- create a new method that mirrors the interface of train_test_split
- create a new method with a better interface than train_test_split
I kinda prefer the last one but it might be confusing to users. My ideal signature would not have *args and explicitly name X and y so we could stratify by default.
Maybe naming_is_hard(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None).
I’m not sure if it would make sense to pass a cv object here? or a CV class that’s internally instantiated? Or we could have stratify and grouped options? Maybe the last makes most sense?

While we’re at it, can we avoid making a function beginning “train” where it is not being used as a verb??? Why not “split_dataset” or “split_samples”?
We could allow:
(X_train, y_train), (X_test, y_test) = split_samples(X, y, test_size=.2)
(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, val_size=.2, test_size=.2)
or even:
(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, test_size=(.2, .2))
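For illustration only, a rough sketch of how such a `split_samples` could be layered on top of `train_test_split`. The function does not exist in scikit-learn; its name comes from the suggestion above, while the defaults, the `random_state` parameter, and the error handling are assumptions made here. It converts the fractions to absolute sample counts up front, so the caller never has to rescale the second split. The tuple form `test_size=(.2, .2)` is not covered.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_samples(X, y, val_size=None, test_size=0.2, random_state=None):
    """Hypothetical helper, not part of scikit-learn.

    val_size and test_size are fractions of the full dataset.
    Returns (X, y) pairs: train and test, or train, val and test.
    """
    n = len(X)
    # Work in absolute counts so both fractions refer to the full dataset.
    n_test = int(round(test_size * n))
    n_val = 0 if val_size is None else int(round(val_size * n))
    if n_test < 1 or (val_size is not None and n_val < 1) or n_test + n_val >= n:
        raise ValueError("val_size and test_size do not leave a valid split")

    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=n_test, random_state=random_state
    )
    if val_size is None:
        return (X_rest, y_rest), (X_test, y_test)

    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=n_val, random_state=random_state
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data, purely illustrative.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(
    X, y, val_size=.2, test_size=.2, random_state=0
)
print(len(X_train), len(X_val), len(X_test))
```

Returning `(X, y)` pairs keeps each subset together, at the cost of breaking with the flat return order of `train_test_split`.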
@amueller @NicolasHug, I think the name train_test_val_split seems intuitive enough. For train_test_val_split(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None), it is better to fix at least two of the sizes; test_size and val_size seem the apt ones to me. We can have train_size just be the complement of those arguments. I think check_cv should also be included to identify whether y is a regression or classification target and eventually do stratified sampling if the user wants to.
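A small sketch of those two ideas, under stated assumptions: `resolve_sizes` is a hypothetical helper invented here to show `train_size` as the complement, and the toy targets are illustrative. `check_cv` is the existing scikit-learn utility; it returns a `StratifiedKFold` only when `classifier=True` and `y` is binary or multiclass, and a plain `KFold` otherwise.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv


def resolve_sizes(val_size, test_size):
    """Hypothetical helper: derive train_size as the complement."""
    train_size = 1.0 - val_size - test_size
    if train_size <= 0:
        raise ValueError("val_size + test_size must be smaller than 1")
    return train_size, val_size, test_size


# Toy targets, purely illustrative.
y_clf = np.array([0, 1] * 10)      # binary labels
y_reg = np.linspace(0.0, 1.0, 20)  # continuous values

# With classifier=True, check_cv inspects the target type of y.
cv_clf = check_cv(cv=5, y=y_clf, classifier=True)
cv_reg = check_cv(cv=5, y=y_reg, classifier=True)

print(type(cv_clf).__name__, type(cv_reg).__name__)
print(resolve_sizes(val_size=0.2, test_size=0.2))
```

Note that `check_cv` returns a K-fold style splitter rather than a single train/val/test partition, so a new function would presumably reuse its target-type detection and not its output directly.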