Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

add three-fold split method train/val/test

See original GitHub issue

Asking how to do a threefold split is the top sklearn question on stackoverflow: https://stackoverflow.com/questions/tagged/scikit-learn?sort=frequent&pageSize=50

We have discussed this before but I think this is a good reason to add it - the other option would be to document more explictly doing

from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

if that’s the idiomatic way to do that. The “issue” with that way is that it’s harder to figure out the ratios.

If we want to add a threefold split method, there’s three options:

add another parameter to train_test_split (not sure I like this)
create a new method that mirrors the interface of train_test_split
create a new method with a better interface than train_test_split

I kinda prefer the last one but it might be confusing to users. My ideal signature would not have *args and explicitly name X and y so we could stratify by default. Maybe naming_is_hard(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None).

I’m not sure if it would make sense to pass a cv object here? or a CV class that’s internally instantiated? Or we could have stratify and grouped options? Maybe the last makes most sense?

Issue Analytics

State:
Created 4 years ago
Reactions:2
Comments:16 (9 by maintainers)

Top GitHub Comments

2reactions

jnothmancommented, Jun 3, 2019

While we’re at it, can we avoid making a function beginning “train” where it is not being used as a verb??? Why not “split_dataset” or “split_samples”?

We could allow:

(X_train, y_train), (X_test, y_test) = split_samples(X, y, test_size=.2) (X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, val_size=.2, test_size=.2)

or even:

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, test_size=(.2, .2))

1reaction

SharathGacommented, Jun 1, 2019

@amueller @NicolasHug , I think the name train_test_val_split seems intuitive enough .

For the train_test_val_split(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None) . It is better to fix at least two values, ideally from which test_size and val_size seem apt to me. We can have train_size just be the complement of those arguments.

I think check_cv should also be included to identify the Y as a regression or classification and eventually do stratified sampling if the user wants to.

Top Results From Across the Web

How to split data into three sets (train, validation, and test) And ...

Train-Valid-Test split is a technique to evaluate the performance of your machine learning model — classification or regression alike.

Burak | Kaggle

The goal of this homework is three-fold: Introduction to the Transfer Learning (Part-A); Gain experience with three dimensional input data (colored images), and ......

3 Modeling and Evaluation | Predictive Analytics - Bookdown

The most typical method is to divide your data into chunks you use to learn and chunks you use to evaluate. While this...

How to split data into 3 sets (train, validation and test)?

Here is a Python function that splits a Pandas dataframe into train, ... 1.0: raise ValueError('fractions %f, %f, %f do not add up...

Adapting to Local Patterns for Improving Graph Neural Networks

non-overlapped partition methods to split the original graph into ... classifier, and label propagation modules are added to improve.