question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

add three-fold split method train/val/test

See original GitHub issue

Asking how to do a threefold split is the top sklearn question on stackoverflow: https://stackoverflow.com/questions/tagged/scikit-learn?sort=frequent&pageSize=50

We have discussed this before but I think this is a good reason to add it - the other option would be to document more explictly doing

from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

if that’s the idiomatic way to do that. The “issue” with that way is that it’s harder to figure out the ratios.

If we want to add a threefold split method, there’s three options:

  • add another parameter to train_test_split (not sure I like this)
  • create a new method that mirrors the interface of train_test_split
  • create a new method with a better interface than train_test_split

I kinda prefer the last one but it might be confusing to users. My ideal signature would not have *args and explicitly name X and y so we could stratify by default. Maybe naming_is_hard(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None).

I’m not sure if it would make sense to pass a cv object here? or a CV class that’s internally instantiated? Or we could have stratify and grouped options? Maybe the last makes most sense?

Issue Analytics

  • State:open
  • Created 4 years ago
  • Reactions:2
  • Comments:16 (9 by maintainers)

github_iconTop GitHub Comments

2reactions
jnothmancommented, Jun 3, 2019

While we’re at it, can we avoid making a function beginning “train” where it is not being used as a verb??? Why not “split_dataset” or “split_samples”?

We could allow:

(X_train, y_train), (X_test, y_test) = split_samples(X, y, test_size=.2) (X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, val_size=.2, test_size=.2)

or even:

(X_train, y_train), (X_val, y_val), (X_test, y_test) = split_samples(X, y, test_size=(.2, .2))

1reaction
SharathGacommented, Jun 1, 2019

@amueller @NicolasHug , I think the name train_test_val_split seems intuitive enough .

For the train_test_val_split(X=None, y=None, fit_parms=None, train_size=None, val_size=None, test_size=None) . It is better to fix at least two values, ideally from which test_size and val_size seem apt to me. We can have train_size just be the complement of those arguments.

I think check_cv should also be included to identify the Y as a regression or classification and eventually do stratified sampling if the user wants to.

Read more comments on GitHub >

github_iconTop Results From Across the Web

How to split data into three sets (train, validation, and test) And ...
Train-Valid-Test split is a technique to evaluate the performance of your machine learning model — classification or regression alike.
Read more >
Burak | Kaggle
The goal of this homework is three-fold: Introduction to the Transfer Learning (Part-A); Gain experience with three dimensional input data (colored images), and ......
Read more >
3 Modeling and Evaluation | Predictive Analytics - Bookdown
The most typical method is to divide your data into chunks you use to learn and chunks you use to evaluate. While this...
Read more >
How to split data into 3 sets (train, validation and test)?
Here is a Python function that splits a Pandas dataframe into train, ... 1.0: raise ValueError('fractions %f, %f, %f do not add up...
Read more >
Adapting to Local Patterns for Improving Graph Neural Networks
non-overlapped partition methods to split the original graph into ... classifier, and label propagation modules are added to improve.
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found