API documentation update for train_test_split()

We applied train_test_split to an imbalanced dataset (the Credit Card Fraud dataset) without setting the stratify parameter (None by default). When we checked the train and test data, the class distribution was maintained, i.e., stratification appears to be applied.

Although stratification is applied by default, the documentation says the following, which is confusing for users:

stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as
    the class labels.
    Read more in the :ref:`User Guide <stratification>`.

The API documentation of train_test_split should be updated to reflect the exact behaviour of stratify.

Issue Analytics

State:
Created 2 years ago
Comments:8 (4 by maintainers)

Top GitHub Comments

NicolasHug commented, Jul 26, 2021

If shuffle=False, then None is the only option and no stratification is applied.

Yes.

Is it correct to say of the behavior that stratification is applied if it is specified, given shuffle=True?

Yes. If stratify is not None, then shuffle must be True (and you should probably get an error if you leave it as False).
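The interaction between shuffle and stratify described in this comment can be checked with a small experiment. The following is a minimal sketch on made-up toy arrays (X and y here are illustrative, not from the issue). It assumes a reasonably recent scikit-learn release, where combining shuffle=False with a non-None stratify is expected to raise a ValueError; the exact message may differ between versions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, two balanced classes in sorted order.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# shuffle=False with the default stratify=None: a plain ordered split,
# so the test set is simply the tail of the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("ordered test labels:", y_test.tolist())

# shuffle=False together with stratify=y: assumed to be rejected.
try:
    train_test_split(X, y, test_size=0.2, shuffle=False, stratify=y)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Because the ordered split takes the last rows as the test set, sorted labels like these end up with a single class in the test data, which is one reason stratification is tied to shuffling.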

NicolasHug commented, Jul 26, 2021

It’s not a contradiction, @brgopalakrishnan. The doc says:

shuffle=False => stratify=None

but this doesn’t imply

not(shuffle=False) => not(stratify=None)

A => B is equivalent to not(B) => not(A), but it doesn’t imply not(A) => not(B).
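The logical point can be verified by brute force over all truth assignments. This is an illustrative sketch in plain Python; the helper name implies is introduced here for the example, with A standing for "shuffle=False" and B for "stratify=None".

```python
from itertools import product


def implies(p, q):
    """Material implication: p => q is false only when p is true and q is false."""
    return (not p) or q


rows = list(product([False, True], repeat=2))

# A => B always agrees with its contrapositive not(B) => not(A).
contrapositive_equivalent = all(
    implies(a, b) == implies(not b, not a) for a, b in rows
)

# Assignments where A => B holds but the inverse not(A) => not(B) fails.
counterexamples = [
    (a, b) for a, b in rows if implies(a, b) and not implies(not a, not b)
]

print(contrapositive_equivalent)  # True
print(counterexamples)            # [(False, True)]
```

The single counterexample, A false and B true, corresponds to shuffle=True with stratify=None, which is exactly the default call.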

I’ll close the issue, since I think I addressed the original issue about imbalance.
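For readers who want to see the difference between the default split and an explicitly stratified one, here is a hedged sketch. It uses synthetic labels with 1% positives as a stand-in for an imbalanced dataset (it is not the Credit Card Fraud data), and the variable names are illustrative. On large datasets a purely random split tends to land close to the original class proportions, which can look like stratification; only passing stratify=y asks for the proportions to be preserved.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (assumption): 100 positives out of 10,000 samples.
rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[:100] = 1
X = rng.normal(size=(10_000, 3))

# Default call: stratify=None, so the split is a plain random shuffle.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Explicit stratification on the class labels.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

plain_positives = int(y_test_plain.sum())
strat_positives = int(y_test_strat.sum())
print("positives in test set (stratify=None):", plain_positives)
print("positives in test set (stratify=y):   ", strat_positives)
```

With stratify=y the test set receives 20% of the positives (20 of 100), while the count under the default depends on the random seed and is only approximately proportional.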
