API documentation update for train_test_split()

We applied train_test_split to an imbalanced dataset (the Credit Card Fraud dataset) without setting the stratify parameter (None by default). When we checked the train and test data, the class distribution was maintained, i.e., stratification appeared to be applied.

Although stratification appears to be applied by default, the documentation says the following, which is confusing for users:

stratify : array-like, default=None
    If not None, data is split in a stratified fashion, using this as
    the class labels.
    Read more in the :ref:`User Guide <stratification>`.

The API documentation of train_test_split should be updated to reflect the exact behaviour of stratify.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
NicolasHug commented, Jul 26, 2021

If shuffle=False, then None is the only option and no stratification is applied.

Yes

Is it accurate to say that stratification is applied if it is specified, given shuffle=True?

Yes, if stratify is not None then shuffle must be True (and you should probably get an error if you leave it as False).
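
The two cases discussed above can be sketched on toy data. This is a minimal illustration, not part of the original thread: the arrays are made up, and the ValueError for shuffle=False combined with stratify is the behaviour of the scikit-learn versions I know of, so treat it as an assumption for your installed version.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, first five in class 0, last five in class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 5 + [1] * 5)

# shuffle=False with stratify=None: an ordered split, so the test set is
# simply the last 20% of the rows and no stratification takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("ordered test labels:", y_test.tolist())

# shuffle=False together with stratify: expected to be rejected.
try:
    train_test_split(X, y, test_size=0.2, shuffle=False, stratify=y)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Because the ordered split takes the tail of the data, the test set here contains only class 1, which shows that nothing is stratified when shuffle=False.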

1 reaction
NicolasHug commented, Jul 26, 2021

It’s not a contradiction, @brgopalakrishnan. The doc says

shuffle=False => stratify=None

but this doesn’t imply

not(shuffle=False) => not(stratify=None)

A => B is equivalent to not(B) => not(A), but it doesn’t imply not(A) => not(B).
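
The logic here can be checked by brute force over a truth table. In this small sketch, A stands for "shuffle=False" and B for "stratify=None"; the helper name `implies` is only for illustration.

```python
from itertools import product


def implies(a, b):
    """Material implication: a => b is false only when a is true and b is false."""
    return (not a) or b


rows = list(product([False, True], repeat=2))

# A => B and not(B) => not(A) agree on every row of the truth table.
contrapositive_ok = all(implies(a, b) == implies(not b, not a) for a, b in rows)

# Rows where A => B holds but not(A) => not(B) fails.
inverse_counterexamples = [
    (a, b) for a, b in rows if implies(a, b) and not implies(not a, not b)
]

print("contrapositive equivalent:", contrapositive_ok)
print("counterexamples to the inverse:", inverse_counterexamples)
```

The single counterexample is A false and B true, i.e. shuffle=True with stratify=None, which is exactly the default call.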

I’ll close the issue since I think I addressed the original issue about imbalance
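
One way to check the original observation is to compare a default split with an explicitly stratified one. The sketch below uses synthetic labels with roughly 1% positives as a stand-in for the fraud dataset (the sizes and rates are assumptions, not values from the issue). With many samples, a plain random split tends to land close to the overall class ratio by chance, which can look like stratification, while stratify=y pins the ratio down as closely as the integer counts allow.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (hypothetical stand-in for the fraud data).
rng = np.random.RandomState(0)
n_samples = 100_000
y = (rng.rand(n_samples) < 0.01).astype(int)
X = rng.rand(n_samples, 3)

# Default call: shuffle=True, stratify=None, i.e. a plain random split.
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.2, random_state=0)

# Explicit stratification on the class labels.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

overall_rate = y.mean()
rand_gap = abs(y_test_rand.mean() - overall_rate)
strat_gap = abs(y_test_strat.mean() - overall_rate)

print(f"overall positive rate: {overall_rate:.5f}")
print(f"random split gap:      {rand_gap:.5f}")
print(f"stratified split gap:  {strat_gap:.5f}")
```

On small or very rare-class datasets the random split can drift much further from the overall ratio, which is the case where passing stratify matters most.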


Top Results From Across the Web

DataOperationsCatalog.TrainTestSplit Method (Microsoft.ML)
Split the dataset into the train set and test set according to the given fraction. Respects the samplingKeyColumnName if provided.

Support stratify in TrainTestSplit() API #4082 - GitHub
In ML.NET in the TrainTestSplit() API we have the samplingKeyColumnName, but that's kind of the opposite to 'Stratification column':. Name of a ...

sklearn.model_selection.train_test_split
Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)) , and application to input data ... List containing train-test split of inputs.

Split Your Dataset With scikit-learn's train_test_split()
Using train_test_split() from the data science library scikit-learn, you can split ... then take a look at the official documentation or check out...

Train-Test Split for Evaluating Machine Learning Algorithms
Last Updated on August 26, 2020. The train-test split procedure is used to estimate the performance of machine learning algorithms when they ...
