
Shuffle does not work properly in Kfolds (stratified or not)

See original GitHub issue

Description

When StratifiedKFold or KFold is used with shuffle=True, the data in each fold keeps its original order and is not shuffled. For example, if we have data[1…20] and we need 2 folds, each fold will look like data[i1],…,data[i10] where i1<i2<…<i10. This is a problem when the data distribution is not random; for instance, in a binary classification task where all negative samples precede the positive ones, the same ordering appears in every fold, so a network trained on such folds will essentially ignore the negative samples and its accuracy will drop dramatically.

Steps/Code to Reproduce

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold


def getDataAndLabels(size):
    # Random features; labels are deliberately ordered: the first half is
    # class 0 and the second half is class 1.
    data = np.random.rand(size)
    labels = np.empty(size)
    for i in range(size):
        labels[i] = 0 if i < size / 2 else 1
    return data, labels


X, Y = getDataAndLabels(100)

# Now do the k-fold splits.
seed = 42

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print("   ==============> StratifiedKFold fold #", i, " shuffle=True")
    print("   ==============> StratifiedKFold train indexes =", train)
    print("   ==============> StratifiedKFold test  indexes =", test)

kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print("   ==============> KFold fold #", i, " shuffle=True")
    print("   ==============> KFold train indexes =", train)
    print("   ==============> KFold test  indexes =", test)

# random_state is omitted here: it has no effect when shuffle=False
# (and newer scikit-learn versions reject the combination).
kfold = KFold(n_splits=10, shuffle=False)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print("   ==============> KFold fold #", i, " shuffle=False")
    print("   ==============> KFold train indexes =", train)
    print("   ==============> KFold test  indexes =", test)

print("Done!")

Expected Results

Shuffled indexes of data for a specific fold, similar to: [ 37 4 56 3 8 87 53 12 54 62]

Actual Results

Unshuffled indexes of data for a specific fold: [ 3 4 8 12 37 53 54 58 62 87]

Versions

scikit-learn 0.20.3

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
jnothman commented, Jan 9, 2020

I think this should be noted in the definitions of KFold.shuffle etc, because shuffle gives the impression otherwise. However, I’m not sure that many users will see this comment 😃

PR welcome, thanks for raising this issue
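
For readers hitting the same confusion: shuffle=True does randomize which samples end up in each fold; it just doesn't randomize the order of the index arrays that split() returns, which always come back in ascending order. A minimal sketch (not from the thread) that shows both effects:

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)

# shuffle=False: fold membership is purely positional, so the first test
# fold is always the first n_samples // n_splits indices.
plain = KFold(n_splits=5, shuffle=False)
print(next(iter(plain.split(X)))[1])         # [0 1 2 3]

# shuffle=True: which samples land in each fold changes with random_state,
# but each returned index array is still sorted in ascending order.
for seed in (0, 1):
    shuffled = KFold(n_splits=5, shuffle=True, random_state=seed)
    print(next(iter(shuffled.split(X)))[1])  # different index sets, still ascending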

0 reactions
jmugan commented, Jun 21, 2021

I came here to figure out what “Note that the samples within each split will not be shuffled.” means since it wasn’t clear to me. I now understand, reading the discussion.
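
If the order of samples inside each fold actually matters (for example, they are fed to a network in sequence), a simple workaround is to permute the returned indices yourself before indexing the data. A hedged sketch, assuming NumPy arrays X and y shaped like the reproduction above:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(100)
y = np.r_[np.zeros(50), np.ones(50)]

rng = np.random.RandomState(42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train, test in skf.split(X, y):
    # shuffle=True only randomizes fold membership; permute the index
    # array explicitly if within-fold order matters for training.
    train = rng.permutation(train)
    X_train, y_train = X[train], y[train]
    # ... fit a model on X_train / y_train ...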

Read more comments on GitHub >

Top Results From Across the Web

When should I shuffle in StratifiedKFold - Stack Overflow
When working with time series data you are correct that shuffling will inflate the accuracy. The reason is because shuffling the training ...
Read more >
Shuffle your dataset when using cross_val_score - YouTube
If you use cross-validation and your samples are NOT in an arbitrary order, shuffling may be required to get meaningful results.
Read more >
Why do the results in cross validation changes whenever I ...
I tried to shuffle my training data then applied the CV function (shuffle then CV). I did that for several times and each...
Read more >
sklearn.model_selection.StratifiedKFold
Whether to shuffle each class's samples before splitting into batches. Note that the samples within each split will not be shuffled.
Read more >
How to Fix k-Fold Cross-Validation for Imbalanced Classification
Sadly, the k-fold cross-validation is not appropriate for evaluating ... This might work fine for data with a balanced class distribution, ...
Read more >
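
As a hedged illustration of the advice in the last few results (not code taken from them): stratification keeps the class ratio in every fold even when the data is imbalanced and ordered, it composes with shuffle=True, and the same splitter object can be handed to cross_val_score instead of the default unshuffled splits.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 90/10 imbalanced labels, deliberately ordered (minority class last).
X = np.random.rand(100, 3)
y = np.r_[np.zeros(90), np.ones(10)]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every test fold keeps the 90/10 class ratio (18 vs 2 samples here),
# no matter how the original data was ordered.
for _, test in cv.split(X, y):
    print(np.bincount(y[test].astype(int)))

# Pass the shuffled, stratified splitter to cross_val_score directly.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())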
