Shuffle does not work properly in Kfolds (stratified or not)
Description
When StratifiedKFold or KFold is used with shuffle=True, the indexes returned for each fold keep their original order and are not shuffled. For example, with data[1…20] and 2 folds, each fold looks like data[i1],…,data[i10] where i1<i2<…<i10. This is a problem when the data is not randomly ordered: in binary classification where negative samples precede positive samples, the same ordering appears in every fold, which causes the network to essentially ignore negative samples, and its accuracy drops dramatically.
Steps/Code to Reproduce
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold


def getDataAndLabels(size):
    # Random features; the first half of the samples is labelled 0, the rest 1.
    data = np.random.rand(size)
    labels = np.empty(size)
    for i in range(size):
        if i < size / 2:
            labels[i] = 0
        else:
            labels[i] = 1
    return data, labels


X, Y = getDataAndLabels(100)

# now to k-fold split
seed = 42

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print(" ==============> StratifiedKFold fold #", i, " shuffle=True")
    print(" ==============> StratifiedKFold train indexes =", train)
    print(" ==============> StratifiedKFold test indexes =", test)

kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print(" ==============> KFold fold #", i, " shuffle=True")
    print(" ==============> KFold train indexes =", train)
    print(" ==============> KFold test indexes =", test)

# random_state is omitted here: it has no effect when shuffle=False,
# and newer scikit-learn releases reject the combination.
kfold = KFold(n_splits=10, shuffle=False)
for i, (train, test) in enumerate(kfold.split(X, Y)):
    print(" ==============> KFold fold #", i, " shuffle=False")
    print(" ==============> KFold train indexes =", train)
    print(" ==============> KFold test indexes =", test)

print("Done!")
Expected Results
Shuffled indexes of data for a specific fold, similar to: [ 37 4 56 3 8 87 53 12 54 62]
Actual Results
Unshuffled indexes of data for a specific fold: [ 3 4 8 12 37 53 54 58 62 87]
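A smaller check may help separate the two things going on here. The sketch below is not part of the original report; it reuses the 20-sample, 2-fold setup from the description and the variable names are made up for illustration. It compares fold membership with and without shuffle=True, and checks whether the returned index arrays are sorted.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)  # 20 samples, as in the data[1…20] example

# Test indexes of each fold, without and with shuffling.
plain = [test for _, test in KFold(n_splits=2, shuffle=False).split(X)]
shuffled = [test for _, test in
            KFold(n_splits=2, shuffle=True, random_state=42).split(X)]

# Shuffling changes WHICH samples end up in a fold ...
membership_changed = not np.array_equal(plain[0], shuffled[0])
# ... but each returned index array is still in ascending order.
still_sorted = all(np.all(np.diff(fold) > 0) for fold in shuffled)

print("fold 0 without shuffle:", plain[0])
print("fold 0 with shuffle:   ", shuffled[0])
print("membership changed:", membership_changed,
      "| indexes sorted:", still_sorted)
```

If both flags come out True, shuffle=True is assigning samples to folds at random, and only the order of the indexes inside each fold is left sorted.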
Versions
Version 0.20.3
Issue Analytics
- State:
- Created 4 years ago
- Comments:6 (4 by maintainers)
I think this should be noted in the definitions of KFold.shuffle etc, because shuffle gives the impression otherwise. However, I’m not sure that many users will see this comment 😃
PR welcome, thanks for raising this issue
I came here to figure out what “Note that the samples within each split will not be shuffled.” means since it wasn’t clear to me. I now understand, reading the discussion.
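For anyone who needs the samples inside each fold in random order (for example, a model that is sensitive to sample order during training), one possible workaround is to permute the returned indexes yourself. This is only a sketch under that assumption, not something suggested by the maintainers in this thread; the seed and the names rng, train_perm and folds are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels ordered as in the report: all negatives first, then all positives.
y = np.array([0] * 50 + [1] * 50)
X = np.random.rand(100)

rng = np.random.default_rng(42)  # hypothetical seed, chosen for the example
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

folds = []
for train, test in kfold.split(X, y):
    # split() returns sorted indexes, so permute the training indexes explicitly.
    train_perm = rng.permutation(train)
    folds.append((train, train_perm, test))
    print("first training labels after permutation:", y[train_perm][:10])
```

Each train_perm holds the same samples as train, only in a different order, so fold membership and stratification are unchanged.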