Shuffle does not work properly in Kfolds (stratified or not)

Description

When StratifiedKFold or KFold is used with shuffle=True, the indices returned for each fold keep their original ascending order instead of being shuffled. For example, with data[1…20] and 2 folds, each fold looks like data[i1],…,data[i10] where i1<i2<…<i10. This is a problem when the data is not randomly ordered. For instance, in binary classification where all negative samples precede the positive ones, the same ordering recurs in every fold, which causes the network to essentially ignore negative samples, and its accuracy drops dramatically.

Steps/Code to Reproduce

import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold

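def getDataAndLabels(size):
    # Random features; the first half is labelled 0 and the second half 1,
    # so all negative samples precede the positive ones.
    data = np.random.rand(size)
    labels = np.empty(size)

    for i in range(size):
        if i < size / 2:
            labels[i] = 0
        else:
            labels[i] = 1

    return data, labels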


X, Y = getDataAndLabels(100)

#now to k-fold split
seed=42
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
i=0
for train, test in kfold.split(X, Y):
    print("   ==============> StratifiedKFold fold #",i," shuffle=True")
    print("   ==============> StratifiedKFold train indexes =",train)
    print("   ==============> StratifiedKFold test  indexes =",test)
    i=i+1
    
    
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
i=0
for train, test in kfold.split(X, Y):
    print("   ==============> KFold fold #",i," shuffle=true")
    print("   ==============> KFold train indexes =",train)
    print("   ==============> KFold test  indexes =",test)
    i=i+1
    
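# random_state is omitted here: it has no effect when shuffle=False,
# and newer scikit-learn releases raise an error if both are given.
kfold = KFold(n_splits=10, shuffle=False)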
i=0
for train, test in kfold.split(X, Y):
    print("   ==============> KFold fold #",i," shuffle=false")
    print("   ==============> KFold train indexes =",train)
    print("   ==============> KFold test  indexes =",test)
    i=i+1

print("Done!")

Expected Results

Shuffled indexes of data for a specific fold, similar to: [ 37 4 56 3 8 87 53 12 54 62]

Actual Results

Unshuffled indexes of data for a specific fold: [ 3 4 8 12 37 53 54 58 62 87]
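
The observation can be checked with a much shorter sketch. The variable names below (`plain`, `shuffled`) are illustrative and not part of the original report. It suggests that shuffle=True does randomize which samples land in each fold, while the index arrays returned by split() still come back in ascending order:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20)

# Collect the test indices of every fold, without and with shuffling.
plain = [test for _, test in KFold(n_splits=2, shuffle=False).split(X)]
shuffled = [test for _, test in KFold(n_splits=2, shuffle=True, random_state=42).split(X)]

# Without shuffling, the first test fold is the contiguous block 0..9.
print("shuffle=False:", plain[0])

# With shuffling, fold membership is random, but each index array is
# still returned in ascending order.
print("shuffle=True: ", shuffled[0])
print("ascending within every fold:",
      all(np.all(np.diff(fold) > 0) for fold in shuffled))
```

In other words, the shuffle happens before the samples are assigned to folds; the order in which each fold's indices are reported is a separate matter.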

Versions

Version 0.20.3

Issue Analytics

Created 4 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction

jnothman commented, Jan 9, 2020

I think this should be noted in the definitions of KFold.shuffle etc, because shuffle gives the impression otherwise. However, I’m not sure that many users will see this comment 😃

PR welcome, thanks for raising this issue

0 reactions

jmugan commented, Jun 21, 2021

I came here to figure out what “Note that the samples within each split will not be shuffled.” means since it wasn’t clear to me. I now understand, reading the discussion.
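
If the order of samples inside a fold matters for training, one possible workaround is to permute the returned indices yourself. The following is only a sketch under that assumption, not something proposed in the thread; the names `rng` and `folds` are hypothetical, and the data mirrors the ordered labels from the report:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Ordered data as in the report: 50 negatives followed by 50 positives.
X = np.random.rand(100)
y = np.repeat([0, 1], 50)

# Separate generator for the within-fold permutation (an assumption of
# this sketch; any seeded NumPy generator would do).
rng = np.random.RandomState(42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

folds = []
for train_idx, test_idx in skf.split(X, y):
    # split() returns ascending indices; permute the training indices so
    # that negatives no longer precede positives within the fold.
    train_idx = rng.permutation(train_idx)
    folds.append((train_idx, test_idx))

    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... fit and evaluate a model on this fold here ...
```

Permuting only changes the order of the samples, not which samples belong to the fold, so the stratification is unaffected.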