permutation_test_score shuffling when a group enforces a specific label

Describe the bug

I believe there is a bug, or a misunderstanding of the intended design, in the label shuffling code of permutation_test_score. There may be a use case for shuffling only within groups, but to me the default should be to shuffle across all samples while ensuring that all samples within a group receive the same shuffled label. That would match the behavior when no groups exist, i.e. when each sample is in its own unique group, which is how I expected it to work.

Steps/Code to Reproduce

from sklearn.model_selection._validation import _shuffle
from sklearn.utils import check_random_state

y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

print()

# this should work the same as if groups=None but it doesn't
groups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

Expected Results

Labels should be shuffled even when groups are provided. Groups exist to keep all samples of a group together during CV splitting, and I expected the same behavior here. The labels of a group should be treated like the label of a single sample and shuffled as a unit against the labels of the other groups.
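
A minimal sketch of the shuffling described above, assuming every group carries exactly one label. `shuffle_group_labels` is a hypothetical helper written for illustration, not a scikit-learn function, and it seeds NumPy's `RandomState` directly rather than going through the `_shuffle` internals:

```python
import numpy as np


def shuffle_group_labels(y, groups, seed=None):
    """Permute labels across groups so that each group keeps one shared label.

    Hypothetical helper (not part of scikit-learn). Assumes all samples
    of a group have the same label in ``y``.
    """
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    groups = np.asarray(groups)

    # first_idx: position of the first sample of each unique group
    # inverse: for every sample, the index of its group in the unique list
    _, first_idx, inverse = np.unique(
        groups, return_index=True, return_inverse=True
    )
    group_labels = y[first_idx]

    # Check the assumption: one label per group.
    if not np.array_equal(group_labels[inverse], y):
        raise ValueError("every group must carry a single label")

    # Shuffle one label per group, then broadcast back to the samples.
    permuted = rng.permutation(group_labels)
    return permuted[inverse]


y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
print(shuffle_group_labels(y, groups, seed=0))
```

With all-unique groups this reduces to an ordinary permutation of `y`, which is the equivalence with `groups=None` that I expected.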

Actual Results

No labels get shuffled:

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]


[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
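
One way to see why nothing moves, sketched under the assumption that the helper permutes indices only within each group (which is what the output above suggests). `shuffle_within_groups` is a hypothetical stand-in written for illustration, not the scikit-learn source:

```python
import numpy as np


def shuffle_within_groups(y, groups, seed=None):
    """Permute labels only among samples that share a group.

    Hypothetical stand-in for the behavior reported above.
    """
    rng = np.random.RandomState(seed)
    y = np.asarray(y)
    groups = np.asarray(groups)

    indices = np.arange(len(y))
    for group in np.unique(groups):
        mask = groups == group
        # Indices are only exchanged between members of the same group.
        indices[mask] = rng.permutation(indices[mask])
    return y[indices]


y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
print(shuffle_within_groups(y, groups, seed=0))
```

Because each group here holds a single label value, any permutation restricted to a group returns `y` unchanged, whatever the seed.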

Versions

1.0.1 and all previous versions with permutation_test_score (I looked at the source code and it hasn’t changed)

I specifically ran my test on:

System:
    python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
    executable: /home/hermidalc/soft/miniconda3/envs/sklearn-bio-workflows-r36/bin/python
    machine: Linux-5.14.14-200.fc34.x86_64-x86_64-with-glibc2.10

Python dependencies:
    pip: 21.1.2
    setuptools: 49.6.0.post20210108
    sklearn: 0.22.2.post1
    numpy: 1.21.0
    scipy: 1.5.3
    Cython: None
    pandas: 1.2.5
    matplotlib: 3.4.2
    joblib: 1.0.1

Built with OpenMP: True

Issue Analytics

State:
Created 2 years ago
Comments:15 (15 by maintainers)

Top GitHub Comments

glemaitre commented, Nov 24, 2021

Thus not producing the same train/test splits on the data within the for loop of each parallel process?

Each call will indeed produce different splits, but I don’t understand why that is an issue. If you really want the CV to be deterministic, you need to set the random_state of the CV object.

glemaitre commented, Nov 23, 2021

It works for me while still supporting the old behavior of only within group shuffling for the more complex use cases you described.

Yep, I saw it while answering. But I don’t know whether we should switch behaviour only through a new parameter, or whether detecting the case automatically (“magic”) would be fine here.
