permutation_test_score shuffling when a group enforces a specific label

Describe the bug

I believe there is a bug, or a misunderstanding of the intended design, in the label-shuffling code of permutation_test_score. Maybe there is a use case for shuffling only within groups, but to me the default should be to shuffle across all samples while ensuring that all samples within a group receive the same shuffled label. That would match the behavior when no groups exist, i.e. when each sample's label is in its own unique group, which is how I expected it to work.

Steps/Code to Reproduce

from sklearn.model_selection._validation import _shuffle
from sklearn.utils import check_random_state

y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

print()

# this should work the same as if groups=None but it doesn't
groups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")

Expected Results

Labels should be shuffled even when there are label groups. Groups exist to keep all labels in a group together during CV shuffling, and I expected the same behavior here. The labels of a group should be treated like a single sample, so that they are shuffled together with the labels of another group.
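
The expected behavior could be sketched roughly as follows. This is only an illustration, not scikit-learn code: `shuffle_labels_between_groups` is a hypothetical helper name, it uses NumPy's `default_rng` rather than the `RandomState` object from the reproduction script, and it assumes every sample in a group already shares the same label.

```python
import numpy as np


def shuffle_labels_between_groups(y, groups, random_state=None):
    """Hypothetical helper: permute one label per group across the groups."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    rng = np.random.default_rng(random_state)

    # first_idx: index of the first sample of each unique group
    # inverse: for every sample, the position of its group in the unique list
    _, first_idx, inverse = np.unique(
        groups, return_index=True, return_inverse=True
    )

    # Assumption: labels are constant within each group.
    group_labels = y[first_idx]
    if not np.array_equal(group_labels[inverse], y):
        raise ValueError("labels are not constant within each group")

    # Shuffle the per-group labels, then broadcast them back to the samples.
    shuffled_group_labels = rng.permutation(group_labels)
    return shuffled_group_labels[inverse]


y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
print(shuffle_labels_between_groups(y, groups, random_state=0))
```

Because groups can differ in size, a shuffle like this keeps the number of groups per label fixed but may change the number of samples per label.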

Actual Results

No labels get shuffled:

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]


[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Versions

1.0.1 and all previous versions with permutation_test_score (I looked at the source code and it hasn’t changed)

I specifically ran my test on:

System:
    python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
    executable: /home/hermidalc/soft/miniconda3/envs/sklearn-bio-workflows-r36/bin/python
    machine: Linux-5.14.14-200.fc34.x86_64-x86_64-with-glibc2.10

Python dependencies:
    pip: 21.1.2
    setuptools: 49.6.0.post20210108
    sklearn: 0.22.2.post1
    numpy: 1.21.0
    scipy: 1.5.3
    Cython: None
    pandas: 1.2.5
    matplotlib: 3.4.2
    joblib: 1.0.1

Built with OpenMP: True

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
glemaitre commented, Nov 24, 2021

Thus not producing the same train/test splits on the data within the for loop of each parallel process?

Each call will indeed produce different splits, but I don’t understand why that is an issue. If you really want the CV to be deterministic, you need to set the random_state of the CV object.
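
As a small illustration of that point, the sketch below uses GroupShuffleSplit as an example CV object; the splitter choice, the toy data, and the helper name `collect_test_indices` are assumptions made for this example and do not come from the thread.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data reusing the groups from the issue; X is a placeholder feature matrix.
groups = np.array([1, 2, 3, 4, 4, 5, 6, 7, 7, 8])
X = np.zeros((len(groups), 1))


def collect_test_indices(cv):
    """Gather the test indices of every split from one call to cv.split."""
    return [test.tolist() for _, test in cv.split(X, groups=groups)]


# An integer random_state makes repeated calls to split() reproducible.
seeded = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
same_splits = collect_test_indices(seeded) == collect_test_indices(seeded)
print(same_splits)
```

With random_state left at None, two calls to split() are not guaranteed to agree.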

1 reaction
glemaitre commented, Nov 23, 2021

It works for me while still supporting the old behavior of only within group shuffling for the more complex use cases you described.

Yep, I saw it while answering. But I don’t know whether we should switch behaviour only through a new parameter, or whether some magic would be fine in this case.
