permutation_test_score shuffling when a group enforces a specific label
Describe the bug
I believe there is a bug, or a misunderstanding on my part about the intended design, in the label-shuffling code used by permutation_test_score. Maybe there is a use case for shuffling only within groups, but to me the default should be to shuffle across all samples while ensuring that all samples within a group receive the same shuffled label. This would be equivalent to the behavior when no groups exist, i.e. when each label is in its own unique group, which is how I expected it to work.
Steps/Code to Reproduce
from sklearn.model_selection._validation import _shuffle
from sklearn.utils import check_random_state
y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
random_state = check_random_state(0)
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")
print()
# this should work the same as if groups=None but it doesn't
groups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for _ in range(10):
    print(_shuffle(y, groups, random_state), "\n", groups, "\n", sep="")
Expected Results
Labels should be shuffled even when groups are given. Groups exist to make sure that all samples in a group stay together during CV splitting, and I expected the same behavior here. The labels of a group should be treated like a single sample and shuffled together against the labels of the other groups.
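To make the expectation concrete, here is a minimal sketch of the group-level shuffle I have in mind. It is only an illustration under stated assumptions: `shuffle_labels_by_group` is a hypothetical helper, not part of scikit-learn, and it assumes that every group carries exactly one distinct label.

```python
import numpy as np


def shuffle_labels_by_group(y, groups, rng):
    """Permute labels across groups so each group keeps one shared label.

    Hypothetical helper (not a scikit-learn function). Assumes all
    samples of a group have the same label; raises otherwise.
    """
    y = np.asarray(y)
    groups = np.asarray(groups)
    # first_idx: position of the first sample of each unique group
    # inverse:   for every sample, the index of its group
    _, first_idx, inverse = np.unique(
        groups, return_index=True, return_inverse=True
    )
    group_labels = y[first_idx]
    # Check the "one label per group" assumption before shuffling.
    if not np.array_equal(group_labels[inverse], y):
        raise ValueError("every group must carry a single label")
    # Shuffle one label per group, then broadcast back to the samples.
    return rng.permutation(group_labels)[inverse]


y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
print(shuffle_labels_by_group(y, groups, np.random.default_rng(0)))
```

One consequence of this design: when groups have different sizes, the number of samples per class can change between permutations, even though the number of groups per class stays fixed.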
Actual Results
No labels get shuffled:
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
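The output above is consistent with a shuffle that only permutes labels inside each group. The following is a minimal sketch of that behavior, written from the observed output rather than copied from the scikit-learn source; `shuffle_within_groups` is a hypothetical name. With singleton groups, or with groups whose samples all share one label, such a shuffle can never change y.

```python
import numpy as np


def shuffle_within_groups(y, groups, rng):
    """Illustrative within-group shuffle (a sketch, not the library code)."""
    y = np.asarray(y)
    groups = np.asarray(groups)
    indices = np.arange(len(y))
    for group in np.unique(groups):
        mask = groups == group
        # Positions are only permuted among samples sharing this group id.
        indices[mask] = rng.permutation(indices[mask])
    return y[indices]


y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
groups = [1, 2, 3, 4, 4, 5, 6, 7, 7, 8]
rng = np.random.default_rng(0)
# Groups 4 and 7 each hold two samples with identical labels and every
# other group is a singleton, so the result always equals y.
print(shuffle_within_groups(y, groups, rng))
```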
Versions
1.0.1 and all previous versions with permutation_test_score
(I looked at the source code and it hasn’t changed)
I specifically ran my test on:
System:
    python: 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05) [GCC 9.3.0]
    executable: /home/hermidalc/soft/miniconda3/envs/sklearn-bio-workflows-r36/bin/python
    machine: Linux-5.14.14-200.fc34.x86_64-x86_64-with-glibc2.10

Python dependencies:
    pip: 21.1.2
    setuptools: 49.6.0.post20210108
    sklearn: 0.22.2.post1
    numpy: 1.21.0
    scipy: 1.5.3
    Cython: None
    pandas: 1.2.5
    matplotlib: 3.4.2
    joblib: 1.0.1

Built with OpenMP: True
Issue Analytics
- Created 2 years ago
- Comments: 15 (15 by maintainers)
Top GitHub Comments
Each call will indeed produce different splits, but I don’t understand why that is an issue. If you really want the CV to be deterministic, you need to set the random_state of the CV object.

Yep, I saw it while answering. But I don’t know whether we should only switch behaviour through a new parameter, or whether magic would be fine in this case.
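As a small illustration of the random_state advice in the comments above, the sketch below fixes the seed on a splitter so that repeated calls to split return the same folds. The choice of GroupShuffleSplit and the toy data are assumptions made for the example; the thread does not say which CV object was in use.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data reusing the labels and groups from the report above.
X = np.zeros((10, 1))
y = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 1])
groups = np.array([1, 2, 3, 4, 4, 5, 6, 7, 7, 8])

# An integer random_state re-seeds the splitter on every call to split().
cv = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)


def collect_test_indices(splitter):
    """Return the test indices of every fold as plain lists."""
    return [test.tolist() for _, test in splitter.split(X, y, groups)]


first = collect_test_indices(cv)
second = collect_test_indices(cv)
print(first == second)  # True: both calls yield identical folds
```

Leaving random_state at its default of None would instead draw new folds on each call.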