Stratified K Fold and Dataset
🐛 Bug
Hi there! The PyTorch Geometric Dataset object used to work nicely with scikit-learn's StratifiedKFold. See the example below:
from sklearn.model_selection import StratifiedKFold
import torch

kf2 = StratifiedKFold(n_splits=9, shuffle=False)
for train_idx, val_idx in kf2.split(dataset, dataset.data.y):
    train_dataset = dataset[torch.LongTensor(train_idx)]
In version 1.3.2, slicing the dataset also sliced dataset.data.y. Slicing now leaves dataset.data.y unchanged, so further splitting train_dataset with StratifiedKFold no longer works: len(train_dataset) and len(train_dataset.data.y) do not match.
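One possible workaround is to slice the label vector with the same indices used to slice the dataset, and pass that sliced vector to the inner StratifiedKFold. A minimal sketch with plain Python lists standing in for the dataset and for dataset.data.y (the real PyTorch Geometric objects are not needed to show the idea):

```python
# Toy stand-ins: graphs plays the role of the Dataset, y of dataset.data.y.
graphs = [f"graph_{i}" for i in range(10)]
y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# Indices for the outer training fold (as StratifiedKFold would yield them).
train_idx = [0, 2, 3, 5, 7, 9]

# Slice the data and the labels with the *same* indices so lengths stay aligned.
train_graphs = [graphs[i] for i in train_idx]
train_y = [y[i] for i in train_idx]

# train_y now matches train_graphs element for element, so it can be passed
# as the label argument of an inner StratifiedKFold over train_graphs.
print(len(train_graphs) == len(train_y))  # True
```

With the real objects, the same idea amounts to keeping the outer fold's label slice alongside the sliced dataset instead of reading labels back from train_dataset.data.y.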
To Reproduce
Steps to reproduce the behavior:
- Load the PROTEINS dataset from PyTorch Geometric
- Check that len(dataset[indices]) differs from len(dataset[indices].data.y)
Is this the new expected behaviour, and is there an elegant workaround for it?
I think the older behaviour from version 1.3.2 was more natural.
Environment
- OS: MacOS
- Python version: 3.6
- PyTorch version: 1.5
- CUDA/cuDNN version: cpu
Issue Analytics
- State:
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Yes, that is correct. This should already be fixed in later versions.
I just added a dataset.copy() method that converts sliced datasets back to a contiguous layout:
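The snippet the comment refers to is not included above. As a sketch of the intended semantics only (a toy class, not the real torch_geometric API), a contiguous copy() would re-materialize the label store so it matches the slice again:

```python
class ToyDataset:
    """Toy model of the sliced-dataset behaviour described in this issue
    (not the real PyTorch Geometric API): slicing records indices lazily,
    while the underlying label store y stays untouched until copy()
    re-materializes it into a contiguous layout."""

    def __init__(self, y, idx=None):
        self.y = y  # full label store, like dataset.data.y
        self.idx = idx if idx is not None else list(range(len(y)))

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, indices):
        # Lazy slice: keep the full y, only remap the index list.
        return ToyDataset(self.y, [self.idx[i] for i in indices])

    def copy(self):
        # Contiguous copy: rebuild y so it is aligned with the slice.
        return ToyDataset([self.y[i] for i in self.idx])


ds = ToyDataset([0, 0, 1, 1, 0, 1])
sub = ds[[0, 2, 4]]
# The mismatch from the bug report: len(sub) == 3, but len(sub.y) == 6.
fixed = sub.copy()
# After copy(): len(fixed) == len(fixed.y) == 3, safe for a nested split.
```

Under this reading, dataset[torch.LongTensor(train_idx)].copy() would restore the old alignment between the subset and its labels; the actual method name and behaviour should be checked against the installed version.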