Stratified K Fold and Dataset
🐛 Bug
Hi there! The PyTorch Geometric Dataset object used to work nicely with scikit-learn's StratifiedKFold. See the example below:
from sklearn.model_selection import StratifiedKFold
import torch

kf2 = StratifiedKFold(n_splits=9, shuffle=False)
for train_idx, val_idx in kf2.split(dataset, dataset.data.y):
    train_dataset = dataset[torch.LongTensor(train_idx)]
In version 1.3.2, slicing the dataset also sliced dataset.data.y. Slicing now leaves dataset.data.y unchanged, so further splitting train_dataset with StratifiedKFold no longer works: len(train_dataset) and len(train_dataset.data.y) do not match.
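One possible workaround is to slice the label vector with the same indices used to slice the dataset, and pass that sliced vector to the inner StratifiedKFold. A minimal sketch with plain Python lists standing in for the dataset and for dataset.data.y (the real PyTorch Geometric objects are not needed to show the idea):

```python
# Toy stand-ins: graphs plays the role of the Dataset, y of dataset.data.y.
graphs = [f"graph_{i}" for i in range(10)]
y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# Indices for the outer training fold (as StratifiedKFold would yield them).
train_idx = [0, 2, 3, 5, 7, 9]

# Slice the data and the labels with the *same* indices so lengths stay aligned.
train_graphs = [graphs[i] for i in train_idx]
train_y = [y[i] for i in train_idx]

# train_y now matches train_graphs element for element, so it can be passed
# as the label argument of an inner StratifiedKFold over train_graphs.
print(len(train_graphs) == len(train_y))  # True
```

With the real objects, the same idea amounts to keeping the outer fold's label slice alongside the sliced dataset instead of reading labels back from train_dataset.data.y.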
To Reproduce
Steps to reproduce the behavior:
- Load the PROTEINS dataset from PyTorch Geometric
- Check that len(dataset[indices]) differs from len(dataset[indices].data.y)
Is this the new expected behaviour, and is there an elegant workaround for it?
I think the older behaviour from version 1.3.2 was more natural.
Environment
- OS: MacOS
- Python version: 3.6
- PyTorch version: 1.5
- CUDA/cuDNN version: cpu
Issue Analytics
- State:
- Created 3 years ago
- Comments: 10 (5 by maintainers)
Top GitHub Comments
Yes, that is correct. This should already be fixed in later versions.
I just added a dataset.copy() method that converts sliced datasets back to a contiguous layout:
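The snippet the comment refers to is not included above. As a sketch of the intended semantics only (a toy class, not the real torch_geometric API), a contiguous copy() would re-materialize the label store so it matches the slice again:

```python
class ToyDataset:
    """Toy model of the sliced-dataset behaviour described in this issue
    (not the real PyTorch Geometric API): slicing records indices lazily,
    while the underlying label store y stays untouched until copy()
    re-materializes it into a contiguous layout."""

    def __init__(self, y, idx=None):
        self.y = y  # full label store, like dataset.data.y
        self.idx = idx if idx is not None else list(range(len(y)))

    def __len__(self):
        return len(self.idx)

    def __getitem__(self, indices):
        # Lazy slice: keep the full y, only remap the index list.
        return ToyDataset(self.y, [self.idx[i] for i in indices])

    def copy(self):
        # Contiguous copy: rebuild y so it is aligned with the slice.
        return ToyDataset([self.y[i] for i in self.idx])


ds = ToyDataset([0, 0, 1, 1, 0, 1])
sub = ds[[0, 2, 4]]
# The mismatch from the bug report: len(sub) == 3, but len(sub.y) == 6.
fixed = sub.copy()
# After copy(): len(fixed) == len(fixed.y) == 3, safe for a nested split.
```

Under this reading, dataset[torch.LongTensor(train_idx)].copy() would restore the old alignment between the subset and its labels; the actual method name and behaviour should be checked against the installed version.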