Stratified K Fold and Dataset

🐛 Bug

Hi there! The PyTorch Geometric Dataset object used to work nicely with scikit-learn's StratifiedKFold. See the example below:

 import torch
 from sklearn.model_selection import StratifiedKFold

 kf2 = StratifiedKFold(n_splits=9, shuffle=False)
 for train_idxs, val_idxs in kf2.split(dataset, dataset.data.y):
     train_dataset = dataset[torch.LongTensor(train_idxs)]  # index with the loop variable

In version 1.3.2, slicing the dataset also sliced dataset.data.y. Slicing now seems to leave dataset.data.y unchanged. As a result, splitting train_dataset further with StratifiedKFold no longer works, because len(train_dataset) and len(train_dataset.data.y) do not match.

To Reproduce

Steps to reproduce the behavior:

  1. Load the PROTEINS dataset from PyTorch Geometric.
  2. Check that len(dataset[indices]) differs from len(dataset[indices].data.y).

Is this the new expected behaviour, and is there an elegant workaround for it?

I think the older behaviour from version 1.3.2 was more natural.
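
One possible workaround, sketched here under assumptions rather than taken from the issue thread, is to keep the full label array and carry original indices through both levels of splitting, so that the sliced dataset's data.y is never read. The array y below is a made-up stand-in for dataset.data.y of the unsliced dataset, and the inner n_splits=5 is an arbitrary choice. Each resulting index array could then be used as dataset[torch.LongTensor(idx)].

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stand-in for dataset.data.y of the full, unsliced dataset.
y = np.array([0] * 63 + [1] * 125)
all_idx = np.arange(len(y))

outer = StratifiedKFold(n_splits=9, shuffle=False)
inner = StratifiedKFold(n_splits=5, shuffle=False)  # assumed inner fold count

nested = []
# X is only used for its length, so a zero placeholder is enough.
for train_idx, val_idx in outer.split(np.zeros(len(y)), y):
    y_train = y[train_idx]  # labels of this outer training fold only
    for tr_pos, va_pos in inner.split(np.zeros(len(y_train)), y_train):
        # split() yields positions within train_idx; map them back
        # to indices into the original dataset.
        nested.append((train_idx[tr_pos], train_idx[va_pos], val_idx))
```

Because every index refers to the original dataset, the label array and the selected samples stay aligned regardless of how slicing treats data.y.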

Environment

  • OS: MacOS
  • Python version: 3.6
  • PyTorch version: 1.5
  • CUDA/cuDNN version: cpu

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

rusty1s commented, Aug 26, 2021

Yes, that is correct. This should already be fixed in later versions.

rusty1s commented, May 14, 2020

I just added a dataset.copy() method that converts sliced datasets back to a contiguous layout:

from torch_geometric.datasets import TUDataset

dataset = TUDataset('/tmp/TUDataset', 'MUTAG')
print(len(dataset))
>>> 188
print(dataset.data.y.size())
>>> torch.Size([188])

dataset = dataset[:40]
print(len(dataset))
>>> 40
print(dataset.data.y.size())  # <- invalid access
>>> torch.Size([188])

dataset = dataset.copy()
print(len(dataset))
>>> 40
print(dataset.data.y.size())  # <- correct access
>>> torch.Size([40])
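
Where dataset.copy() is not available, a rough alternative is to read labels sample by sample instead of through dataset.data.y. This is only a sketch: collect_labels is a hypothetical helper, not part of PyTorch Geometric, and it assumes each sample exposes a single-element y, as in graph classification. The SimpleNamespace objects stand in for real Data objects so the snippet runs without the library.

```python
from types import SimpleNamespace


def collect_labels(dataset):
    """Return one label per sample by iterating over the dataset.

    Hypothetical helper: it reads sample.y for each item, so its length
    always matches len(dataset), even for a sliced dataset.
    """
    return [int(sample.y) for sample in dataset]


# Stand-in for a dataset of 188 graphs with binary labels (made-up data).
full = [SimpleNamespace(y=i % 2) for i in range(188)]
subset = full[:40]  # plays the role of dataset[:40]

labels = collect_labels(subset)
```

The returned list can be passed as the y argument of StratifiedKFold.split in place of train_dataset.data.y.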