Shuffling automatically set to True in Config API - not compatible with iterable datasets

🐛 Bug Report

The config API appears to assume that the train dataloader should always be shuffled, as per the following line: https://github.com/catalyst-team/catalyst/blob/a7bc302a762d7d9f462ded6d9cd6ae70f8b656aa/catalyst/utils/data.py#L201

This behaviour is particularly undesirable with iterable datasets, which are incompatible with shuffle=True. It would probably be better to let the user specify the desired value of shuffle in the config.yml file. At the moment, if the user passes loaders_params: {"shuffle": False}, the value is overwritten in the line linked above, which leads to the PyTorch error:

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True

How To Reproduce

Steps to reproduce the behavior:

1. Return an iterable dataset from customRunner.get_datasets().
2. Use the config API to create a loader for the dataset, e.g. with the following params:

   loaders: &loaders
     batch_size: None
     num_workers: 0
     drop_last: False
     per_gpu_scaling: False
     loaders_params: {"shuffle": False}

3. See the following PyTorch error:

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
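
The error itself does not depend on Catalyst. The following is a minimal sketch in plain PyTorch, assuming a hypothetical CountingDataset class that is not part of the original report. It illustrates that DataLoader accepts an iterable dataset when shuffle is left unspecified and raises the ValueError above when shuffle=True is passed:

```python
from torch.utils.data import DataLoader, IterableDataset


class CountingDataset(IterableDataset):
    """Hypothetical iterable dataset that yields the integers 0..n-1."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return iter(range(self.n))


dataset = CountingDataset(4)

# shuffle left unspecified: the loader is built and iterates in order.
# batch_size=None disables automatic batching, as in the config above.
loader = DataLoader(dataset, batch_size=None)
items = [int(x) for x in loader]

# shuffle=True: PyTorch rejects the combination when the loader is constructed.
try:
    DataLoader(dataset, batch_size=None, shuffle=True)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(items)
print(error_message)
```

Because the check happens in the DataLoader constructor, any code path that forces shuffle=True for the train loader fails before a single batch is read.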

Expected behavior

When passing loaders_params: {"shuffle": False} one would expect shuffling to be turned off for both loaders.
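
One way to picture the expected precedence is the sketch below. It is not Catalyst's implementation: resolve_loader_params and its arguments are hypothetical names introduced only for illustration, and the "shuffle train loaders by default" rule is an assumption based on the description above. The idea is that a built-in default applies only when the user has not supplied a value, either shared across loaders or for a specific loader:

```python
def resolve_loader_params(name, shared_params, per_loader_params):
    """Hypothetical helper: build the keyword arguments for one loader.

    Precedence sketched here: per-loader value > shared value > default.
    """
    params = dict(shared_params)
    params.update(per_loader_params.get(name, {}))
    # Assumed default: shuffle only loaders whose name starts with "train",
    # and only when the user has not set "shuffle" explicitly.
    params.setdefault("shuffle", name.startswith("train"))
    return params


# A shared setting switches shuffling off for every loader.
shared_off_train = resolve_loader_params("train", {"shuffle": False}, {})
shared_off_valid = resolve_loader_params("valid", {"shuffle": False}, {})

# With no user setting, the assumed default still applies.
default_train = resolve_loader_params("train", {}, {})
default_valid = resolve_loader_params("valid", {}, {})

# A per-loader setting overrides the shared one.
per_loader_train = resolve_loader_params(
    "train", {"shuffle": True}, {"train": {"shuffle": False}}
)
```

Using setdefault rather than a plain assignment is the design choice that matters here: an unconditional assignment would discard the user's value, which is the behaviour the report describes.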

Environment

Catalyst version: 21.03.2
PyTorch version: 1.7.1
TensorFlow version: N/A
TensorBoard version: 2.4.1
OS: Ubuntu 16.04.6 LTS
Python version: 3.7
Nvidia driver version: 460.32.03

Checklist

[x] bug description
[x] steps to reproduce
[x] expected behavior
[x] environment
[ ] code sample / screenshots

Top GitHub Comments

Scitator commented, Apr 22, 2021

Oh, I see, as a possible workaround:

loaders: &loaders     
  batch_size: None  
  num_workers: 0  
  drop_last: False 
  per_gpu_scaling: False  
  loaders_params:
    train: {"shuffle": False}
    valid: {"shuffle": False}
    other_loader_key: {"shuffle": False}

Nevertheless, it would be great if you could inject a hotfix here.

stale[bot] commented, Jun 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
