
Shuffling automatically set to True in Config API - not compatible with iterable datasets

See original GitHub issue

🐛 Bug Report

The config API appears to assume that the train dataloader should always be shuffled, as per the following line: https://github.com/catalyst-team/catalyst/blob/a7bc302a762d7d9f462ded6d9cd6ae70f8b656aa/catalyst/utils/data.py#L201

This behaviour is particularly undesirable when using iterable datasets, as they are incompatible with shuffle=True. It would be better to let the user specify the desired value of shuffle in the config.yml file. At the moment, if the user passes loaders_params: {"shuffle": False}, it is overwritten on the mentioned line, which leads to the following PyTorch error:

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
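For context, this error can be reproduced in plain PyTorch, independently of Catalyst: DataLoader rejects any truthy shuffle value up front when the dataset is an IterableDataset (the StreamDataset class below is just an illustrative stand-in for any iterable dataset):

```python
from torch.utils.data import DataLoader, IterableDataset

class StreamDataset(IterableDataset):
    """Minimal iterable-style dataset that yields items from a stream."""
    def __iter__(self):
        return iter(range(10))

ds = StreamDataset()

# Leaving shuffle unspecified (or False) is fine:
loader = DataLoader(ds, batch_size=2)
print(sum(1 for _ in loader))  # 5 batches

# shuffle=True is rejected at construction time, before iteration starts:
try:
    DataLoader(ds, shuffle=True)
except ValueError as e:
    print(e)  # "DataLoader with IterableDataset: expected unspecified shuffle option ..."
```

Because the check happens in the DataLoader constructor, there is no way to work around it downstream: the framework building the loader must not pass shuffle=True for iterable datasets.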

How To Reproduce

Steps to reproduce the behavior:

  1. Create an iterable dataset inside customRunner.get_datasets()
  2. Use the config API to create a loader for the dataset, e.g. with the following params:

 loaders: &loaders
   batch_size: None
   num_workers: 0
   drop_last: False
   per_gpu_scaling: False
   loaders_params: {"shuffle": False}

  3. See the following PyTorch error:

ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True

Expected behavior

When passing loaders_params: {"shuffle": False} one would expect shuffling to be turned off for both loaders.

Environment

Catalyst version: 21.03.2
PyTorch version: 1.7.1
TensorFlow version: N/A
TensorBoard version: 2.4.1
OS: Ubuntu 16.04.6 LTS
Python version: 3.7
Nvidia driver version: 460.32.03

Checklist

  • [x] bug description
  • [x] steps to reproduce
  • [x] expected behavior
  • [x] environment
  • [ ] code sample / screenshots

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Scitator commented, Apr 22, 2021

Oh, I see, as a possible workaround:

loaders: &loaders
  batch_size: None
  num_workers: 0
  drop_last: False
  per_gpu_scaling: False
  loaders_params:
    train: {"shuffle": False}
    valid: {"shuffle": False}
    other_loader_key: {"shuffle": False}

Nevertheless, it would be great if you could inject a hotfix here.
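One possible shape for such a hotfix — a hypothetical sketch, not the actual Catalyst code; resolve_shuffle and its signature are invented here for illustration — is to respect an explicit user setting and never force shuffling onto an IterableDataset:

```python
from torch.utils.data import IterableDataset

def resolve_shuffle(loader_name, dataset, loader_params):
    """Decide the shuffle flag for a loader instead of hard-coding
    shuffle=True for the train loader.

    Precedence: an explicit user setting wins; iterable datasets are
    never shuffled (PyTorch would raise ValueError); otherwise only
    the train loader defaults to shuffling.
    """
    if "shuffle" in loader_params:
        return loader_params["shuffle"]
    if isinstance(dataset, IterableDataset):
        return False
    return loader_name.startswith("train")


class Stream(IterableDataset):
    """Tiny iterable dataset used only to exercise the helper."""
    def __iter__(self):
        return iter(range(4))

# An explicit user choice is respected:
print(resolve_shuffle("train", [1, 2, 3], {"shuffle": False}))  # False
# Iterable datasets are never shuffled, even for train:
print(resolve_shuffle("train", Stream(), {}))                   # False
# A map-style train loader keeps the old shuffled default:
print(resolve_shuffle("train", [1, 2, 3], {}))                  # True
```

This keeps the current behaviour for the common map-style case while making the config-provided loaders_params the single source of truth when the user does specify shuffle.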

0 reactions
stale[bot] commented, Jun 28, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
