
Shuffling of IterableDataset disabled when dataloader is passed to accelerator.prepare()


Hi, I have an issue with shuffling an IterableDataset that is passed to a dataloader in accelerate. I am trying to use ShufflerIterDataPipe from torch.utils.data.datapipes.iter.combinatorics. As you can see below, it looks like the shuffling is disabled:

from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

data = load_dataset("lvwerra/codeparrot-clean-valid", streaming=True, split="train")
shuffled_data = ShufflerIterDataPipe(data, buffer_size=100)
dataloader = DataLoader(shuffled_data, batch_size=4, shuffle=True)
print(f"Before accelerate prepare: {dataloader.dataset._shuffle_enabled}")

accelerator = Accelerator()
dataloader = accelerator.prepare(dataloader)
print(f"After accelerate prepare: {dataloader.dataset._shuffle_enabled}")

Output:

Before accelerate prepare: True
After accelerate prepare: False

I think it’s because ShufflerIterDataPipe returns a new PyTorch IterableDataset that is also an IterDataPipe. When such an instance is passed to a PyTorch DataLoader, shuffling is enabled only if shuffle=True is passed to the DataLoader; the distinction is made here. I don’t think this is handled during the dataloader preparation in accelerate.
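
To make the mechanism concrete, here is a minimal sketch of the behavior observed with torch 1.11 (IterableWrapper is used here only as a small in-memory stand-in for the streaming dataset). Constructing a DataLoader applies its shuffle flag to the pipe, so rebuilding the dataloader without shuffle=True, which is effectively what happens inside prepare(), switches shuffling back off:

from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

# Small in-memory pipe standing in for the streaming dataset
pipe = ShufflerIterDataPipe(IterableWrapper(range(10)), buffer_size=10)

# The DataLoader applies its shuffle setting to the pipe at construction time
DataLoader(pipe, batch_size=4, shuffle=True)
print(pipe._shuffle_enabled)  # True

# Rebuilding a DataLoader around the same pipe without shuffle=True
# disables shuffling again, mirroring what accelerate's rebuilt dataloader does
DataLoader(pipe, batch_size=4)
print(pipe._shuffle_enabled)  # False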

Python 3.9.12
torch==1.11.0
accelerate==0.8.0

Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, May 19, 2022

Happy to review!

0 reactions
loubnabnl commented, May 19, 2022

It works with torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle). I can make a PR to add this.
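
Until such a fix lands in accelerate, the same call can serve as a manual stopgap. Continuing from the repro at the top of the issue, and assuming the prepared dataloader still exposes the original pipe as .dataset (as the repro’s print statements suggest), something along these lines should re-enable shuffling:

import torch.utils.data.graph_settings

# Re-enable shuffling on the pipe that prepare() switched off
torch.utils.data.graph_settings.apply_shuffle_settings(
    dataloader.dataset, shuffle=True
)
print(dataloader.dataset._shuffle_enabled)  # True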
