
Shuffling of IterableDataset disabled when dataloader is passed to accelerator.prepare()


Hi, I have an issue with shuffling an IterableDataset that is passed to a dataloader in accelerate. I am trying to use ShufflerIterDataPipe from torch.utils.data.datapipes.iter.combinatorics. As you can see below, it looks like the shuffling is disabled:

from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

data = load_dataset("lvwerra/codeparrot-clean-valid", streaming=True, split="train")
shuffled_data = ShufflerIterDataPipe(data, buffer_size=100)
dataloader = DataLoader(shuffled_data, batch_size=4, shuffle=True)
print(f"Before accelerate prepare: {dataloader.dataset._shuffle_enabled}")

accelerator = Accelerator()
dataloader = accelerator.prepare(dataloader)
print(f"After accelerate prepare: {dataloader.dataset._shuffle_enabled}")

Output:

Before accelerate prepare: True
After accelerate prepare: False

I think it’s because ShufflerIterDataPipe returns a new PyTorch IterableDataset that is also an IterDataPipe. When such an instance is passed to a PyTorch DataLoader, shuffling is enabled only if shuffle=True is passed to the DataLoader; the distinction is made here. I don’t think this is handled during the dataloader preparation in accelerate.
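
To make the mechanism concrete, here is a minimal sketch of the behavior observed with torch 1.11 (IterableWrapper is used here only as a small in-memory stand-in for the streaming dataset). Constructing a DataLoader applies its shuffle flag to the pipe, so rebuilding the dataloader without shuffle=True, which is effectively what happens inside prepare(), switches shuffling back off:

from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter import IterableWrapper
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

# Small in-memory pipe standing in for the streaming dataset
pipe = ShufflerIterDataPipe(IterableWrapper(range(10)), buffer_size=10)

# The DataLoader applies its shuffle setting to the pipe at construction time
DataLoader(pipe, batch_size=4, shuffle=True)
print(pipe._shuffle_enabled)  # True

# Rebuilding a DataLoader around the same pipe without shuffle=True
# disables shuffling again, mirroring what accelerate's rebuilt dataloader does
DataLoader(pipe, batch_size=4)
print(pipe._shuffle_enabled)  # False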

Python 3.9.12
torch==1.11.0
accelerate==0.8.0

Accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, May 19, 2022

Happy to review!

0 reactions
loubnabnl commented, May 19, 2022

It works with torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle). I can make a PR to add this.
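
Until such a fix lands in accelerate, the same call can serve as a manual stopgap. Continuing from the repro at the top of the issue, and assuming the prepared dataloader still exposes the original pipe as .dataset (as the repro’s print statements suggest), something along these lines should re-enable shuffling:

import torch.utils.data.graph_settings

# Re-enable shuffling on the pipe that prepare() switched off
torch.utils.data.graph_settings.apply_shuffle_settings(
    dataloader.dataset, shuffle=True
)
print(dataloader.dataset._shuffle_enabled)  # True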
