Shuffling of IterableDataset disabled when dataloader is passed to accelerator.prepare()
Hi, I have an issue with shuffling an IterableDataset that is passed to a dataloader in accelerate. I am trying to use ShufflerIterDataPipe from torch.utils.data.datapipes.iter.combinatorics.
As you can see below, it looks like the shuffling is disabled:
from accelerate import Accelerator
from datasets import load_dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe
# Streaming mode returns an IterableDataset; wrap it in a shuffling datapipe.
data = load_dataset("lvwerra/codeparrot-clean-valid", streaming=True, split="train")
shuffled_data = ShufflerIterDataPipe(data, buffer_size=100)
# For datapipes, shuffle=True on the DataLoader is what enables the shuffling.
dataloader = DataLoader(shuffled_data, batch_size=4, shuffle=True)
print(f"Before accelerate prepare: {dataloader.dataset._shuffle_enabled}")
accelerator = Accelerator()
dataloader = accelerator.prepare(dataloader)
print(f"After accelerate prepare: {dataloader.dataset._shuffle_enabled}")
Output:
Before accelerate prepare: True
After accelerate prepare: False
I think it’s because ShufflerIterDataPipe returns a new PyTorch IterableDataset that is also an IterDataPipe. When such an instance is passed to a PyTorch DataLoader, shuffling is only enabled if shuffle=True is specified on the DataLoader; the distinction is made here. I don’t think this is handled during the dataloader preparation in accelerate.
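To make that distinction concrete, here is a minimal, self-contained sketch that does not need the streaming dataset above. It assumes IterableWrapper from torch.utils.data.datapipes.iter is available (it is in torch 1.11); only the DataLoader's shuffle argument differs between the two loaders:

from torch.utils.data.dataloader import DataLoader
from torch.utils.data.datapipes.iter import IterableWrapper  # assumed available in this torch version
from torch.utils.data.datapipes.iter.combinatorics import ShufflerIterDataPipe

# Two independent shuffling datapipes over a plain iterable.
pipe_a = ShufflerIterDataPipe(IterableWrapper(range(100)), buffer_size=10)
pipe_b = ShufflerIterDataPipe(IterableWrapper(range(100)), buffer_size=10)

# For an IterDataPipe, the DataLoader's shuffle argument is what enables or
# disables the ShufflerIterDataPipe, rather than creating a sampler.
dl_shuffled = DataLoader(pipe_a, batch_size=4, shuffle=True)
dl_plain = DataLoader(pipe_b, batch_size=4, shuffle=False)

print(dl_shuffled.dataset._shuffle_enabled)  # True
print(dl_plain.dataset._shuffle_enabled)     # False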
Python 3.9.12
torch=1.11.0
accelerate=0.8.0
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
Top GitHub Comments
Happy to review!
It works with torch.utils.data.graph_settings.apply_shuffle_settings(dataset, shuffle=shuffle). I can make a PR to add this.
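Until that fix lands in accelerate, a possible workaround, sketched from the comment above and reusing the dataloader and accelerator from the repro, is to re-apply the shuffle setting after prepare():

import torch.utils.data.graph_settings

dataloader = accelerator.prepare(dataloader)
# Re-enable shuffling on the underlying ShufflerIterDataPipe after prepare().
torch.utils.data.graph_settings.apply_shuffle_settings(dataloader.dataset, shuffle=True)
print(dataloader.dataset._shuffle_enabled)  # expected: True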