Notes on shuffling, sharding, and batch size
(I’m writing this down here to have a written trace, but I’m looking forward to discussing this with you all in our upcoming meetings 😃)
I spent some time porting the torchvision training recipes to use datapipes, and I noticed that the model I trained on ImageNet with DPs was much less accurate than the one trained with regular datasets. After a lot of digging, I came to the following conclusions:
- the datapipe must be shuffled before it is sharded
- the DataLoader does not behave in the same way with a datapipe and with a regular indexable dataset, in particular when it comes to the size of the last batches in an epoch. This has a dramatic effect on accuracy (probably because of batch-norm).
Details below. Note: for sharding, I used this custom torchvision sharder, which takes both DDP and DataLoader workers into account, plus the TakerIterDataPipe below it.
Shuffle before shard
First, some quick results (training a resnext50_32x4d for 5 epochs with 8 GPUs and 12 workers per GPU):
- Shuffle before shard: Acc@1 = 47% – this is on par with the regular indexable dataset version (phew!!)
- Shuffle after shard: Acc@1 = 2%
One way to explain this is that if we shuffle after we shard, then only sub-parts of the dataset get shuffled: each of the 8 * 12 = 96 DataLoader workers receives ~1/96th of the dataset, and each of these parts gets shuffled independently. This means the shuffling is far from uniform, and for datasets whose layout is all_samples_from_class1, all_samples_from_class2, ... all_samples_from_classN, it’s possible that some class i never appears in the same batch as class j.
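For illustration, here is a minimal sketch of the two orderings using the stock `sharding_filter` from torchdata (the experiment above used the custom torchvision sharder mentioned earlier, which also accounts for DDP ranks); `samples` is a hypothetical stand-in for the ImageNet file listing:

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stand-in for the (class-sorted) list of ImageNet samples.
samples = list(range(100_000))

# Shuffle before shard: the shuffle sees the full dataset, and the sharding
# filter then keeps a disjoint ~1/96th of the shuffled stream in each worker.
good_dp = IterableWrapper(samples).shuffle(buffer_size=len(samples)).sharding_filter()

# Shuffle after shard: each worker first keeps its ~1/96th slice of the
# class-sorted dataset and only then shuffles it, so batches stay class-local.
bad_dp = IterableWrapper(samples).sharding_filter().shuffle(buffer_size=len(samples))
```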
So it looks like we need to shuffle before we shard. Now, if we shuffle before sharding, we still need to make sure that all of the 96 workers shuffle the dataset with the same RNG. Otherwise we risk sampling a given sample in more than one worker, or not at all. For that to happen, one can set a random seed in worker_init_fn, but that causes a second problem: the random transformations of each worker will also be the same, and this will lead to slightly less accurate results; on top of that, all epochs will start with the same seed, so the shuffling is the same across all epochs. I do not know how to solve this problem yet.
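Here is a rough sketch of that workaround and its side effect (this is a hypothetical wiring, not the recipe’s actual code; `good_dp` is the shuffle-then-shard pipeline sketched above):

```python
import random
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Give every worker the same seed so that every copy of the Shuffler
    # (which draws from Python's `random` module, see further down) produces
    # the same permutation and the 96 shards (8 DDP processes * 12 workers)
    # stay disjoint...
    random.seed(42)
    # ...but everything seeded here is now also identical across workers and
    # across epochs, so torch-based random transforms repeat as well.
    torch.manual_seed(42)

loader = DataLoader(good_dp, batch_size=32, num_workers=12,
                    worker_init_fn=worker_init_fn)
```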
Note that TF shuffles the dataset before storing it. We might do something similar, but that would still not solve the issue for custom user datasets.
Size of the batches at the end of an epoch
Some quick results (same experiment as above):
- with drop_last=True: Acc@1 = 47%
- with drop_last=False: Acc@1 = 11%
Near the end of the epoch, the dataloader with a DP will produce a lot of batches of size 1 if drop_last is False. See the last batches of an epoch over the indices [0, len(imagenet)) with a requested batch size of 32: https://pastebin.com/wjS7YC90. In contrast, this does not happen when using an indexable dataset: https://pastebin.com/Rje0U8Dx.
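A back-of-the-envelope check makes the size-1 batches unsurprising (assuming the 1,281,167 training images are split as evenly as possible across the 8 * 12 = 96 workers, and that each worker batches its own stream independently):

```python
# Each worker gets floor(n / 96) or ceil(n / 96) samples and batches them on
# its own, so every worker ends the epoch with its own tiny leftover batch.
n = 1_281_167            # ImageNet train set size
num_workers = 8 * 12
batch_size = 32

per_worker = [n // num_workers + (1 if w < n % num_workers else 0)
              for w in range(num_workers)]
tail_sizes = [p % batch_size for p in per_worker]
print(sorted(set(tail_sizes)))   # [1, 2] -> 96 leftover batches of size 1 or 2
```

Compare that to the indexable-dataset path, where batching happens over one global index stream and there is only a single leftover batch per epoch.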
I’m not too sure why this has such a dramatic impact, but it’s possible that it has to do with batch-norm, as @fmassa pointed out offline. Using drop_last makes sure that the 1-sized batches are eliminated, producing a much better accuracy.
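For completeness, a minimal sketch of that mitigation (batch size and worker count are the ones from the experiment; `good_dp` is the hypothetical pipeline sketched earlier):

```python
from torch.utils.data import DataLoader

# With a datapipe, batching happens per worker, so drop_last=True drops each
# worker's incomplete tail batch -- exactly the size-1/size-2 batches above.
loader = DataLoader(good_dp, batch_size=32, num_workers=12, drop_last=True)
```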
I guess the conclusion here is that it’s worth unifying the behaviour of the DataLoader for both DPs and regular indexable datasets regarding batch sizes, because with an indexable dataset and drop_last=False we still get ~47% accuracy.
Yeah, it will be one TODO in the proposal. I will make the RNG attached to the shuffler seedable by torch.
@ejguan and I just had a chat offline where we discussed some of the points above. Here’s a summary of our discussion thus far. The points below either re-hash or update / correct the ones above. @ejguan please feel free to edit / correct if this isn’t accurate. And thanks again for your time and all your work on this!
DataLoader:
- Regarding the RNG of the workers: each worker will have its own RNG, different from the other workers. This allows (among other things) each worker to apply different random transforms.
- torch.manual_seed() will allow reproducible results for all RNGs that come from torch. But the first point above still applies: while the RNGs will be reproducible, each worker will still get a different RNG.

Shuffler:
- The Shuffler’s RNG currently comes from Python’s random builtin module. Do you think it would make sense to allow torch.manual_seed() to also control that RNG? From a user perspective, I feel like it should be expected that torch.manual_seed() affects all random components that come from PyTorch, including the Shuffler. In addition, the current RandomSampler and DistributedSampler(shuffle=True) are all controllable through torch.manual_seed. Perhaps we could still keep the implementation based on Python’s builtin random, but allow torch.manual_seed() to control it as well?
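To illustrate the asymmetry discussed here, a small sketch (assuming the Shuffler still draws from the module-level random state, as described above):

```python
import random
import torch
from torch.utils.data import RandomSampler
from torchdata.datapipes.iter import IterableWrapper

data = list(range(10))

# RandomSampler draws from torch's RNG, so torch.manual_seed() pins its order.
torch.manual_seed(0)
print(list(RandomSampler(data)))

# The Shuffler draws from Python's builtin `random` module instead, so
# torch.manual_seed() alone does not pin it; `random` must be seeded separately.
random.seed(0)
print(list(IterableWrapper(data).shuffle(buffer_size=len(data))))
```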