Notes on shuffling, sharding, and batch size
(I’m writing this down here to have a written trace, but I’m looking forward to discussing this with you all in our upcoming meetings 😃)
I spent some time porting the torchvision training recipes to use datapipes, and I noticed that the model I trained on ImageNet with DPs was much less accurate than the one trained with regular datasets. After a lot of digging, I came to the following conclusions:
- the datapipe must be shuffled before it is sharded
- the DataLoader does not behave in the same way with a datapipe and with a regular indexable dataset, in particular when it comes to the size of the last batches in an epoch. This has a dramatic effect on accuracy (probably because of batch-norm).
Details below. Note: for sharding, I used this custom torchvision sharder, which takes both DDP and DataLoader workers into account, plus the TakerIterDataPipe below it.
Shuffle before shard
First, some quick results (training a resnext50_32x4d for 5 epochs with 8 GPUs and 12 workers per GPU):
- Shuffle before shard: Acc@1 = 47% – this is on par with the regular indexable dataset version (phew!!)
- Shuffle after shard: Acc@1 = 2%
One way to explain this is that if we shuffle after we shard, then only sub-parts of the dataset get shuffled: each of the 8 * 12 = 96 DataLoader workers receives ~1/96th of the dataset, and each of these parts gets shuffled independently. This means the shuffling is far from uniform, and for datasets whose layout is all_samples_from_class1, all_samples_from_class2, ... all_samples_from_classN, it’s possible that some class i never appears in the same batch as class j.
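For illustration, here is a minimal sketch of the two orderings using the stock `sharding_filter` from torchdata (the experiment above used the custom torchvision sharder mentioned earlier, which also accounts for DDP ranks); `samples` is a hypothetical stand-in for the ImageNet file listing:

```python
from torchdata.datapipes.iter import IterableWrapper

# Hypothetical stand-in for the (class-sorted) list of ImageNet samples.
samples = list(range(100_000))

# Shuffle before shard: the shuffle sees the full dataset, and the sharding
# filter then keeps a disjoint ~1/96th of the shuffled stream in each worker.
good_dp = IterableWrapper(samples).shuffle(buffer_size=len(samples)).sharding_filter()

# Shuffle after shard: each worker first keeps its ~1/96th slice of the
# class-sorted dataset and only then shuffles it, so batches stay class-local.
bad_dp = IterableWrapper(samples).sharding_filter().shuffle(buffer_size=len(samples))
```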
So it looks like we need to shuffle before we shard. Now, if we shuffle before sharding, we still need to make sure that all of the 96 workers shuffle the dataset with the same RNG. Otherwise we risk sampling a given sample in more than one worker, or not at all. For that to happen, one can set a random seed in worker_init_fn, but that causes a second problem: the random transformations of each worker will also be the same, and this will lead to slightly less accurate results; on top of that, all epochs will start with the same seed, so the shuffling is the same across all epochs. I do not know how to solve this problem yet.
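Here is a rough sketch of that workaround and its side effect (this is a hypothetical wiring, not the recipe’s actual code; `good_dp` is the shuffle-then-shard pipeline sketched above):

```python
import random
import torch
from torch.utils.data import DataLoader

def worker_init_fn(worker_id):
    # Give every worker the same seed so that every copy of the Shuffler
    # (which draws from Python's `random` module, see further down) produces
    # the same permutation and the 96 shards (8 DDP processes * 12 workers)
    # stay disjoint...
    random.seed(42)
    # ...but everything seeded here is now also identical across workers and
    # across epochs, so torch-based random transforms repeat as well.
    torch.manual_seed(42)

loader = DataLoader(good_dp, batch_size=32, num_workers=12,
                    worker_init_fn=worker_init_fn)
```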
Note that TF shuffles the dataset before storing it. We might do something similar, but that would still not solve the issue for custom user datasets.
Size of the batches at the end of an epoch
Some quick results (same experiment as above):
- with drop_last=True: Acc@1 = 47%
- with drop_last=False: Acc@1 = 11%
Near the end of the epoch, the dataloader with a DP will produce a lot of batches of size 1 if drop_last is False. See the last batches of an epoch over the indices [0, len(imagenet)) with a requested batch size of 32: https://pastebin.com/wjS7YC90. In contrast, this does not happen when using an indexable dataset: https://pastebin.com/Rje0U8Dx.
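A back-of-the-envelope check makes the size-1 batches unsurprising (assuming the 1,281,167 training images are split as evenly as possible across the 8 * 12 = 96 workers, and that each worker batches its own stream independently):

```python
# Each worker gets floor(n / 96) or ceil(n / 96) samples and batches them on
# its own, so every worker ends the epoch with its own tiny leftover batch.
n = 1_281_167            # ImageNet train set size
num_workers = 8 * 12
batch_size = 32

per_worker = [n // num_workers + (1 if w < n % num_workers else 0)
              for w in range(num_workers)]
tail_sizes = [p % batch_size for p in per_worker]
print(sorted(set(tail_sizes)))   # [1, 2] -> 96 leftover batches of size 1 or 2
```

Compare that to the indexable-dataset path, where batching happens over one global index stream and there is only a single leftover batch per epoch.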
I’m not too sure why this has such a dramatic impact, but it’s possible that it has to do with batch-norm, as @fmassa pointed out offline. Using drop_last makes sure that the 1-sized batches are eliminated, producing a much better accuracy.
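For completeness, a minimal sketch of that mitigation (batch size and worker count are the ones from the experiment; `good_dp` is the hypothetical pipeline sketched earlier):

```python
from torch.utils.data import DataLoader

# With a datapipe, batching happens per worker, so drop_last=True drops each
# worker's incomplete tail batch -- exactly the size-1/size-2 batches above.
loader = DataLoader(good_dp, batch_size=32, num_workers=12, drop_last=True)
```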
I guess the conclusion here is that it’s worth unifying the behaviour of the DataLoader for both DPs and regular indexable datasets regarding batch sizes, because with an indexable dataset and drop_last=False we still get ~47% accuracy.
Yeah, it will be one TODO in the proposal. I will make the RNG attached to the shuffler seedable by torch.
@ejguan and I just had a chat offline where we discussed some of the points above. Here’s a summary of our discussion thus far. The points below either re-hash or update / correct the ones above. @ejguan please feel free to edit / correct if this isn’t accurate. And thanks again for your time and all your work on this!
DataLoader:
- Regarding the RNG of the workers: each worker will have its own RNG, different from the other workers. This allows (among other things) each worker to apply different random transforms.
- torch.manual_seed() will allow reproducible results for all RNGs that come from torch. But the first point above still applies: while the RNGs will be reproducible, each worker will still get a different RNG.

Shuffler:
- The Shuffler’s RNG currently comes from Python’s random builtin module. Do you think it would make sense to allow torch.manual_seed() to also control that RNG? From a user perspective, I feel like it should be expected that torch.manual_seed() affects all random components that come from PyTorch, including the Shuffler. In addition, the current RandomSampler and DistributedSampler(shuffle=True) are all controllable through torch.manual_seed. Perhaps we could still keep the implementation based on Python’s builtin random, but allow torch.manual_seed() to control it as well?
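To illustrate the asymmetry discussed here, a small sketch (assuming the Shuffler still draws from the module-level random state, as described above):

```python
import random
import torch
from torch.utils.data import RandomSampler
from torchdata.datapipes.iter import IterableWrapper

data = list(range(10))

# RandomSampler draws from torch's RNG, so torch.manual_seed() pins its order.
torch.manual_seed(0)
print(list(RandomSampler(data)))

# The Shuffler draws from Python's builtin `random` module instead, so
# torch.manual_seed() alone does not pin it; `random` must be seeded separately.
random.seed(0)
print(list(IterableWrapper(data).shuffle(buffer_size=len(data))))
```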