Evaluation code of references is slightly off
There is a subtle known bug in the evaluation code of the classification references (and other references as well, but not all):
It deserves some attention, because it’s easy to miss and yet can impact our reported results, and those of research papers.
As the comment in the evaluation code describes, when computing the accuracy of the model on a validation set in a distributed setting, some images will be counted more than once if `len(dataset)` isn’t divisible by `batch_size * world_size`[^bignote].
On top of that, since the `test_sampler` uses `shuffle=True` by default, the duplicated images aren’t even the same across executions, which means that evaluating the same model on the same dataset can lead to different results every time.
Should we try to fix this, or should we just leave it and wait for the new Lightning recipes to handle it? And as a follow-up question, is there a built-in way in Lightning to mitigate this at all? (I’m not familiar with Lightning, so this one may not make sense.)
[^bignote]: For example if we have 10 images and 2 workers with a batch_size of 3, we will have something like:
```
worker1: img1, img2, img3
worker2: img4, img5, img6
worker1: img7, img8, img9
worker2: img10, **img1, img2**
^^^^^^^^^
"padding": duplicated images which will affect the validation accuracy
```
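To make the footnote’s scenario concrete, here is a minimal sketch (not part of the original issue) showing how `DistributedSampler` pads the index list so that every rank gets the same number of samples; with 10 samples and 3 processes, 2 indices get repeated:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples, 3 processes below

all_indices = []
for rank in range(3):
    # shuffle=False only to make the padding easy to see; the references use
    # the default shuffle=True, which is why the duplicates change across runs
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    indices = list(sampler)
    print(f"rank {rank}: {indices}")
    all_indices.extend(indices)

# 12 indices are produced for only 10 samples: the 2 extra ones are the
# "padding" duplicates described above.
print(len(all_indices), "indices for", len(dataset), "samples")
```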
cc @datumbox
Top GitHub Comments
OK, things are a bit clearer to me now.
As @fmassa suggested (thanks!), the variance might come from the non-deterministic algorithms that are in use. I set the following:
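The snippet the comment refers to isn’t preserved in this archive; judging from the rest of the thread (flags later said to be set to True/False), it was presumably the cuDNN determinism settings, along these lines (an assumption, not the verbatim code):

```python
import torch

# Assumed reconstruction, not the verbatim snippet from the comment:
# make cuDNN pick deterministic algorithms instead of benchmarking for the
# fastest (potentially non-deterministic) ones.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```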
and now I’m getting consistent results across batch sizes. I also patched the code to figure out how many samples are effectively processed. Here are the results for different numbers of processes, all of them using the default batch_size = 32:
So:

- The assumption that we need `len(dataset) % (batch_size * world_size) == 0` is wrong. Sometimes, the DataLoader can reduce the batch size of the last batches so that exactly `len(dataset)` samples are processed.
- Setting `world_size == 1` as @fmassa suggested above should indeed always process exactly `len(dataset)` samples, no matter the batch size.
- The number of duplicated samples varies across `world_size * batch_size` values, but it’s not as high as what the previous analysis in https://github.com/pytorch/vision/issues/4559#issuecomment-939974184 would suggest. That being said, for other dataset sizes the number of duplicated samples may be higher. I might be wrong but I think that in the worst case, we can have at least `world_size - 1` duplicated samples. For small dataset sizes this might impact the result quite a bit, but this doesn’t matter too much for our datasets.

Thanks both for your input!!
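As an aside (not part of the original comment): with the default `DistributedSampler` behaviour (`drop_last=False`), the duplicate count can be computed directly and depends only on `len(dataset)` and `world_size`, not on the batch size, which is consistent with the bullets above. A minimal sketch:

```python
import math

def num_duplicated_samples(dataset_len: int, world_size: int) -> int:
    # DistributedSampler (drop_last=False) pads the index list so that every
    # rank gets ceil(dataset_len / world_size) indices; everything beyond
    # dataset_len is a duplicate. The per-rank DataLoader simply shrinks its
    # last batch, so batch_size doesn't change this count.
    per_rank = math.ceil(dataset_len / world_size)
    return per_rank * world_size - dataset_len

print(num_duplicated_samples(10, 1))     # 0
print(num_duplicated_samples(10, 3))     # 2
print(num_duplicated_samples(50000, 8))  # 0 (divides evenly)
```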
Considering most of the variance is captured by disabling stochastic algorithms as above, I would suggest to just set these flags to True if `test_only` is True, and to keep https://github.com/pytorch/vision/pull/4600 in the back of our mind for the next version of the references / recipes.

I think we could also raise a warning if not exactly `len(dataset)` samples have been processed, to let the user know that the results might be slightly biased. This would require a small patch like this:
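The patch itself isn’t preserved in this archive; a hypothetical sketch of the idea (the helper name and its wiring are illustrative, not the actual code) could look like this:

```python
import warnings

import torch
import torch.distributed as dist

def check_num_processed_samples(num_processed_samples, dataset):
    # Warn if evaluation processed a different number of samples than len(dataset).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    count = torch.tensor(num_processed_samples, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(count)  # sum the per-process sample counts
    if hasattr(dataset, "__len__") and count.item() != len(dataset):
        warnings.warn(
            f"{count.item()} samples were processed but the dataset contains "
            f"{len(dataset)}; some samples were likely duplicated by the "
            "DistributedSampler, so the reported results may be slightly biased."
        )
```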
Doh… Yes I meant `False`. I’ll edit in place to avoid confusion of future readers. :p