Evaluation code of references is slightly off
There is a subtle known bug in the evaluation code of the classification references (and other references as well, but not all):
It deserves some attention, because it’s easy to miss and yet can impact our reported results, and those of research papers.
As the comment in the evaluation code describes, when computing the accuracy of the model on a validation set in a distributed setting, some images will be counted more than once if `len(dataset)` isn’t divisible by `batch_size * world_size`[^bignote].
On top of that, since the `test_sampler` uses `shuffle=True` by default, the duplicated images aren’t even the same across executions, which means that evaluating the same model on the same dataset can lead to different results every time.
Should we try to fix this, or should we just leave it and wait for the new Lightning recipes to handle it? And as a follow-up question, is there a built-in way in Lightning to mitigate this at all? (I’m not familiar with Lightning, so this one may not make sense.)
[^bignote]: For example if we have 10 images and 2 workers with a batch_size of 3, we will have something like:
```
worker1: img1, img2, img3
worker2: img4, img5, img6
worker1: img7, img8, img9
worker2: img10, **img1, img2**
^^^^^^^^^
"padding": duplicated images which will affect the validation accuracy
```
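To make the footnote’s scenario concrete, here is a minimal sketch (not part of the original issue) showing how `DistributedSampler` pads the index list so that every rank gets the same number of samples; with 10 samples and 3 processes, 2 indices get repeated:

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 samples, 3 processes below

all_indices = []
for rank in range(3):
    # shuffle=False only to make the padding easy to see; the references use
    # the default shuffle=True, which is why the duplicates change across runs
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    indices = list(sampler)
    print(f"rank {rank}: {indices}")
    all_indices.extend(indices)

# 12 indices are produced for only 10 samples: the 2 extra ones are the
# "padding" duplicates described above.
print(len(all_indices), "indices for", len(dataset), "samples")
```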
cc @datumbox
Top GitHub Comments
OK, things are a bit clearer to me now.
As @fmassa suggested (thanks!), the variance might come from the non-deterministic algorithms that are in use. I set the following:
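The snippet the comment refers to isn’t preserved in this archive; judging from the rest of the thread (flags later said to be set to True/False), it was presumably the cuDNN determinism settings, along these lines (an assumption, not the verbatim code):

```python
import torch

# Assumed reconstruction, not the verbatim snippet from the comment:
# make cuDNN pick deterministic algorithms instead of benchmarking for the
# fastest (potentially non-deterministic) ones.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
```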
and now I’m getting consistent results across batch sizes. I also patched the code to figure out how many samples are effectively processed. Here are the results for different numbers of processes, all of them using the default batch_size = 32:
So:

- The assumption that we need `len(dataset) % (batch_size * world_size) == 0` is wrong. Sometimes, the DataLoader can reduce the batch size of the last batches so that exactly `len(dataset)` samples are processed.
- Setting `world_size == 1` as @fmassa suggested above should indeed always process exactly `len(dataset)` samples, no matter the batch size.
- The number of duplicated samples varies across `world_size * batch_size` values, but it’s not as high as what the previous analysis in https://github.com/pytorch/vision/issues/4559#issuecomment-939974184 would suggest. That being said, for other dataset sizes the number of duplicated samples may be higher. I might be wrong but I think that in the worst case, we can have at least `world_size - 1` duplicated samples. For small dataset sizes this might impact the result quite a bit, but this doesn’t matter too much for our datasets.

Thanks both for your input!!
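As an aside (not part of the original comment): with the default `DistributedSampler` behaviour (`drop_last=False`), the duplicate count can be computed directly and depends only on `len(dataset)` and `world_size`, not on the batch size, which is consistent with the bullets above. A minimal sketch:

```python
import math

def num_duplicated_samples(dataset_len: int, world_size: int) -> int:
    # DistributedSampler (drop_last=False) pads the index list so that every
    # rank gets ceil(dataset_len / world_size) indices; everything beyond
    # dataset_len is a duplicate. The per-rank DataLoader simply shrinks its
    # last batch, so batch_size doesn't change this count.
    per_rank = math.ceil(dataset_len / world_size)
    return per_rank * world_size - dataset_len

print(num_duplicated_samples(10, 1))     # 0
print(num_duplicated_samples(10, 3))     # 2
print(num_duplicated_samples(50000, 8))  # 0 (divides evenly)
```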
Considering most of the variance is captured by disabling stochastic algorithms as above, I would suggest to just set these flags to True if `test_only` is True, and to keep https://github.com/pytorch/vision/pull/4600 in the back of our mind for the next version of the references / recipes.

I think we could also raise a warning if not exactly `len(dataset)` samples have been processed, to let the user know that the results might be slightly biased. This would require a small patch like this:
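The patch itself isn’t preserved in this archive; a hypothetical sketch of the idea (the helper name and its wiring are illustrative, not the actual code) could look like this:

```python
import warnings

import torch
import torch.distributed as dist

def check_num_processed_samples(num_processed_samples, dataset):
    # Warn if evaluation processed a different number of samples than len(dataset).
    device = "cuda" if torch.cuda.is_available() else "cpu"
    count = torch.tensor(num_processed_samples, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(count)  # sum the per-process sample counts
    if hasattr(dataset, "__len__") and count.item() != len(dataset):
        warnings.warn(
            f"{count.item()} samples were processed but the dataset contains "
            f"{len(dataset)}; some samples were likely duplicated by the "
            "DistributedSampler, so the reported results may be slightly biased."
        )
```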
Doh… Yes I meant `False`. I’ll edit in place to avoid confusion of future readers. :p