
Evaluation code of references is slightly off

See the original GitHub issue: https://github.com/pytorch/vision/issues/4559

There is a subtle known bug in the evaluation code of the classification references (and other references as well, but not all):

https://github.com/pytorch/vision/blob/261cbf7e939a253ea316b74bec8dbba58155ab4e/references/classification/train.py#L65-L66

It deserves some attention, because it’s easy to miss and yet can impact our reported results, and those of research papers.

As the comment above describes, when computing the accuracy of the model on a validation set in a distributed setting, some images will be counted more than once if len(dataset) isn’t divisible by batch_size * world_size [^bignote].

On top of that, since the test_sampler uses shuffle=True by default, the duplicated images aren’t even the same across executions, which means that evaluating the same model on the same dataset can lead to different results every time.
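
To make the mechanism concrete, here is a minimal sketch (my own illustration, not the reference code): it simulates three ranks with a plain DistributedSampler over a 10-sample dataset, and shows both that the index list gets padded with duplicates and that shuffling changes which samples are duplicated.

```python
# Minimal sketch: simulate 3 ranks over a 10-sample dataset to see where the
# duplicated indices come from.
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))

def gather_indices(shuffle, seed=0):
    indices = []
    for rank in range(3):
        sampler = DistributedSampler(
            dataset, num_replicas=3, rank=rank, shuffle=shuffle, seed=seed
        )
        indices += list(sampler)
    return sorted(indices)

# 3 ranks x 4 samples each = 12 indices for only 10 samples,
# so two indices are always evaluated twice.
print(gather_indices(shuffle=False))        # [0, 0, 1, 1, 2, 3, ..., 9]
# With shuffling, *which* indices are duplicated depends on the seed (or epoch).
print(gather_indices(shuffle=True, seed=0))
print(gather_indices(shuffle=True, seed=1))
```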

Should we try to fix this, or should we just leave it and wait for the new lightning recipes to handle it? And as a follow-up question, is there a builtin way in lightning to mitigate this at all? (I’m not familiar with lightning, so this one may not make sense.)

[^bignote]: For example, if we have 10 images and 2 workers with a batch_size of 3, we will have something like:

```
worker1: img1, img2, img3
worker2: img4, img5, img6
worker1: img7, img8, img9
worker2: img10, img1, img2
                ^^^^^^^^^^
"padding": duplicated images, which will affect the validation accuracy
```

cc @datumbox

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

1 reaction
NicolasHug commented, Oct 12, 2021

OK, things are a bit clearer to me now.

As @fmassa suggested (thanks!), the variance might come from the non-deterministic algorithms that are in use. I set the following:

    # disable cuDNN autotuning, which may select different kernels across runs
    torch.backends.cudnn.benchmark = False
    # error out if an op has no deterministic implementation
    torch.use_deterministic_algorithms(True)
    # force cuDNN to use deterministic convolution algorithms
    torch.backends.cudnn.deterministic = True

and now I’m getting consistent results across batch sizes. I also patched the code to figure out how many samples are actually processed. Here are the results for different numbers of processes, all of them using the default batch size of 32:

```
Test:  Acc@1 69.762 Acc@5 89.076, 50000 samples processed -- world_size = 1 or 2 or 4 or 5 or 8
Test:  Acc@1 69.760 Acc@5 89.075, 50004 samples processed -- world_size = 6
Test:  Acc@1 69.761 Acc@5 89.074, 50001 samples processed -- world_size = 3 or 7
```

So:

  • my previous understanding that we really need to have len(dataset) % (batch_size * world_size) == 0 is wrong. Sometimes, the DataLoader can reduce the batch size of the last batches so that exactly len(dataset) samples are processed. Setting world_size == 1 as @fmassa suggested above should indeed always process exactly len(dataset) samples, no matter the batch size.
  • There’s still some visible variance in the results across world_size * batch_size values, but it’s not as high as what the previous analysis in https://github.com/pytorch/vision/issues/4559#issuecomment-939974184 would suggest. That being said, for other dataset sizes the number of duplicated samples may be higher: I might be wrong, but I think that in the worst case we can have at most world_size - 1 duplicated samples (see the quick check below). For small dataset sizes this might impact the result quite a bit, but it doesn’t matter too much for our datasets.
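
For reference, a quick back-of-the-envelope check (my own arithmetic, assuming the sampler simply rounds the 50,000-image val set up to a multiple of world_size) reproduces the processed-sample counts above:

```python
# Number of processed samples if the sampler pads the dataset up to a
# multiple of world_size: ceil(n / world_size) * world_size.
import math

n = 50000  # ImageNet val set size
for world_size in range(1, 9):
    print(world_size, math.ceil(n / world_size) * world_size)
# -> 50000 for world_size in {1, 2, 4, 5, 8}, 50001 for {3, 7}, 50004 for {6}
```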

Thanks both for your input!!

Considering most of the variance is eliminated by disabling the non-deterministic algorithms as above, I would suggest simply setting these flags when test_only is True, and keeping https://github.com/pytorch/vision/pull/4600 in the back of our mind for the next version of the references / recipes.
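
Concretely, that suggestion would look something like this (a sketch against the reference scripts, assuming the existing args.test_only flag):

```python
# Sketch of the suggestion above: only force determinism when evaluating,
# so training keeps the faster non-deterministic kernels.
if args.test_only:
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)
```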

I think we could also raise a warning if not exactly len(dataset) samples have been processed, to let the user know that the results might be slightly biased. This would require a small patch like this:

```diff
diff --git a/references/classification/train.py b/references/classification/train.py
index a71d337a..de520fb3 100644
--- a/references/classification/train.py
+++ b/references/classification/train.py
@@ -54,6 +54,13 @@ def evaluate(model, criterion, data_loader, device, print_freq=100, log_suffix="
     model.eval()
     metric_logger = utils.MetricLogger(delimiter="  ")
     header = f"Test: {log_suffix}"
+    def _reduce(val):
+        val = torch.tensor([val], dtype=torch.int, device="cuda")
+        torch.distributed.barrier()
+        torch.distributed.all_reduce(val)
+        return val.item()
+
+    n_samples = 0
     with torch.no_grad():
         for image, target in metric_logger.log_every(data_loader, print_freq, header):
             image = image.to(device, non_blocking=True)
@@ -68,7 +75,12 @@ def evaluate(model, criterion, data_loader, device, print_freq=100, log_suffix="
             metric_logger.update(loss=loss.item())
             metric_logger.meters["acc1"].update(acc1.item(), n=batch_size)
             metric_logger.meters["acc5"].update(acc5.item(), n=batch_size)
+            n_samples += batch_size
     # gather the stats from all processes
+
+    n_samples = _reduce(n_samples)
+    print(f"We processed {n_samples} in total")
```
1 reaction
datumbox commented, Oct 11, 2021

> Did you mean shuffle=False?

Doh… Yes I meant False. I’ll edit in place to avoid confusion of future readers. :p
