
test_nms_cuda is flaky


🐛 Bug

https://app.circleci.com/pipelines/github/pytorch/vision/2097/workflows/661fd235-202a-4c88-be4d-f8af378c195f/jobs/110511

================================== FAILURES ===================================
___________________________ NMSTester.test_nms_cuda ___________________________

self = <test_ops.NMSTester testMethod=test_nms_cuda>

    @unittest.skipIf(not torch.cuda.is_available(), "CUDA unavailable")
    def test_nms_cuda(self):
        err_msg = 'NMS incompatible between CPU and CUDA for IoU={}'
    
        for iou in [0.2, 0.5, 0.8]:
            boxes, scores = self._create_tensors_with_iou(1000, iou)
            r_cpu = ops.nms(boxes, scores, iou)
            r_cuda = ops.nms(boxes.cuda(), scores.cuda(), iou)
    
>           self.assertTrue(torch.allclose(r_cpu, r_cuda.cpu()), err_msg.format(iou))
E           RuntimeError: The size of tensor a (461) must match the size of tensor b (460) at non-singleton dimension 0

test\test_ops.py:403: RuntimeError

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 13 (13 by maintainers)

Top GitHub Comments

1 reaction
hartb commented, Apr 6, 2020

I looked at this some more…

If devIoU() is changed as discussed in https://github.com/pytorch/vision/pull/2044 (use a division while calculating IoU, similar to the CPU calculation), then it’s possible to compare the overlap values calculated by CPU vs CUDA.
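
For orientation, here is a plain-Python sketch of the division-based IoU that the CPU path computes and that the proposed devIoU() change mirrors (the real code lives in torchvision’s C++/CUDA kernels; this is only the formula):

    def iou(a, b):
        # a, b: boxes as (x1, y1, x2, y2)
        inter_w = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
        inter_h = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
        inter = inter_w * inter_h
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        # the division at issue: IoU = intersection / union
        return inter / (area_a + area_b - inter)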

With that change, overlap values calculated by CPU vs CUDA usually agree exactly, but they can differ from one another by up to 4 ULP (very rarely). Across 1000 seeds:

ULP diff   # instances
0          828
1           16
2          115
3           40
4            1
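
For reference, a minimal sketch of how ULP distances like these can be tallied, assuming the per-pair overlap values from both kernels have been dumped into float32 arrays (overlaps_cpu and overlaps_gpu are hypothetical names):

    import numpy as np

    def ulp_diff(a, b):
        # For finite float32 values of the same sign (overlaps are all >= 0),
        # the gap between the raw bit patterns, reinterpreted as int32, is
        # exactly the number of representable floats between the two values.
        ai = np.asarray(a, dtype=np.float32).view(np.int32)
        bi = np.asarray(b, dtype=np.float32).view(np.int32)
        return np.abs(ai - bi)

    # e.g.: np.bincount(ulp_diff(overlaps_cpu, overlaps_gpu))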

Even when the calculated overlaps differ, the test case will still pass unless the overlap values straddle the threshold (i.e., one overlap is greater than the threshold and the other is not).
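
To make the straddling concrete: the sketch below (my illustration, not the actual kernel code) compares two algebraically equivalent ways of testing overlap > threshold in float32: dividing first, as the CPU code does, versus folding the threshold into a multiply (a common way to avoid a division on the GPU). For ratios within an ULP or two of the threshold, the two predicates disagree on a small but nonzero fraction of inputs:

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.float32(0.2)  # one of the IoU thresholds used by the test

    # Build pairs whose true inter/union ratio sits within ~1 ULP of t.
    union = rng.uniform(1.0, 100.0, 1_000_000).astype(np.float32)
    inter = (union.astype(np.float64) * np.float64(t)).astype(np.float32)
    inter_lo = np.nextafter(inter, np.float32(-np.inf))
    inter_hi = np.nextafter(inter, np.float32(np.inf))

    disagree = 0
    for m in (inter_lo, inter, inter_hi):
        div_keep = m / union > t   # divide first (CPU-style)
        mul_keep = m > t * union   # multiply the threshold through
        disagree += int((div_keep != mul_keep).sum())
    print(disagree, "disagreements out of", 3 * union.size)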

If NVCC’s fused multiply-add optimization is disabled (e.g., by adding --fmad=false to NVCC_FLAGS), then the overlaps calculated by CPU and CUDA agree exactly, and the test case does not fail (even if PR 2044’s change to _create_tensors_with_iou() is reverted).

Disabling fmad without changing devIoU() improves things (reduces failure rate to about 25%), but does not prevent the problem entirely.

NVCC enables fmad by default, presumably because it benefits performance.
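
To illustrate why fmad changes results at all: a fused multiply-add rounds a*b + c once, while the unfused sequence rounds the product before the addition, and --fmad=false forbids the compiler from contracting expressions into the fused form. A small sketch that simulates a float32 FMA through float64 (the product of two float32 values is exact in float64; note that a true FMA rounds exactly once, whereas this simulation can double-round in rare corner cases):

    import numpy as np

    a = np.float32(1.0) + np.float32(2.0 ** -13)   # 1 + 2^-13
    b = np.float32(1.0) - np.float32(2.0 ** -13)   # 1 - 2^-13
    c = np.float32(-1.0)

    # Unfused: the exact product 1 - 2^-26 is rounded to float32 first,
    # which gives exactly 1.0, so the subsequent addition yields 0.0.
    unfused = a * b + c

    # "Fused": keep the product exact, add, and round only once at the end;
    # the -2^-26 term survives.
    fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

    print(unfused, fused)   # 0.0 vs. approximately -1.49e-08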

Disabling NVCC’s precise-division optimization (--prec-div) did not affect the results at all.

To summarize:

  • small benefit (~12%) just from updating devIoU() to use a similar division
  • large benefit (~75%) just from disabling NVCC’s fmad optimization
  • complete benefit from doing both
  • disabling fmad may affect performance

So maybe…

  • update devIoU() to use a similar division as the CPU code
  • leave fmad at its default (enabled)
  • keep PR 2044’s changes to _create_tensors_with_iou(), which force the generated data a bit away from the threshold

If that sounds OK, I can put up a PR for the devIoU() change.

0 reactions
fmassa commented, Apr 7, 2020

Thanks @hartb! This has been fixed now.
