test_nms_cuda is flaky
🐛 Bug
================================== FAILURES ===================================
___________________________ NMSTester.test_nms_cuda ___________________________

self = <test_ops.NMSTester testMethod=test_nms_cuda>

    @unittest.skipIf(not torch.cuda.is_available(), "CUDA unavailable")
    def test_nms_cuda(self):
        err_msg = 'NMS incompatible between CPU and CUDA for IoU={}'
        for iou in [0.2, 0.5, 0.8]:
            boxes, scores = self._create_tensors_with_iou(1000, iou)
            r_cpu = ops.nms(boxes, scores, iou)
            r_cuda = ops.nms(boxes.cuda(), scores.cuda(), iou)
>           self.assertTrue(torch.allclose(r_cpu, r_cuda.cpu()), err_msg.format(iou))
E           RuntimeError: The size of tensor a (461) must match the size of tensor b (460) at non-singleton dimension 0

test\test_ops.py:403: RuntimeError
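For context, the failure surfaces as a RuntimeError rather than an assertion message because torch.allclose broadcasts its inputs: when the CPU and CUDA NMS keep different numbers of boxes (461 vs 460 here), the result tensors cannot be broadcast together. A minimal illustration:

```python
import torch

# torch.allclose cannot broadcast tensors of length 461 and 460,
# so the comparison raises instead of returning False.
torch.allclose(torch.zeros(461), torch.zeros(460))
# RuntimeError: The size of tensor a (461) must match the size of tensor b (460) ...
```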
I looked at this some more…

If devIoU() is changed as discussed in https://github.com/pytorch/vision/pull/2044 (use a division while calculating IoU, similar to the CPU calculation), then it's possible to compare the overlap values calculated by CPU vs CUDA.

With that change, the overlap values calculated by CPU and CUDA usually agree exactly, but very rarely (over 1000 seeds tried) they can differ from one another by up to 4 ULP.

Even when the calculated overlaps differ, the testcase will still pass unless the overlap values straddle the threshold (i.e. one overlap greater than the threshold, the other not).
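For illustration, here is a tiny float32 sketch of that straddle case; the numbers are made up rather than taken from the test. A one-ULP difference only matters when the CPU and CUDA overlaps land on opposite sides of the IoU threshold, at which point the two implementations keep different boxes and the kept-index tensors end up with different lengths.

```python
import numpy as np

# Hypothetical overlap values, one ULP apart, sitting right at the threshold.
thresh = np.float32(0.5)
iou_cpu = np.float32(0.5)                          # e.g. the CPU overlap lands exactly on the threshold
iou_cuda = np.nextafter(iou_cpu, np.float32(1.0))  # the CUDA overlap is one ULP higher

# The > comparison flips, so one implementation suppresses the box and the other keeps it.
print(iou_cpu > thresh, iou_cuda > thresh)         # False True
```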
If NVCC's fused multiply-add optimization is disabled (e.g. by adding --fmad=false to NVCC_FLAGS), then the overlaps calculated by CPU and CUDA agree exactly, and the testcase does not fail (even if PR 2044's change to _create_tensors_with_iou() is reverted).
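The NVCC_FLAGS above refers to torchvision's own build setting. As a generic illustration of the mechanism, here is how such an nvcc flag could be passed to PyTorch's extension builder; this is only a sketch with hypothetical names, not torchvision's actual setup.py.

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="nms_ext",  # hypothetical extension name
    ext_modules=[
        CUDAExtension(
            name="nms_ext",
            sources=["nms.cpp", "nms_kernel.cu"],  # hypothetical source files
            extra_compile_args={
                "cxx": ["-O3"],
                # Disable nvcc's fused multiply-add so CUDA rounding matches the CPU.
                "nvcc": ["-O3", "--fmad=false"],
            },
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```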
Disabling fmad without changing devIoU() improves things (reduces the failure rate to about 25%), but does not prevent the problem entirely.

NVCC enables fmad by default. It presumably benefits performance.
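As a rough illustration of why fmad changes results: a fused multiply-add rounds once where a separate multiply and add round twice. The Python sketch below emulates the fused rounding with float64 (exact for these inputs); the real effect in the kernel comes from nvcc contracting the float32 operations, but the rounding behaviour is analogous.

```python
import numpy as np

x = np.float32(1.0 + 2.0 ** -12)   # exactly representable in float32
one = np.float32(1.0)

# Separate multiply then subtract: x*x is rounded to float32 before the subtraction.
unfused = x * x - one

# Emulated fused multiply-add: compute in float64 (exact here) and round to float32 once.
fused = np.float32(np.float64(x) * np.float64(x) - np.float64(one))

print(unfused, fused, unfused == fused)   # the two float32 results differ in the last bits
```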
Disabling NVCC's precise division optimization (--prec-div) did not affect the results at all.

To summarize:
- changing devIoU() to use similar division reduces the CPU/CUDA disagreements, but is not sufficient on its own
- additionally disabling the fmad optimization prevents the failure
- but disabling fmad may affect performance

So maybe…
- change devIoU() to use similar division as CPU
- leave fmad to default to enabled
- have _create_tensors_with_iou() force generated data a bit away from the threshold (a rough sketch of this last idea is below)
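To make the last point concrete, here is a rough sketch (not the actual torchvision helper) of a _create_tensors_with_iou() that nudges the target IoU slightly past the requested threshold, so a few-ULP disagreement between CPU and CUDA can no longer flip the comparison:

```python
import torch

def _create_tensors_with_iou(N, iou_thresh):
    # Hypothetical sketch of the test helper discussed above.
    boxes = torch.rand(N, 4) * 100   # random (x1, y1, x2, y2) boxes
    boxes[:, 2:] += boxes[:, :2]     # ensure x2 >= x1 and y2 >= y1
    # Make the last box overlap the first with roughly the requested IoU,
    # but push the target slightly past the threshold so tiny rounding
    # differences cannot straddle it.
    boxes[-1, :] = boxes[0, :]
    x0, y0, x1, y1 = boxes[-1].tolist()
    iou_thresh += 1e-5
    boxes[-1, 2] += (x1 - x0) * (1 - iou_thresh) / iou_thresh
    scores = torch.rand(N)
    return boxes, scores
```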
If that sounds OK, I can put up a PR for the devIoU() change.

Thanks @hartb! This has been fixed now.