[Bug][Test] `test_module_sign` occasionally fails

🐛 Bug

`tests/compute/test_transform.py::test_module_sign[g0]` occasionally fails in the Torch CPU, Torch GPU, and Windows CPU unit tests. For example:

on CPU: log file from PR-3885
on GPU: log file from PR-3713
on Windows CPU: log file from PR-4140

cc @mufeili

Top GitHub Comments

mufeili commented, Jun 29, 2022

The error occurs again in http://dgl-jenkins-eksvpc-2136217999.us-west-2.elb.amazonaws.com/blue/organizations/jenkins/dgl/detail/PR-4183/1/pipeline/572. @mufeili

From the error message, it does seem to be a precision issue. Perhaps loosening the tolerances (`rtol`/`atol`) passed to `torch.allclose` will address the issue.
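To make the tolerance suggestion concrete, here is a minimal pure-Python sketch of the elementwise criterion that `torch.allclose` documents, `|input - other| <= atol + rtol * |other|`, with its default `rtol=1e-05` and `atol=1e-08`. The helper name `is_close` and the sample values are hypothetical and are not taken from the failing test. The actual error magnitude in the CI logs would need to be checked before picking new tolerances.

```python
def is_close(actual, expected, rtol=1e-5, atol=1e-8):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|.

    Mirrors the documented torch.allclose criterion for finite values;
    this is an illustrative re-implementation, not the library code.
    """
    return all(abs(a - e) <= atol + rtol * abs(e) for a, e in zip(actual, expected))


# Hypothetical values whose relative error is roughly 2e-5.
expected = [1.0, -2.0, 0.5]
actual = [1.00002, -2.00003, 0.50001]

# Default tolerances reject the small drift.
strict = is_close(actual, expected)

# Looser tolerances (assumed values, chosen only for illustration) accept it.
relaxed = is_close(actual, expected, rtol=1e-4, atol=1e-6)

print(strict, relaxed)
```

Note that `atol` dominates when the expected values are near zero, so raising only `rtol` may not help if the failing entries are close to zero.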

yaox12 commented, Jun 29, 2022

~~I think this could be caused by matmul precision in PyTorch. https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices.~~

Starting in PyTorch 1.7, there is a new flag called allow_tf32. This flag defaults to True in PyTorch 1.7 to PyTorch 1.11, and False in PyTorch 1.12 and later.

~~We probably should set torch.backends.cuda.matmul.allow_tf32 = False for this test. cc @nv-dlasalle~~

But this doesn’t explain why it fails in the CPU tests.
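For reference, the TF32 flag discussed above could be scoped to a single test with a small context manager, as in the sketch below. The helper name `tf32_matmul_disabled` is hypothetical. The flag only affects CUDA matmuls on GPUs that support TF32, so this would not change the CPU failures.

```python
import contextlib

import torch


@contextlib.contextmanager
def tf32_matmul_disabled():
    """Temporarily force full float32 precision for CUDA matmuls.

    Assumes PyTorch >= 1.7, where torch.backends.cuda.matmul.allow_tf32 exists.
    """
    previous = torch.backends.cuda.matmul.allow_tf32
    torch.backends.cuda.matmul.allow_tf32 = False
    try:
        yield
    finally:
        # Restore whatever the caller had configured.
        torch.backends.cuda.matmul.allow_tf32 = previous


with tf32_matmul_disabled():
    # Any CUDA matmul run here would use full float32 precision.
    flag_inside = torch.backends.cuda.matmul.allow_tf32
```

Restoring the previous value in `finally` keeps the setting from leaking into other tests that run in the same process.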