[Bug][Test] `test_module_sign` occasionally fails
See original GitHub issueIssue Analytics
- State:
- Created a year ago
- Comments:6 (3 by maintainers)
Top Results From Across the Web
No results found
Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free
Top Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
From the error message, it does seem to be a precision issue. Perhaps making the threshold larger in
torch.allcose
will address the issue.~~I think this could be caused by matmul precision in PyTorch. https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices.~~
We probably should settorch.backends.cuda.matmul.allow_tf32 = False
for this test. cc @nv-dlasalleBut this can’t answer why it fails in CPU test.