torch.sum bug
❓ Questions and Help
There is a bug in my code for adding a loss to Mask R-CNN. First, there is an issue when converting int to float: sometimes 1.0 turns into some float value like 0.2345, which is less than one. It is weird, so I printed the values of the variable to analyze it, and the problem always occurs when converting cpu.int directly to cuda.float. So I convert cpu.int to cpu.float first and then to cuda.float, which works, but sometimes there is still a problem:
gt sum is tensor([1.3014e+05, 9.0000e+00, 6.0100e+02, 3.2500e+02], device='cuda:0')
sg is tensor([6.3204e-35, 5.1167e-35, 7.5241e-35, 6.3204e-35], device='cuda:0')
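For reference, this is a minimal sketch of the conversion pattern I mean. The tensor values are taken from the print above, the variable names are only for illustration, and it assumes a CUDA device is available:

```python
import torch

# Hypothetical ground-truth counts as a CPU int tensor (values copied from the
# print above; names are placeholders, not my real code).
gt = torch.tensor([130140, 9, 601, 325], dtype=torch.int32)

# The pattern that misbehaved for me: casting cpu.int straight to cuda.float.
# sg = gt.to(device='cuda', dtype=torch.float32)

# Workaround: cast to float on the CPU first, then move the tensor to the GPU.
sg = gt.float().to('cuda')
print('gt sum is', sg)
```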
Second, another problem is like this:
When I print the value of seg4loss, sum_gt is affected: the torch.sum result becomes the last seg4loss value I printed. I checked it three times, printing seg4loss[0], seg4loss[1], and seg4loss[2] in turn, and each time sum_gt changed to match the printed value. When I delete the print code, it works well. I don't know why.
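A rough sketch of that second situation, with placeholder tensors standing in for my real seg4loss and gt (the behaviour does not reproduce with these placeholders, they only show the pattern of prints I mean):

```python
import torch

# Placeholder stand-ins; in my real code seg4loss comes out of the network.
gt = torch.tensor([130140, 9, 601, 325], dtype=torch.int32)
seg4loss = torch.rand(3, 4, device='cuda')

sum_gt = gt.float().to('cuda').sum()

# With these three prints in place, the sum_gt I observed always matched the
# last seg4loss element printed; deleting the prints gave the correct sum.
print(seg4loss[0])
print(seg4loss[1])
print(seg4loss[2])
print('sum_gt is', sum_gt)
```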
Third: the softmax output is always nan, and I use torch.clamp to avoid it. But I don't think that is a good idea, so I will check the code again to find the source of the problem. By the way, has anyone else encountered this situation?
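The stop-gap I am using looks roughly like this (just a sketch with made-up logits; the clamp bounds are my own choice, not from any library default):

```python
import torch
import torch.nn.functional as F

# Made-up logits; in my real code the softmax of the network output becomes nan.
logits = torch.randn(4, 2, device='cuda')

probs = F.softmax(logits, dim=1)

# Stop-gap: clamp the probabilities away from 0 and 1 so a downstream log()
# cannot produce inf/nan. This hides the symptom rather than fixing the cause.
probs = torch.clamp(probs, min=1e-7, max=1.0 - 1e-7)
loss = -torch.log(probs).mean()
```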
Issue Analytics
- Created 5 years ago
- Comments: 15 (7 by maintainers)
Top GitHub Comments
I found the problem. It was an issue with my data format. Now it runs without the loss becoming ‘nan’.
@fmassa, thanks very much for your constant help!
Can you try running it on a standard COCO dataset for training to see if you also get nan? If not, then it's probably an issue with your data.