Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

❓ Questions and Help

There is a bug in my process of adding a loss to Mask R-CNN. First, there is an issue when converting int to float: sometimes 1.0 turns into a float value like 0.2345, which is less than one. It is weird, so I printed the variables to analyze it, and the problem always occurs when converting cpu.int directly to cuda.float. If I convert cpu.int to cpu.float first, and then to cuda.float, it works. But sometimes there is still a problem:

gt sum is tensor([1.3014e+05, 9.0000e+00, 6.0100e+02, 3.2500e+02], device='cuda:0')
sg is tensor([6.3204e-35, 5.1167e-35, 7.5241e-35, 6.3204e-35], device='cuda:0')
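A minimal sketch of the two-step conversion described above (int to float on the CPU first, then move the tensor to the GPU); the tensor shape here is a hypothetical stand-in for the ground-truth masks, and the sketch falls back to the CPU so it stays runnable without a GPU:

```python
import torch

# Use the GPU when available; fall back to CPU so the sketch stays runnable.
device = "cuda" if torch.cuda.is_available() else "cpu"

# A hypothetical integer mask tensor on the CPU (stand-in for the ground truth).
gt = torch.ones(4, 28, 28, dtype=torch.int32)

# Two-step conversion, as described above: int -> float on the CPU,
# then move the float tensor to the device.
gt_float = gt.float()          # cpu int32 -> cpu float32
gt_cuda = gt_float.to(device)  # cpu float32 -> device float32

print(gt_cuda.sum())  # 4 * 28 * 28 = 3136.0
```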

Second, another problem looks like this:

[screenshot: 2018-11-05 10-49-06]

When I print the value of seg4loss, sum_gt is affected:

[screenshot: 2018-11-05 10-49-58]

The torch sum becomes the last seg4loss value I printed. I checked this three times, printing seg4loss[0], seg4loss[1], and seg4loss[2] in turn; each time sum_gt changes to match the printed value.

When I delete the print statements:

[screenshot: 2018-11-05 10-50-22]

it works well, and I don't know why:

[screenshot: 2018-11-05 10-51-28]
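For what it's worth, a print that changes downstream values is a classic symptom of reading GPU results before the queued kernels have finished (printing a CUDA tensor forces a synchronization). A minimal sketch of how to rule that out, using a hypothetical stand-in for seg4loss:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for seg4loss: three per-class segmentation scores.
seg4loss = torch.rand(3, 8, device=device)

# Force all queued GPU kernels to finish before reading results, so the
# value does not depend on whether a print happened to synchronize first.
if device == "cuda":
    torch.cuda.synchronize()

sum_gt = seg4loss.sum(dim=1)

# .item() also synchronizes implicitly when copying a scalar to the host.
total = seg4loss.sum().item()
```

If the values become stable once an explicit `torch.cuda.synchronize()` is in place, the underlying cause is an unsynchronized asynchronous op, not the print itself.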

Third: the softmax output is always NaN, and I use torch.clamp to avoid it. But I don't think that is a good idea; I will check the code again to find the source of the problem. By the way, has anyone else encountered this situation?
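On the third point: softmax usually produces NaN when the exponentials overflow to infinity. Rather than clamping the output after the fact, the numerically stable built-ins avoid the overflow in the first place. A sketch with made-up extreme logits:

```python
import torch
import torch.nn.functional as F

# Extreme logits that overflow a naive softmax.
logits = torch.tensor([[1000.0, -1000.0, 0.0]])

# Naive exp-based softmax overflows: exp(1000) == inf, and inf / inf == nan.
naive = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
print(torch.isnan(naive).any())   # tensor(True)

# F.softmax subtracts the row maximum internally, so it stays finite.
stable = F.softmax(logits, dim=1)
print(torch.isnan(stable).any())  # tensor(False)

# For losses, log_softmax (or cross_entropy, which fuses it) is the
# usual stable choice, instead of clamping probabilities afterwards.
log_probs = F.log_softmax(logits, dim=1)
```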

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 15 (7 by maintainers)

Top GitHub Comments

1 reaction
akshaygadipatil commented, Nov 23, 2018

I found the problem. It was an issue with my data format. Now it runs without the loss becoming ‘nan’.

@fmassa, thanks very much for your constant help!

0 reactions
fmassa commented, Nov 22, 2018

Can you try running training on a standard COCO dataset to see if you also get NaN? If not, then it's probably an issue with your data.
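fmassa's suggestion can be complemented by validating the data directly. A hedged sketch (the `check_finite` helper and the box tensors are hypothetical, not part of the library) that flags NaN or inf values before they reach the loss:

```python
import torch

def check_finite(name, t):
    """Raise if a tensor contains NaN or inf before it reaches the loss."""
    if not torch.isfinite(t).all():
        raise ValueError(f"{name} contains NaN or inf values")

# Hypothetical batch: one clean bounding-box target and one corrupted one.
boxes = torch.tensor([[10.0, 20.0, 50.0, 80.0]])
check_finite("boxes", boxes)  # passes silently

bad = torch.tensor([[10.0, float("nan"), 50.0, 80.0]])
# check_finite("bad_boxes", bad)  # would raise ValueError
```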


Top Results From Across the Web

  • torch.sum(x, dim=()) inconsistent with documentation (GitHub)
  • The sum() function is inaccurate when the CPU evaluates float32 data
  • In pytorch, why does sum(tensor) return a fault result instead of the expected one?
  • How to measure the mean squared error (squared L2 norm) in PyTorch
  • torch.Tensor.cumsum(dim) bug in DLM tutorial?
