torch.sum bug
❓ Questions and Help
There is a bug in my code for adding a loss to Mask R-CNN. First, there is an issue when converting int to float: sometimes 1.0 turns into some float value like 0.2345, which is less than one. It is weird, so I printed the values of the variable to analyze it, and the problem always occurs when converting cpu.int directly to cuda.float. So I convert cpu.int to cpu.float first and then to cuda.float, which works, but sometimes there is still a problem:
gt sum is tensor([1.3014e+05, 9.0000e+00, 6.0100e+02, 3.2500e+02], device='cuda:0')
sg is tensor([6.3204e-35, 5.1167e-35, 7.5241e-35, 6.3204e-35], device='cuda:0')
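For reference, this is a minimal sketch of the conversion pattern I mean. The tensor values are taken from the print above, the variable names are only for illustration, and it assumes a CUDA device is available:

```python
import torch

# Hypothetical ground-truth counts as a CPU int tensor (values copied from the
# print above; names are placeholders, not my real code).
gt = torch.tensor([130140, 9, 601, 325], dtype=torch.int32)

# The pattern that misbehaved for me: casting cpu.int straight to cuda.float.
# sg = gt.to(device='cuda', dtype=torch.float32)

# Workaround: cast to float on the CPU first, then move the tensor to the GPU.
sg = gt.float().to('cuda')
print('gt sum is', sg)
```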
Second, another problem is like this:
When I print the value of seg4loss, sum_gt is affected: the torch.sum result becomes the last seg4loss value I printed. I checked it three times, printing seg4loss[0], seg4loss[1], and seg4loss[2] in turn, and each time sum_gt changed to match the printed value. When I delete the print code, it works well. I don't know why.
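A rough sketch of that second situation, with placeholder tensors standing in for my real seg4loss and gt (the behaviour does not reproduce with these placeholders, they only show the pattern of prints I mean):

```python
import torch

# Placeholder stand-ins; in my real code seg4loss comes out of the network.
gt = torch.tensor([130140, 9, 601, 325], dtype=torch.int32)
seg4loss = torch.rand(3, 4, device='cuda')

sum_gt = gt.float().to('cuda').sum()

# With these three prints in place, the sum_gt I observed always matched the
# last seg4loss element printed; deleting the prints gave the correct sum.
print(seg4loss[0])
print(seg4loss[1])
print(seg4loss[2])
print('sum_gt is', sum_gt)
```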
Third: the softmax output is always nan, and I use torch.clamp to avoid it. But I don't think that is a good idea, so I will check the code again to find the source of the problem. By the way, has anyone else encountered this situation?
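The stop-gap I am using looks roughly like this (just a sketch with made-up logits; the clamp bounds are my own choice, not from any library default):

```python
import torch
import torch.nn.functional as F

# Made-up logits; in my real code the softmax of the network output becomes nan.
logits = torch.randn(4, 2, device='cuda')

probs = F.softmax(logits, dim=1)

# Stop-gap: clamp the probabilities away from 0 and 1 so a downstream log()
# cannot produce inf/nan. This hides the symptom rather than fixing the cause.
probs = torch.clamp(probs, min=1e-7, max=1.0 - 1e-7)
loss = -torch.log(probs).mean()
```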
Issue Analytics
- Created 5 years ago
- Comments: 15 (7 by maintainers)
Top GitHub Comments
I found the problem. It was an issue with my data format. Now it runs without the loss becoming ‘nan’.
@fmassa, thanks very much for your constant help!
Can you try running it on a standard COCO dataset for training to see if you also get nan? If not, then it's probably an issue with your data.