Decoder randomly outputs NaN tensor.
Hi,
I just noticed that the decoder misbehaves: it seems to output a NaN tensor at random.
- Problem: AutoregressiveWrapper.generate randomly outputs a NaN tensor and fails with
  RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
- How to reproduce the bug: move the decoder to a CUDA device,
  dec = PerformerLM(**dec_kwargs).to('cuda:1')
  and repeat decoding inside the AutoregressiveWrapper:
- performer_pytorch/autoregressive_wrapper.py(63)
  ...
  for _ in range(seq_len):
      x = out[:, -self.max_seq_len:]
      input_mask = input_mask[:, -self.max_seq_len:]
      logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :]   <<-- HERE
- output
  (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
  tensor([[-0.6147, 0.4647, 0.8009, ..., -0.3772, -0.5126, -0.3495]], device='cuda:1')
  (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
  tensor([[-0.6792, 0.3940, 0.6685, ..., -0.5081, -0.4801, -0.2691]], device='cuda:1')
  (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
  tensor([[-0.6146, 0.4647, 0.8011, ..., -0.3772, -0.5128, -0.3496]], device='cuda:1')
  (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
  tensor([[-0.0530, -0.0343, 0.0998, ..., 0.6310, -0.1682, -0.7353]], device='cuda:1')
  (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
  tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:1')   <<-- It randomly outputs a NaN tensor.
Any ideas why this happens?
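To measure how often the non-finite output shows up without stepping through pdb by hand, the repeated forward pass can be scripted. The following is only a minimal sketch: `find_nonfinite_runs` is a hypothetical helper (not part of performer_pytorch), and `toy_net` is a small CPU stand-in used so the snippet runs anywhere. To reproduce the issue itself, you would pass the real decoder and its `mask=input_mask` keyword instead.

```python
import torch
from torch import nn


def find_nonfinite_runs(net, x, n_runs=20, **kwargs):
    """Repeat the same forward pass and return the indices of the runs
    whose last-step logits contain NaN or inf. (Hypothetical helper.)"""
    bad_runs = []
    net.eval()
    with torch.no_grad():
        for i in range(n_runs):
            # Same slicing as in the generate loop: logits of the last position.
            logits = net(x, **kwargs)[:, -1, :]
            if not torch.isfinite(logits).all():
                bad_runs.append(i)
    return bad_runs


# Toy stand-in model (an assumption for illustration, not PerformerLM):
# maps token ids of shape (batch, seq) to logits of shape (batch, seq, vocab).
toy_net = nn.Sequential(nn.Embedding(10, 8), nn.Linear(8, 10))
tokens = torch.randint(0, 10, (1, 5))

# An empty list means every run produced finite logits.
print(find_nonfinite_runs(toy_net, tokens, n_runs=5))
```

With the real model on the GPU, a non-empty result for a fixed input would confirm that the NaNs appear intermittently rather than being caused by a particular input.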
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (1 by maintainers)
Top GitHub Comments
I’ve tested some possible causes of this error and suspect a CUDA driver problem because of the following results:
I’m going to try other GPUs (perhaps GCP V100s) to check whether this issue comes from CUDA/the GPU or not.
For readers who encounter similar issues: I found that exporting
CUDA_LAUNCH_BLOCKING=1
avoids the problem when using multiple GPUs (or NVLink). The issue probably comes from synchronization between GPUs.
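If exporting the variable in the shell is inconvenient, the same CUDA_LAUNCH_BLOCKING setting can be applied from inside the script. This is a sketch under one assumption: the variable has to be set before CUDA is initialized in the process, so in practice it belongs at the very top of the entry-point script, before any CUDA work is done.

```python
import os

# Set this before the first CUDA call in the process; once the CUDA
# context exists, changing the variable has no effect on it.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

Note that this makes kernel launches synchronous, so it is typically used as a debugging aid and can slow execution down noticeably.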