
Decoder randomly outputs NaN tensor.

See original GitHub issue

Hi,

I just noticed a misbehavior of the decoder: it seems to output a NaN tensor at random.

  • Problem: AutoregressiveWrapper.generate randomly outputs a NaN tensor and fails with RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 (see the note on this error after the trace below)

  • How to reproduce the bug: move the decoder to a CUDA device with dec = PerformerLM(**dec_kwargs).to('cuda:1'), then repeatedly evaluate the decoding step inside the AutoregressiveWrapper:

    • performer_pytorch/autoregressive_wrapper.py(63)
    ...
    for _ in range(seq_len):
        x = out[:, -self.max_seq_len:]
        input_mask = input_mask[:, -self.max_seq_len:]
        logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :] <<-- HERE
    
    • output
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6147,  0.4647,  0.8009,  ..., -0.3772, -0.5126, -0.3495]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6792,  0.3940,  0.6685,  ..., -0.5081, -0.4801, -0.2691]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6146,  0.4647,  0.8011,  ..., -0.3772, -0.5128, -0.3496]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.0530, -0.0343,  0.0998,  ...,  0.6310, -0.1682, -0.7353]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:1') <<-- It randomly outputs NaN tensor.
    

Any ideas why this happens?
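For context on that RuntimeError: once the network returns NaN logits, the softmax over them is also NaN, and the multinomial sampling step inside generate rejects the resulting probability tensor. A minimal, standalone illustration of that failure mode (not the original code; shapes and values are placeholders):

    import torch
    import torch.nn.functional as F

    # Stand-in for the NaN logits observed in the trace above.
    logits = torch.full((1, 8), float('nan'))

    probs = F.softmax(logits, dim=-1)  # softmax of NaN inputs is still NaN
    torch.multinomial(probs, num_samples=1)
    # -> RuntimeError: probability tensor contains either `inf`, `nan` or element < 0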

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
y-rokutan commented, Jan 16, 2021

I’ve tested some possible causes of this error, and I suspect a CUDA driver problem because of the following results:

  1. With a single GPU assigned to the model, the code runs without error.
  2. With two GPUs connected by an NVLink bridge, it throws CUDA Runtime Error: illegal memory access.
  3. With two GPUs without an NVLink bridge, it randomly outputs a NaN tensor. (Cases 2 and 3 run the same code.)

I’m going to try other GPUs (perhaps GCP V100s) to check whether this issue comes from the CUDA driver/GPU hardware or not.
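One way to narrow this down further is to restrict the visible devices per run and assert that the logits stay finite at every decoding step, so the failure is caught where it is produced rather than at the sampling call. The sketch below is illustrative only; assert_finite and its placement inside the generation loop are assumptions, not part of performer-pytorch:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # repeat the run with '0', then '0,1', etc.

    import torch

    def assert_finite(logits, step):
        # Fail as soon as a non-finite value appears, instead of waiting
        # for the multinomial sampling step to reject the probabilities.
        if not torch.isfinite(logits).all():
            raise RuntimeError(f'non-finite logits at decoding step {step}')

    # Illustrative usage: call assert_finite(logits, step) right after the line
    # marked "<<-- HERE" in the trace above.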

0 reactions
y-rokutan commented, Jan 23, 2021

For readers who run into similar issues: I found that exporting CUDA_LAUNCH_BLOCKING=1 works around the problem when using multiple GPUs (or NVLink). This probably comes down to synchronizing the GPUs.
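For reference, CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous, which serializes work across devices and also makes CUDA errors surface at the call that triggered them. It has to be set before CUDA is initialized, so either in the shell before launching the process or in Python before the first CUDA call. A short sketch (the script name is a placeholder):

    # In the shell, before starting the process:
    #   CUDA_LAUNCH_BLOCKING=1 python generate.py

    # Or from Python, before importing torch / before any CUDA call:
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    import torch
    # ... build PerformerLM / AutoregressiveWrapper and call generate() as usual ...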

