
Decoder randomly outputs NaN tensor.

See original GitHub issue

Hi,

I just noticed a misbehavior of the decoder: it seems to output a NaN tensor at random.

  • Problem: AutoregressiveWrapper.generate randomly outputs a NaN tensor and fails with RuntimeError: probability tensor contains either `inf`, `nan` or element < 0 (see the note on this error after the trace below)

  • How to reproduce the bug: move the decoder to a CUDA device with dec = PerformerLM(**dec_kwargs).to('cuda:1'), then repeatedly evaluate the decoding step inside the AutoregressiveWrapper:

    • performer_pytorch/autoregressive_wrapper.py(63)
    ...
    for _ in range(seq_len):
        x = out[:, -self.max_seq_len:]
        input_mask = input_mask[:, -self.max_seq_len:]
        logits = self.net(x, mask=input_mask, **kwargs)[:, -1, :] <<-- HERE
    
    • output
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6147,  0.4647,  0.8009,  ..., -0.3772, -0.5126, -0.3495]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6792,  0.3940,  0.6685,  ..., -0.5081, -0.4801, -0.2691]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.6146,  0.4647,  0.8011,  ..., -0.3772, -0.5128, -0.3496]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[-0.0530, -0.0343,  0.0998,  ...,  0.6310, -0.1682, -0.7353]],
           device='cuda:1')
    (Pdb) self.net(x, mask=input_mask, **kwargs)[:, -1, :]
    tensor([[nan, nan, nan,  ..., nan, nan, nan]], device='cuda:1') <<-- It randomly outputs NaN tensor.
    

Any ideas why this happens?
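For context on that RuntimeError: once the network returns NaN logits, the softmax over them is also NaN, and the multinomial sampling step inside generate rejects the resulting probability tensor. A minimal, standalone illustration of that failure mode (not the original code; shapes and values are placeholders):

    import torch
    import torch.nn.functional as F

    # Stand-in for the NaN logits observed in the trace above.
    logits = torch.full((1, 8), float('nan'))

    probs = F.softmax(logits, dim=-1)  # softmax of NaN inputs is still NaN
    torch.multinomial(probs, num_samples=1)
    # -> RuntimeError: probability tensor contains either `inf`, `nan` or element < 0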

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
y-rokutan commented, Jan 16, 2021

I’ve tested some possible causes of this error, and I suspect a CUDA driver problem because of the following results:

  1. With a single GPU assigned to the model, the code runs without error.
  2. With two GPUs connected by an NVLink bridge, it throws CUDA Runtime Error: illegal memory access.
  3. With two GPUs without an NVLink bridge, it randomly outputs a NaN tensor. (Cases 2 and 3 run the same code.)

I’m going to try other GPUs (perhaps GCP V100s) to check whether this issue comes from the CUDA driver/GPU hardware or not.
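One way to narrow this down further is to restrict the visible devices per run and assert that the logits stay finite at every decoding step, so the failure is caught where it is produced rather than at the sampling call. The sketch below is illustrative only; assert_finite and its placement inside the generation loop are assumptions, not part of performer-pytorch:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # repeat the run with '0', then '0,1', etc.

    import torch

    def assert_finite(logits, step):
        # Fail as soon as a non-finite value appears, instead of waiting
        # for the multinomial sampling step to reject the probabilities.
        if not torch.isfinite(logits).all():
            raise RuntimeError(f'non-finite logits at decoding step {step}')

    # Illustrative usage: call assert_finite(logits, step) right after the line
    # marked "<<-- HERE" in the trace above.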

0 reactions
y-rokutan commented, Jan 23, 2021

For readers who run into similar issues: I found that exporting CUDA_LAUNCH_BLOCKING=1 works around the problem when using multiple GPUs (or NVLink). This probably comes down to synchronizing the GPUs.
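For reference, CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous, which serializes work across devices and also makes CUDA errors surface at the call that triggered them. It has to be set before CUDA is initialized, so either in the shell before launching the process or in Python before the first CUDA call. A short sketch (the script name is a placeholder):

    # In the shell, before starting the process:
    #   CUDA_LAUNCH_BLOCKING=1 python generate.py

    # Or from Python, before importing torch / before any CUDA call:
    import os
    os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

    import torch
    # ... build PerformerLM / AutoregressiveWrapper and call generate() as usual ...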

