
Decoder AMP training loss blow up

See original GitHub issue

Just to pin down my training issue:

With amp = True, the decoder's loss (unet1 & unet2) blew up to 1K+ [loss < 0.01 with AMP disabled]. Just curious: does half-precision training harm DDPM training?

[screenshot: decoder loss curve blowing up under AMP]

As a reference: https://wandb.ai/hcaoaf/dalle2-decoder/runs/3sb0pumh?workspace=user-hcaoaf
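For background on the question itself: PyTorch AMP is normally paired with dynamic loss scaling, which is the standard guard against fp16 gradients overflowing. Below is a minimal sketch of the usual autocast + GradScaler loop; the tiny unet stand-in, the random data, and the hyperparameters are placeholders, not the DALLE2-pytorch trainer's actual code.

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler

# Hypothetical stand-ins for illustration only; this shows the standard
# autocast + GradScaler pattern, not the DALLE2-pytorch trainer.
unet = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()   # requires a CUDA device
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
scaler = GradScaler()  # dynamic loss scaling keeps fp16 gradients in range

for step in range(100):
    batch = torch.randn(8, 3, 64, 64, device="cuda")
    optimizer.zero_grad()
    with autocast():                        # forward pass in mixed precision
        loss = F.mse_loss(unet(batch), batch)
    scaler.scale(loss).backward()           # backward on the scaled loss
    scaler.unscale_(optimizer)              # unscale so clipping sees true norms
    torch.nn.utils.clip_grad_norm_(unet.parameters(), 1.0)
    scaler.step(optimizer)                  # skipped automatically on inf/NaN grads
    scaler.update()
```

If the scaler repeatedly skips steps, that is usually the first visible symptom of the kind of fp16 overflow described in this issue.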

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
CiaoHe commented on May 15, 2022

@lucidrains wow, cool built-in grad-accum support. Now everything works fine (max-batch-size: 32, total-batch-size: 265, amp: True). [screenshot: loss curve]
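For readers who land here: gradient accumulation keeps each micro-batch small enough for memory while the optimizer effectively sees a larger batch, which is what the max-batch-size / total-batch-size split above describes. The sketch below shows the generic pattern under AMP; it is not the repository's built-in implementation, and every name in it is illustrative.

```python
import torch
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler

# Illustrative names only; not DALLE2-pytorch's built-in grad-accum.
unet = torch.nn.Conv2d(3, 3, 3, padding=1).cuda()
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
scaler = GradScaler()
accum_steps = 8  # e.g. micro-batch 32 accumulated toward a larger effective batch

for i in range(800):
    micro_batch = torch.randn(32, 3, 64, 64, device="cuda")
    with autocast():
        # average over micro-batches so the accumulated gradient
        # matches one large batch
        loss = F.mse_loss(unet(micro_batch), micro_batch) / accum_steps
    scaler.scale(loss).backward()           # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)              # one optimizer step per effective batch
        scaler.update()
        optimizer.zero_grad()
```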

1 reaction
CiaoHe commented on May 15, 2022

Quoting @lucidrains: "@CiaoHe how does it look without AMP? I also built in an unconditional feature for the decoder, so one can train it without cross-attention conditioning (I'm guessing it is probably the attention blocks blowing up, so I added some additional normalization that should help)."

Here's without AMP: [screenshot: loss curve without AMP]

@lucidrains let me try the current version
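The "additional normalization" mentioned above is not spelled out in this thread, so the following is only a guess at the general technique: a common way to stabilize attention blocks in mixed precision is to normalize the block's input before attention (pre-norm) while keeping a residual path, so the logits stay in a range fp16 can represent. This module is a generic sketch, not DALLE2-pytorch's actual code.

```python
import torch
from torch import nn

class PreNormAttention(nn.Module):
    """Generic pre-norm attention block: normalizing the input before
    attention keeps the logits in a saner range under fp16. A guess at
    the kind of stabilizer meant above, not the repository's code."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                 # normalize before attention
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h)                   # self-attention on normalized input
        return x + out                                # residual keeps the signal path stable

x = torch.randn(2, 16, 64)                            # (batch, sequence, dim)
print(PreNormAttention(dim=64)(x).shape)              # torch.Size([2, 16, 64])
```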

Read more comments on GitHub >

Top Results From Across the Web

What can be the cause of a sudden explosion in the loss when ...
For the first 50k steps of training the loss is quite stable and low, then it suddenly starts to explode exponentially. I wonder...
Read more >
loss explodes after few iterations · Issue #3868 - GitHub
I rechecked the CSV files that were created. I am training on only one class, but the gradient explodes exponentially after a few iterations. Please...
Read more >
Exploding loss in encoder/decoder model - PyTorch Forums
I'm trying to build a text-to-speech model in PyTorch using an encoder/decoder architecture on the LibriSpeech 100-hour dataset.
Read more >
9 Tips For Training Lightning-Fast Neural Networks In Pytorch
The amp package will take care of most things for you. It'll even scale the loss if the gradients explode or go to...
Read more >
Exploding loss in encoder/decoder model PyTorch
You are calculating your loss only on the last mini-batch; you should accumulate your loss and call backward on the accumulated loss...
Read more >
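A common thread in these results is that float16 has a hard ceiling: values past roughly 65504 overflow to inf, which then propagates as NaN through training. A quick plain-PyTorch demonstration (runs on CPU):

```python
import torch

# float16 saturates just above 65504; larger values become inf, which is
# why AMP relies on loss scaling to keep gradients representable.
print(torch.finfo(torch.float16).max)                 # 65504.0
print(torch.tensor(70000.0, dtype=torch.float16))     # tensor(inf, dtype=torch.float16)
```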
