Could there be a bug in mixed precision?
When I use torch 1.6.0 & accelerate 0.3.0 and set mixed precision to yes in accelerate config, nothing happens (training still runs in full precision). If I instead set Accelerator(fp16=True) in the code, AMP is triggered, but the loss becomes inf right away. However, if I use the PyTorch way (i.e. call autocast in the code myself), training is normal and AMP is enabled. So I wonder if there is a possible bug in accelerate.
My environment is a single 2080 Ti on a local machine. The code with this problem is here.
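For reference, the two setups being compared look roughly like the sketches below; the model, data, and optimizer are toy placeholders, not the code linked above. The first sketch forces fp16 through the Accelerator constructor (the accelerate 0.3.0 API), which is the path that reportedly makes the loss go to inf:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(fp16=True)  # force AMP in code instead of via accelerate config

model = torch.nn.Linear(10, 2)                      # placeholder model
optimizer = torch.optim.Adam(model.parameters())
dataset = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
dataloader = DataLoader(dataset, batch_size=8)

# Accelerate moves everything to the right device and wraps the optimizer
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # gradient scaling is handled internally when fp16 is on
    optimizer.step()
```

The "PyTorch way" mentioned above, i.e. using torch.cuda.amp directly with autocast and GradScaler (the torch 1.6.0 API), is roughly the following and reportedly trains normally:

```python
import torch
from torch.cuda.amp import autocast, GradScaler
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")
model = torch.nn.Linear(10, 2).to(device)           # placeholder model
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()
dataset = TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,)))
dataloader = DataLoader(dataset, batch_size=8)

for inputs, targets in dataloader:
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with autocast():                                # forward pass runs in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()                   # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
```

The main difference between the two is that Accelerate hides the gradient scaler behind accelerator.backward and the wrapped optimizer, while the manual version makes the scale/step/update cycle explicit.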
Issue Analytics: Created 2 years ago · Comments: 24 (9 by maintainers)
Top GitHub Comments
I was able to investigate this more and I think I found the problem. The PR above should fix the issue; would you mind giving it a try?
Thanks for the analysis and the example you provided. I’ll try to dig more into the differences tomorrow.