[Dreambooth Example] Attempting to unscale FP16 gradients.
Describe the bug
The training script was working fine, but after updating diffusers to 0.7.2 I now get the following error:
Traceback (most recent call last):
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
    main(args)
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
    optimizer.step()
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|          | 0/800 [00:18<?, ?it/s]
Any ideas, or do I need to downgrade?
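For context (my reconstruction of the cause, not an official explanation): PyTorch's GradScaler raises exactly this ValueError when the gradients it is asked to unscale are already fp16, which happens whenever the parameters being optimized are themselves half precision. A minimal sketch that triggers it, assuming a CUDA device:

```python
import torch

# Minimal sketch of the failure mode (assumed cause, not the actual
# training script): fp16 parameters produce fp16 gradients, and
# GradScaler.unscale_() rejects those. AMP expects fp32 master weights,
# with only the forward/backward compute running in half precision.
model = torch.nn.Linear(4, 4).cuda().half()  # parameters are fp16
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda")
with torch.cuda.amp.autocast():
    loss = model(x).float().sum()

scaler.scale(loss).backward()   # gradients land in fp16
scaler.step(optimizer)          # ValueError: Attempting to unscale FP16 gradients.
```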
Reproduction
No response
Logs
No response
System Info
- diffusers 0.7.2
- python 3.7.12
- accelerate 0.14.0
Issue Analytics
- Created: 10 months ago
- Comments: 26 (11 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the detailed issue, taking a look now.
Hi all, sorry for the radio silence… some time-sensitive matters snuck up on me. I hope one of the other contributors on this issue can confirm the fix; otherwise I hope to try it out on Sunday and will report back after.
Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And thanks @patil-suraj, I indeed got Dreambooth working with fp32 too; it kind of fixed itself, but I think I had been loading one of the components from an incompatible model.)
🙏
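For readers landing here later: the resolution discussed above boils down to not loading the trainable model in half precision. A sketch of that pattern, assuming accelerate handles the mixed precision (the model id and component layout are illustrative, not the merged patch):

```python
import torch
from accelerate import Accelerator
from diffusers import AutoencoderKL, UNet2DConditionModel

# Sketch based on the discussion above; the checkpoint is an example.
accelerator = Accelerator(mixed_precision="fp16")
weight_dtype = torch.float16 if accelerator.mixed_precision == "fp16" else torch.float32

model_id = "CompVis/stable-diffusion-v1-4"  # example checkpoint
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")

# Frozen components can safely run in fp16 to save memory...
vae.to(accelerator.device, dtype=weight_dtype)
# ...but the trainable unet must stay in fp32: the GradScaler that
# accelerate wraps around optimizer.step() refuses to unscale fp16
# gradients, which is exactly the ValueError in this issue.
unet.to(accelerator.device)
```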