
[Dreambooth Example] Attempting to unscale FP16 gradients.

See original GitHub issue

Describe the bug

I had the training script working fine, but after updating diffusers to 0.7.2 I get the following error:

Traceback (most recent call last):
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 938, in <module>
    main(args)
  File "/tmp/pycharm_project_990/train_dreambooth.py", line 876, in main
    optimizer.step()
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 337, in step
    self.unscale_(optimizer)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 282, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(optimizer, inv_scale, found_inf, False)
  File "/opt/conda/envs/dreambooth/lib/python3.7/site-packages/torch/cuda/amp/grad_scaler.py", line 210, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|          | 0/800 [00:18<?, ?it/s]

Any ideas, or do I need to downgrade?

Reproduction

No response
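
The Reproduction field was left blank, so here is some context on the error itself: torch.cuda.amp.GradScaler raises this ValueError whenever the gradients it is asked to unscale are already torch.float16, which happens as soon as the trained parameters themselves are fp16. A minimal sketch that triggers the same failure outside diffusers (hypothetical code, not from the issue; assumes a CUDA device):

import torch

# Minimal illustration: fp16 parameters produce fp16 gradients,
# which GradScaler refuses to unscale.
model = torch.nn.Linear(4, 4).cuda().half()   # params, and hence grads, are fp16
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(2, 4, device="cuda", dtype=torch.float16)
with torch.autocast("cuda"):
    loss = model(x).sum()

scaler.scale(loss).backward()
scaler.step(optimizer)  # calls unscale_() internally and raises:
                        # ValueError: Attempting to unscale FP16 gradients.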

Logs

No response

System Info

  • diffusers: 0.7.2
  • python: 3.7.12
  • accelerate: 0.14.0

Issue Analytics

  • State: open
  • Created: 10 months ago
  • Comments: 26 (11 by maintainers)

Top GitHub Comments

2 reactions
patil-suraj commented, Nov 21, 2022

Thanks for the detailed issue, taking a look now.

1 reaction
gadicc commented, Dec 1, 2022

Hi all, sorry for the radio silence… some time-sensitive matters snuck up on me. I hope one of the other contributors to this issue can confirm the fix; otherwise I hope to have a chance to try this out on Sunday, and I promise to report back after.

Thank you both @patil-suraj and @patrickvonplaten for your amazing and quick work here! (And @patil-suraj, thanks, I indeed got dreambooth working with fp32 too; it kind of fixed itself, but I think I had been loading one of the components with an incompatible model.)

🙏
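
For reference, the fp32 route mentioned above reflects the usual way to avoid this error with mixed precision: keep the trained unet in fp32, so its gradients are fp32 and the scaler can unscale them, and cast only the frozen components to half precision. A rough sketch, assuming the dreambooth script's accelerator, unet, vae, and text_encoder objects:

import torch

# Sketch (assumes the script's accelerator, unet, vae, and text_encoder).
# Only the frozen models are cast to fp16; the trained unet keeps fp32
# weights, so its gradients stay fp32.
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16

vae.to(accelerator.device, dtype=weight_dtype)           # frozen: fp16 is fine
text_encoder.to(accelerator.device, dtype=weight_dtype)  # frozen: fp16 is fine
unet.to(accelerator.device)                              # trained: leave in fp32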


Top Results From Across the Web

[0.4.1] ValueError: Attempting to unscale FP16 gradients. #834
For example, what kind of optimizer would be used with FP16 grads? How is the optimizer state being handled? Is there a reason...

ValueError: Attempting to unscale fp16 gradients
Hello all, I am trying to train an LSTM in the half-precision setting. The LSTM takes an encoded input from a pre-trained autoencoder (not ...)

Automatic Mixed Precision Using PyTorch - Paperspace Blog
Data from the FP16 pipeline is processed using Tensor Cores to conduct GEMMs ... You may unscale the gradients of other parameters that...

Train With Mixed Precision - NVIDIA Documentation Center
Porting the model to use the FP16 data type where appropriate. Adding loss scaling to preserve small gradient values. The ability to train...

Help & Questions Megathread! : r/StableDiffusion - Reddit
Problem when I click train (dreambooth) (automatic1111): "Returning result: Training finished. Total lifetime steps: 0" is what I get 3 mins ...
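
The Paperspace and NVIDIA results above describe the same underlying rule: with loss scaling, the master weights, and therefore the gradients the optimizer sees, stay fp32, and unscaling happens before any gradient clipping. The canonical PyTorch loop for this (model, optimizer, and loader assumed) looks roughly like:

import torch

# Standard PyTorch AMP step: fp32 parameters, autocast for the forward
# pass, explicit unscale_ so gradient clipping sees true magnitudes.
scaler = torch.cuda.amp.GradScaler()
for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.autocast("cuda"):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                   # grads back to true fp32 scale
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                       # skipped if grads overflowed
    scaler.update()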
