Training stops when reaching 500 steps on T4
I'm using an AWS T4 GPU, and training always stops when reaching step 500.

The error I'm getting:
Traceback (most recent call last):
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 719, in <module>
    main(args)
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 638, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/sd/diffusers/src/diffusers/models/unet_2d_condition.py", line 375, in forward
    sample = self.conv_in(sample)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 460, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Steps: 62%|██████████████████▊ | 500/800 [19:51<11:55, 2.38s/it, loss=0.598, lr=8.15e-7]
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'diffusers/examples/dreambooth/train_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2', '--instance_data_dir=sona', '--class_data_dir=class/Women', '--output_dir=sona_output', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=sona', '--class_prompt=a photo of person', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=2e-6', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=800']' returned non-zero exit status 1.
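The failure behind the first traceback is a dtype mismatch inside a convolution: the layer's weights are in fp16 while its bias (or input) is in fp32, so the conv forward refuses to run. A minimal sketch that provokes the same class of error, using a small stand-in `Conv2d` rather than the actual UNet `conv_in`:

```python
import torch

# Stand-in conv layer (not the real UNet conv_in); cast only its weight to fp16
# so the bias stays fp32, mimicking a partially-converted checkpoint.
conv = torch.nn.Conv2d(4, 8, kernel_size=3, padding=1)
conv.weight.data = conv.weight.data.half()

# fp16 input, analogous to noisy_latents in the training loop
x = torch.randn(1, 4, 16, 16, dtype=torch.half)

try:
    conv(x)
except RuntimeError as e:
    # Surfaces as: "Input type (c10::Half) and bias type (float) should be the same"
    print(f"RuntimeError: {e}")
```

PyTorch performs this dtype check before dispatching the convolution, which is why the error names the input and bias types explicitly.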
The training command I'm using:
python3 diffusers/examples/dreambooth/train_dreambooth.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
--instance_data_dir="sona" \
--class_data_dir="class/Women" \
--output_dir="sona_output" \
--with_prior_preservation --prior_loss_weight=1.0 \
--instance_prompt="sona" \
--class_prompt="a photo of person" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 --gradient_checkpointing \
--use_8bit_adam \
--learning_rate=2e-6 \
--lr_scheduler="polynomial" \
--lr_warmup_steps=0 \
--num_class_images=200 \
--max_train_steps=800
I have 15360 MiB of GPU VRAM.
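A common workaround for this class of error (a sketch of the general technique, not the repository's actual fix) is to cast the model inputs to the parameters' dtype before the forward pass, e.g. casting `noisy_latents` to `unet.dtype` in the training loop. The helper below illustrates the idea with a stand-in module; `cast_inputs_to_model_dtype` is a hypothetical name, not a diffusers API:

```python
import torch

def cast_inputs_to_model_dtype(model: torch.nn.Module, *tensors):
    """Cast each tensor to the dtype of the model's parameters before a forward pass."""
    dtype = next(model.parameters()).dtype
    return tuple(t.to(dtype) for t in tensors)

# Usage sketch with a stand-in module (float64 here just to make the cast visible):
net = torch.nn.Linear(4, 4).double()
x = torch.randn(2, 4)                        # fp32 input, mismatched with the model
(x_cast,) = cast_inputs_to_model_dtype(net, x)
out = net(x_cast)                            # dtypes now agree, no RuntimeError
```

This only papers over the symptom; if a checkpoint save routine is the thing casting half the model to fp16, fixing the save path (as the maintainers did) is the real cure.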
Issue Analytics
- Created: 9 months ago
- Comments: 9 (3 by maintainers)
Top GitHub Comments
--save_steps has been replaced by --checkpointing_steps and --resume_from_checkpoint (#1668), which correctly store the optimizer state. This issue should no longer happen, please reopen if that's not the case.

Hi, I also have this problem on an A100 GPU. I was also able to solve it earlier by creating a new environment, but the issue reoccurred; I'm not sure which package install/update/change caused it to resurface. A temporary fix of not using save_steps also works here.
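Following the maintainer comment, an updated invocation would use the replacement flags instead of save_steps. A trimmed-down sketch of the reporter's command (the checkpoint interval of 500 is illustrative, not from the original report):

```shell
# --save_steps=500  ->  --checkpointing_steps=500, plus --resume_from_checkpoint
accelerate launch diffusers/examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --instance_data_dir="sona" \
  --output_dir="sona_output" \
  --instance_prompt="sona" \
  --resolution=512 \
  --train_batch_size=1 \
  --checkpointing_steps=500 \
  --resume_from_checkpoint="latest" \
  --max_train_steps=800
```

With `--resume_from_checkpoint="latest"`, the script picks up the most recent checkpoint directory in the output folder, so a run interrupted after step 500 can continue to 800 without restarting.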