
Training stops when reaching 500 steps on T4

See original GitHub issue

I'm using an AWS T4 GPU, and training always stops when it reaches step 500.

The error I'm getting:

Traceback (most recent call last):
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 719, in <module>
    main(args)
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 638, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/sd/diffusers/src/diffusers/models/unet_2d_condition.py", line 375, in forward
    sample = self.conv_in(sample)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 460, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Steps:  62%|██████████████████▊           | 500/800 [19:51<11:55,  2.38s/it, loss=0.598, lr=8.15e-7]
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'diffusers/examples/dreambooth/train_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2', '--instance_data_dir=sona', '--class_data_dir=class/Women', '--output_dir=sona_output', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=sona', '--class_prompt=a photo of person', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=2e-6', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=800']' returned non-zero exit status 1.

The training command I'm using:

python3 diffusers/examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2"  \
  --instance_data_dir="sona" \
  --class_data_dir="class/Women" \
  --output_dir="sona_output" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="sona" \
  --class_prompt="a photo of person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

I have 15360 MiB of GPU VRAM.
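
For context, the RuntimeError in the traceback is PyTorch's generic dtype-mismatch error: half-precision latents are reaching a convolution whose parameters are still float32. Below is a minimal, standalone sketch (illustrative shapes, not taken from the training script) that reproduces the same class of error and shows how aligning the dtypes avoids it:

import torch
import torch.nn as nn

# A float32 convolution, analogous to unet.conv_in in the traceback above.
conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)

# Half-precision input, analogous to latents produced under fp16 mixed precision.
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)

try:
    conv(latents)                 # dtype mismatch between input and parameters
except RuntimeError as err:
    print(err)                    # same family of error as in the traceback

# Aligning the dtypes avoids it, e.g. by casting the input to the module's dtype.
out = conv(latents.to(next(conv.parameters()).dtype))
print(out.dtype)                  # torch.float32

In the issue this only surfaces at step 500, which lines up with the save_steps checkpointing discussed in the comments below.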

Issue Analytics

  • State: closed
  • Created: 9 months ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
pcuenca commented, Dec 13, 2022

--save_steps has been replaced by --checkpointing_steps and --resume_from_checkpoint (#1668), which correctly store optimizer state. This issue should no longer happen, please reopen if that’s not the case.
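For reference, a sketch of how the command from the issue might look with the newer flags named in that comment (the step values are illustrative, and --resume_from_checkpoint is only needed when picking up an interrupted run):

# Same command as in the issue, using the newer checkpointing flags:
python3 diffusers/examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --instance_data_dir="sona" \
  --class_data_dir="class/Women" \
  --output_dir="sona_output" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="sona" \
  --class_prompt="a photo of person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --checkpointing_steps=500 \
  --max_train_steps=800
# To continue an interrupted run, add e.g. --resume_from_checkpoint="latest"
# (or point it at a specific checkpoint directory inside --output_dir).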

0 reactions
kowalgregy commented, Dec 11, 2022

Hi, I also have this problem on an A100 GPU. I was able to solve it earlier by creating a new environment, but the issue reoccurred, and I'm not sure which package install/update/change caused it to resurface. A temporary fix of not using save_steps also works here.

Read more comments on GitHub >
