
Training stops when reaching 500 steps on T4

See original GitHub issue

I'm using an AWS T4 GPU, and training always stops when it reaches step 500.

The error I'm getting:

Traceback (most recent call last):
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 719, in <module>
    main(args)
  File "diffusers/examples/dreambooth/train_dreambooth.py", line 638, in main
    model_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/sd/diffusers/src/diffusers/models/unet_2d_condition.py", line 375, in forward
    sample = self.conv_in(sample)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 460, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
Steps:  62%|██████████████████▊           | 500/800 [19:51<11:55,  2.38s/it, loss=0.598, lr=8.15e-7]
Traceback (most recent call last):
  File "/home/ec2-user/.local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'diffusers/examples/dreambooth/train_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2', '--instance_data_dir=sona', '--class_data_dir=class/Women', '--output_dir=sona_output', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=sona', '--class_prompt=a photo of person', '--resolution=512', '--train_batch_size=1', '--gradient_accumulation_steps=2', '--gradient_checkpointing', '--use_8bit_adam', '--learning_rate=2e-6', '--lr_scheduler=polynomial', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=800']' returned non-zero exit status 1.

The training command I'm using:

python3 diffusers/examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2"  \
  --instance_data_dir="sona" \
  --class_data_dir="class/Women" \
  --output_dir="sona_output" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="sona" \
  --class_prompt="a photo of person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

I have 15360 MiB of GPU VRAM.
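
For context, the RuntimeError in the traceback is PyTorch's generic dtype-mismatch error: half-precision latents are reaching a convolution whose parameters are still float32. Below is a minimal, standalone sketch (illustrative shapes, not taken from the training script) that reproduces the same class of error and shows how aligning the dtypes avoids it:

import torch
import torch.nn as nn

# A float32 convolution, analogous to unet.conv_in in the traceback above.
conv = nn.Conv2d(4, 320, kernel_size=3, padding=1)

# Half-precision input, analogous to latents produced under fp16 mixed precision.
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)

try:
    conv(latents)                 # dtype mismatch between input and parameters
except RuntimeError as err:
    print(err)                    # same family of error as in the traceback

# Aligning the dtypes avoids it, e.g. by casting the input to the module's dtype.
out = conv(latents.to(next(conv.parameters()).dtype))
print(out.dtype)                  # torch.float32

In the issue this only surfaces at step 500, which lines up with the save_steps checkpointing discussed in the comments below.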

Issue Analytics

  • State: closed
  • Created: 9 months ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
pcuenca commented, Dec 13, 2022

--save_steps has been replaced by --checkpointing_steps and --resume_from_checkpoint (#1668), which correctly store optimizer state. This issue should no longer happen, please reopen if that’s not the case.
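For reference, a sketch of how the command from the issue might look with the newer flags named in that comment (the step values are illustrative, and --resume_from_checkpoint is only needed when picking up an interrupted run):

# Same command as in the issue, using the newer checkpointing flags:
python3 diffusers/examples/dreambooth/train_dreambooth.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2" \
  --instance_data_dir="sona" \
  --class_data_dir="class/Women" \
  --output_dir="sona_output" \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="sona" \
  --class_prompt="a photo of person" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=2e-6 \
  --lr_scheduler="polynomial" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --checkpointing_steps=500 \
  --max_train_steps=800
# To continue an interrupted run, add e.g. --resume_from_checkpoint="latest"
# (or point it at a specific checkpoint directory inside --output_dir).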

0 reactions
kowalgregy commented, Dec 11, 2022

Hi, I also have this problem on an A100 GPU. I was able to solve it earlier by creating a new environment, but the issue reoccurred, and I'm not sure which package install/update/change caused it to resurface. A temporary fix of not using save_steps also works here.

Read more comments on GitHub >
