
Dreambooth: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasLtMatmul`

See original GitHub issue

Describe the bug

Hi - I’ve spent a couple of days trying to get Dreambooth to run, and I can’t get past this:

Steps:   0%|          | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 765, in <module>
    main()
  File "/scratch/StableDiffusion/diffusers/examples/dreambooth/train_dreambooth.py", line 712, in main
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
    return func(*args, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1673, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/unet_2d_condition.py", line 287, in forward
    emb = self.time_embedding(t_emb)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/diffusers/models/embeddings.py", line 75, in forward
    sample = self.linear_1(sample)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul(ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())
Steps:   0%|          | 0/800 [00:00<?, ?it/s]
[2022-10-31 12:46:24,888] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 711745
[2022-10-31 12:46:24,889] [ERROR] [launch.py:292:sigkill_handler] ['/home/stablediffusion/.conda/envs/diffusers/bin/python', '-u', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800'] exits with return code = 1
Traceback (most recent call last):
  File "/home/stablediffusion/.conda/envs/diffusers/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 827, in launch_command
    deepspeed_launcher(args)
  File "/home/stablediffusion/.conda/envs/diffusers/lib/python3.9/site-packages/accelerate/commands/launch.py", line 540, in deepspeed_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['deepspeed', '--no_local_rank', '--num_gpus', '1', 'train_dreambooth.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--instance_data_dir=training/dataset', '--class_data_dir=classes', '--output_dir=output', '--instance_prompt=MyObject dragon', '--class_prompt=dragon', '--seed=3434554', '--resolution=512', '--center_crop', '--train_batch_size=1', '--mixed_precision=fp16', '--use_8bit_adam', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=100', '--sample_batch_size=4', '--max_train_steps=800']' returned non-zero exit status 1.
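
One caveat before digging in (not from the original report): CUDA errors are reported asynchronously, so the F.linear frame above is where the error surfaced, not necessarily where it occurred. A minimal debugging sketch is to force synchronous launches and rerun:

# Debug-only: synchronous kernel launches make the Python traceback point
# at the kernel that actually failed, at the cost of much slower training.
export CUDA_LAUNCH_BLOCKING=1
# ...then rerun the exact command from the Reproduction section below.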

I can run other CUDA apps just fine. No other GPU-using apps are running.
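
One way to make that check more targeted (a sketch, not from the thread): exercise the same fp16 linear path that failed, in isolation. The 320-to-1280 shape here mirrors the Stable Diffusion v1 time-embedding layer (linear_1) named in the traceback, and an fp16 linear on the GPU typically dispatches to the same cuBLASLt matmul:

# Hypothetical standalone check: if this fails too, the CUDA/cuBLAS/driver
# stack is at fault rather than the training script.
python -c "
import torch
x = torch.randn(4, 320, device='cuda', dtype=torch.float16)
layer = torch.nn.Linear(320, 1280).cuda().half()
print(layer(x).float().sum().item())
"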

Reproduction

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export INSTANCE_DIR="training/dataset"
export CLASS_DIR="classes"
export OUTPUT_DIR="output"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="MyObject dragon" \
  --class_prompt="dragon" \
  --seed=3434554 \
  --resolution=512 \
  --center_crop \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --use_8bit_adam \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --sample_batch_size=4 \
  --max_train_steps=800
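
Note that although this invokes accelerate launch, the traceback above goes through deepspeed/runtime/engine.py, so the saved Accelerate configuration evidently has DeepSpeed enabled. To confirm what the launcher will actually do (a standard Accelerate command, not part of the original report):

# Prints the Accelerate environment, including the saved default config -
# check whether distributed_type is set to DEEPSPEED.
accelerate env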

Logs

See above.

System Info

  • diffusers version: 0.7.0.dev0
  • Platform: Linux-5.19.16-200.fc36.x86_64-x86_64-with-glibc2.35
  • Python version: 3.9.13
  • PyTorch version (GPU?): 1.13.0+cu116 (True)
  • Huggingface_hub version: 0.10.1
  • Transformers version: 4.23.1
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

The GPU is an RTX 3060 (12 GB), hence the need to limit memory usage.
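
Since CUBLAS_STATUS_EXECUTION_FAILED is sometimes an out-of-memory condition in disguise, it is worth confirming how much of that 12 GB is actually free right before launching (a routine check, not from the thread):

# Show VRAM usage; a desktop session or a previous crashed run that is
# still holding memory eats into the 12 GB budget.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv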

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
enn-nafnlaus commented, Nov 9, 2022

Are you using DeepSpeed for training? If so, I would suggest removing the --use_8bit_adam option, as it doesn’t play well with DeepSpeed AFAIK.

Yes, I am - thanks for the tip; I’ll try it out as soon as a (currently running) hypernetwork training run completes and frees up the card! 😃
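
For reference, the suggested workaround amounts to rerunning the Reproduction command with the 8-bit Adam flag dropped (a sketch, assuming everything else in the setup stays the same; without the flag the script falls back to its default AdamW optimizer):

# Same launch as in the Reproduction section, minus --use_8bit_adam,
# which reportedly conflicts with DeepSpeed.
accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="MyObject dragon" \
  --class_prompt="dragon" \
  --seed=3434554 \
  --resolution=512 \
  --center_crop \
  --train_batch_size=1 \
  --mixed_precision="fp16" \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=100 \
  --sample_batch_size=4 \
  --max_train_steps=800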

0 reactions
github-actions[bot] commented, Dec 4, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


