
Training memory optimizations not working on AMD hardware

See original GitHub issue

Describe the bug

The Dreambooth training example has a section about training on a 16GB GPU. Since Radeon Navi 21 series cards all have 16GB of VRAM, this would in theory greatly increase the amount of hardware capable of training models.

The problem is that, at least out of the box, neither of the optimizations `--gradient_checkpointing` nor `--use_8bit_adam` seems to support AMD cards.

Reproduction

Using the example command with pytorch rocm 5.1.1 (`pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/rocm5.1.1`)

`--gradient_checkpointing`: returns the error `'UNet2DConditionModel' object has no attribute 'enable_gradient_checkpointing'`.

`--use_8bit_adam`: throws a handful of CUDA errors; see the Logs section below for the main part. (Is bitsandbytes Nvidia-specific, and if so, is there an AMD implementation available?)
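As a quick diagnostic (not part of the original report), the first failure can be checked without loading any model weights: the attribute error suggests the installed diffusers release simply predates the `enable_gradient_checkpointing` method on `UNet2DConditionModel`. A minimal, hedged check:

```python
# Hypothetical diagnostic: test whether the installed diffusers version
# exposes gradient checkpointing on UNet2DConditionModel. In diffusers
# 0.3.0 (the version in this report) the method does not exist yet.
supported = None
try:
    from diffusers import UNet2DConditionModel
    supported = hasattr(UNet2DConditionModel, "enable_gradient_checkpointing")
except ImportError:
    pass  # diffusers not installed in this environment

print("gradient checkpointing supported:", supported)
```

If this prints `False`, the fix is a diffusers upgrade rather than anything AMD-specific.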

Logs

Using `--gradient_checkpointing`:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--num_cpu_threads_per_process` was set to `12` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
Traceback (most recent call last):
  File "/home/foobar/diffusers/examples/dreambooth/train_dreambooth.py", line 606, in <module>
    main()
  File "/home/foobar/diffusers/examples/dreambooth/train_dreambooth.py", line 408, in main
    unet.enable_gradient_checkpointing()
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'UNet2DConditionModel' object has no attribute 'enable_gradient_checkpointing'
Traceback (most recent call last):
  File "/home/foobar/diffusers/.venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Using `--use_8bit_adam`:

...
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/cuda_setup/paths.py:20: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
  warn(
WARNING: No libcudart.so found! Install CUDA or the cudatoolkit package (anaconda)!
CUDA SETUP: Loading binary /home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so...
/home/foobar/pyenvtest/.venv/lib/python3.9/site-packages/bitsandbytes/cextension.py:48: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
...
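For context (this check is not in the original report): the warnings above come from bitsandbytes searching for `libcudart.so`, which a ROCm system does not ship. One way to confirm that PyTorch itself is the ROCm build and sees the GPU, independent of bitsandbytes, is:

```python
import torch

# On a ROCm build of PyTorch, torch.version.hip is set and
# torch.version.cuda is None; torch.cuda.is_available() still reports
# True because the ROCm backend reuses the CUDA device API.
print("HIP runtime:", torch.version.hip)
print("CUDA runtime:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
```

If this reports a HIP runtime with an available GPU, the failure is confined to bitsandbytes' own CUDA-only native extension, not to the PyTorch install.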


### System Info

- `diffusers` version: 0.3.0
- Platform: Linux-5.15.67-x86_64-with-glibc2.34
- Python version: 3.9.13
- PyTorch version (GPU?): 1.12.1+rocm5.1.1 (True)
- Huggingface_hub version: 0.9.1
- Transformers version: 4.22.2
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 1
  • Comments: 17 (4 by maintainers)

Top GitHub Comments

4 reactions
patrickvonplaten commented, Oct 4, 2022

We will release a new diffusers version very soon!

2 reactions
hopibel commented, Oct 2, 2022

Not related to AMD

The `use_8bit_adam` problems potentially are, as bitsandbytes includes a C extension that wraps some CUDA functions, i.e., it doesn't run through pytorch-rocm. Not really anything that can be fixed on this end, though.
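To illustrate the practical consequence of this (a sketch, not the exact training-script code): since bitsandbytes' 8-bit Adam depends on its CUDA extension, the usual pattern on hardware where it cannot load is to fall back to the stock `torch.optim.AdamW`, at the cost of the memory savings:

```python
import torch

# Sketch: select bitsandbytes' 8-bit AdamW when it imports cleanly,
# otherwise fall back to the standard full-precision AdamW. Note that
# on a ROCm system bitsandbytes may import but load its CPU-only binary
# (as in the logs above), so a successful import alone is not a
# guarantee that the 8-bit optimizer will actually work on the GPU.
try:
    import bitsandbytes as bnb
    optimizer_cls = bnb.optim.AdamW8bit
except ImportError:
    optimizer_cls = torch.optim.AdamW

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = optimizer_cls(params, lr=1e-4)
print("using optimizer:", optimizer_cls.__name__)
```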
