Dreambooth doesn't train on 8GB
Describe the bug
Following the example featured in the repo, training runs out of memory (OOM) while DeepSpeed is initializing the optimizer. Tested on a 3080 10GB with 64GB RAM, under both WSL2 and native Linux.
Reproduction
Follow the pastebin for setup purposes (on WSL2), or just try it yourself https://pastebin.com/0NHA5YTP
Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-10-11 17:16:38,700] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-10-11 17:16:48,338] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-11 17:16:50,220] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-11 17:16:50,271] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-11 17:16:50,272] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-11 17:16:50,272] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.15820693969726562 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-11 17:16:52,613] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-11 17:16:52,614] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 3.27 GB Max_CA 3 GB
[2022-10-11 17:16:52,614] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 7.68 GB, percent = 16.3%
Traceback (most recent call last):
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 598, in <module>
    main()
  File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 478, in main
    unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare
    result = self._prepare_deepspeed(*args)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 890, in _prepare_deepspeed
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/__init__.py", line 124, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1144, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1395, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer(
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
    self.initialize_optimizer_states()
  File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 599, in initialize_optimizer_states
    i].grad = single_grad_partition.pin_memory(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /root/anaconda3/envs/diffusers-ttl/bin/python
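The traceback stops at `single_grad_partition.pin_memory(`, that is, while a page-locked (pinned) host buffer is being requested for the gradient partition, with CPU offload enabled. A rough size estimate can be derived from the partition size printed in the log (859520964 elements). The sketch below assumes 4 bytes per element (fp32) for the pinned gradient buffer and for each AdamW state tensor; the log does not show the dtype, so the figures are an estimate only.

```python
# Rough size estimate for the buffers involved in the failing step.
# Assumption (not shown in the log): these buffers are fp32, 4 bytes per element.
PARTITION_ELEMENTS = 859_520_964  # from "partition count [1] and sizes[(859520964, False)]"
BYTES_PER_ELEMENT = 4             # assumed fp32


def gib(num_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Buffer requested by the pin_memory() call in the traceback.
grad_buffer_bytes = PARTITION_ELEMENTS * BYTES_PER_ELEMENT

# AdamW keeps two state tensors per parameter (exp_avg and exp_avg_sq).
adam_state_bytes = 2 * PARTITION_ELEMENTS * BYTES_PER_ELEMENT

print(f"pinned gradient buffer: {gib(grad_buffer_bytes):.2f} GiB")
print(f"AdamW states: {gib(adam_state_bytes):.2f} GiB")
```

Under these assumptions, the single `pin_memory()` call asks for a little over 3 GiB of page-locked host memory in one allocation, which fits the question raised in the comments about how much memory the platform allows to be pinned.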
System Info
3080 10GB + 64GB RAM, WSL2 and Linux
Issue Analytics
- State:
- Created a year ago
- Comments:50 (10 by maintainers)
Top GitHub Comments
To reduce VRAM usage while generating class images, try `--sample_batch_size=1` (the default is 4). Or generate them on the CPU by using `accelerate launch --cpu train_dreambooth.py ...`, then stop the script and restart the training on the GPU.
Did you test to see if the 22H2 update on Windows 10 increased the amount of memory pinning? If the update didn't do so, then it still won't work on Windows 10.
Try this test mentioned above,
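Both the traceback and the question about memory pinning point at page-locked host memory, so one way to check a given machine is a small probe like the following. This is a minimal sketch and not necessarily the test referenced in the thread; it assumes PyTorch with CUDA support is installed, and `try_pin` is a hypothetical helper name.

```python
# Hypothetical probe: can this process page-lock a host buffer of a given size?
# try_pin is an illustrative helper, not part of DeepSpeed, Accelerate, or diffusers.
from typing import Optional


def try_pin(num_elements: int) -> Optional[bool]:
    """Return True if pinning succeeds, False if it fails, None if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return None
    try:
        # Allocate an fp32 host tensor, then ask for a pinned copy of it.
        buf = torch.zeros(num_elements, dtype=torch.float32).pin_memory()
    except (RuntimeError, MemoryError):
        # Raised when the allocation or the pinning request is refused,
        # and also when no CUDA device is available.
        return False
    return buf.is_pinned()
```

Calling `try_pin(859_520_964)` would probe the partition size from the log above. Note that this temporarily needs that much free RAM twice over, once for the pageable tensor and once for the pinned copy.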