[Feature Request] Dreambooth - Save intermediate checkpoints
Is your feature request related to a problem? Please describe.
Dreambooth output quality can change drastically between step counts, including for the worse if the chosen learning rate is too high for the step count or for the number of training / regularization images. This implementation only saves the model after training is finished, which requires full reruns to compare different step counts and makes it impossible to salvage an overfitted model.
Describe the solution you’d like
A configurable way to save the model at certain step counts and continue training afterwards.
Optimally, the script would accept two new parameters: one to specify the step interval to save at, and one to specify how many checkpoints to keep before overwriting. In some of the popular non-diffusers implementations, such as https://github.com/XavierXiao/Dreambooth-Stable-Diffusion and its forks, these arguments are called `every_n_train_steps` and `save_top_k`. However, since this implementation doesn't generate intermediate checkpoints by default, it would probably be better to find more descriptive names.
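For illustration, a rough sketch of what such a hook could look like in this training script is shown below. The argument name `save_every_n_steps`, the helper name, and the surrounding variables (`args`, `accelerator`, `unet`, `text_encoder`, `global_step`) are assumptions that only mirror what the script already does for its final save; this is not an existing interface.

```python
from diffusers import StableDiffusionPipeline

def maybe_save_checkpoint(args, accelerator, unet, text_encoder, global_step):
    """Save a full pipeline snapshot every `args.save_every_n_steps` optimizer steps.

    Sketch only: `save_every_n_steps` is a hypothetical argument, and the other
    parameters are assumed to match the variables used in the training loop.
    """
    if not args.save_every_n_steps or global_step % args.save_every_n_steps != 0:
        return
    # Make sure all processes have finished the current step before saving.
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        pipeline = StableDiffusionPipeline.from_pretrained(
            args.pretrained_model_name_or_path,
            # Pass the unwrapped modules, not the accelerate-prepared ones.
            unet=accelerator.unwrap_model(unet),
            text_encoder=accelerator.unwrap_model(text_encoder),
        )
        pipeline.save_pretrained(f"{args.output_dir}/checkpoint-{global_step}")
```

Training would simply continue after the call, which also covers the "continue training afterwards" part of the request; the keep-N-checkpoints half is sketched further below.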
Describe alternatives you've considered
Technically, it would also be possible to manually resume training from a previous checkpoint and use low step counts for each run. However, this requires additional effort and is hard to do in some Colabs based on this implementation, so an integrated solution would be preferred.
Additional context
I tried a naive implementation by simply calling `pipeline.save_pretrained` after every X steps; however, this leads to an error after successfully saving a few files:
File "/diffusers/pipeline_utils.py", line 158, in save_pretrained save_method = getattr(sub_model, save_method_name)
TypeError: getattr(): attribute name must be string
I called the method in the same way as the final save, including a call to `accelerator.wait_for_everyone()` beforehand, as suggested in the Accelerate documentation. Since I am not familiar with the Accelerate and Stable Diffusion architectures, I haven't been able to find out why so far, but from the error message it seems that `StableDiffusionPipeline` could not find a valid save method name because some information about the model is missing at this point.
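One plausible explanation (a guess based on the traceback, not something confirmed here): the `TypeError` means `save_method_name` ended up as `None` for one of the pipeline components, which happens when that component's class is not one `save_pretrained` recognizes, for example an accelerate/DDP-wrapped module or a component that is `None` (as in the safety-checker case mentioned in the comments below). If the wrapped module is the culprit, unwrapping it before constructing the pipeline should avoid the error; in the fragment below, `args`, `accelerator`, `unet`, and `save_path` stand in for the training-loop variables.

```python
from diffusers import StableDiffusionPipeline

# Suspected failing pattern: the UNet handed to the pipeline is still the
# module returned by accelerator.prepare(), so its class is not recognized.
#
#     pipeline = StableDiffusionPipeline.from_pretrained(
#         args.pretrained_model_name_or_path, unet=unet)
#     pipeline.save_pretrained(save_path)
#     # -> TypeError: getattr(): attribute name must be string

# Unwrapping first lets save_pretrained see the real model class:
pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    unet=accelerator.unwrap_model(unet),
)
pipeline.save_pretrained(save_path)
```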
Top GitHub Comments
@DominikDoom Thanks a lot for the issue, working on adding intermediate checkpoint saving.
@Cyberes It seems that the safety checker is not saved in the model that you are passing; that's what the error indicates. Make sure the safety checker is also saved there. Feel free to open an issue if the error persists even after that.
Fixed by #1668 (except keeping the last `n` checkpoints, to be adapted from https://github.com/huggingface/accelerate/issues/914).
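For the still-missing "keep only the last n checkpoints" part, a minimal sketch of one way to do it is shown below. It assumes intermediate checkpoints are written to directories named `checkpoint-<step>` inside the output directory (a naming convention chosen here for illustration, not something defined by the linked fix):

```python
import os
import shutil

def prune_checkpoints(output_dir: str, keep_last_n: int) -> None:
    """Delete the oldest 'checkpoint-<step>' directories, keeping the newest `keep_last_n`.

    Assumes keep_last_n >= 1 and the hypothetical 'checkpoint-<global_step>'
    directory layout described above.
    """
    checkpoints = sorted(
        (d for d in os.listdir(output_dir)
         if d.startswith("checkpoint-") and os.path.isdir(os.path.join(output_dir, d))),
        key=lambda name: int(name.split("-")[-1]),  # order by training step, oldest first
    )
    for stale in checkpoints[:-keep_last_n]:
        shutil.rmtree(os.path.join(output_dir, stale))
```

Calling e.g. `prune_checkpoints(args.output_dir, 3)` right after each intermediate save would keep only the three most recent checkpoints, i.e. pruning by recency rather than by a monitored metric as `save_top_k` does in the Lightning-based forks.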