Issues with saving model/optimizer and loading them back
Hello @sgugger ,
I came across multiple related issues regarding this: https://github.com/huggingface/accelerate/issues/242 and https://github.com/huggingface/accelerate/issues/154. They were all closed by https://github.com/huggingface/accelerate/pull/255, but unfortunately that PR doesn't seem to come with much documentation.
I was looking specifically for: saving a model, its optimizer state, LR scheduler state, its random seeds/states, epoch/step count, and other similar state needed to make training runs reproducible and to resume them correctly.
I know there are these very brief docs: here and here, but it looks like there are still a few grey areas regarding their usage that aren't documented yet.
a) In the official example (link), save_pretrained is called only in the main process. Should I be calling save_state/load_state only in the main process too (for both saving and loading)? And after load_state, do I still need to call prepare() to set everything up for multi-GPU training/inference, or does load_state handle all of that internally?
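For concreteness, here is roughly the pattern I have in mind. The ordering of prepare() vs. load_state and the main-process guards are exactly the parts I'm unsure about, so please treat this as a sketch of my assumptions rather than something I know to be correct:

```python
import os
import torch
from accelerate import Accelerator
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

accelerator = Accelerator()

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=1000
)
# tiny dummy dataloader just to keep the sketch self-contained
train_loader = torch.utils.data.DataLoader(
    [{"input_ids": torch.ones(16, dtype=torch.long), "labels": torch.tensor(0)}] * 64,
    batch_size=8,
)

# My assumption: call prepare() first, then restore the checkpoint on top of
# the prepared objects. Is that the intended order, or the other way around?
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

if os.path.isdir("checkpoints/latest"):
    # Should this be guarded with `if accelerator.is_main_process:` or run on every rank?
    accelerator.load_state("checkpoints/latest")

for batch in train_loader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

# Same question for saving: every process, or main process only?
accelerator.save_state("checkpoints/latest")
```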
b) Does the save_state method call the model's save_pretrained method internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are any other specialized methods for those, please advise on that as well. If there's a simple toy example that already uses these new checkpointing methods and you can share it, that would be very helpful!
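For reference, this is the save pattern from the official example that I'm using today (continuing the sketch above); my question is whether save_state makes it redundant or whether both are still needed:

```python
# Current saving code, copied from the official no_trainer-style examples:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("output_dir", save_function=accelerator.save)

# vs. the new checkpointing API: does this also write the model weights
# (in a format from_pretrained can load), or only optimizer/scheduler/RNG state?
accelerator.save_state("checkpoints/latest")
```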
The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?
Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify these specifics for users!
Thanks so much in advance! 😃
Top GitHub Comments
cc @muellerzr
Update: I tried deepspeed model.save_checkpoint and model.load_checkpoint. It worked: it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler. It can also load normal optimizer states, but not LambdaLR schedulers.
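Roughly what I ran, as a sketch: I'm assuming a DeepSpeed config file that already defines the optimizer and scheduler (that's what DummyOptim/DummyScheduler stand in for) and sets a concrete micro batch size; the real model and dataloader setup is replaced with a toy stand-in here.

```python
import torch
from accelerate import Accelerator
from accelerate.utils import DummyOptim, DummyScheduler

# Launched with a DeepSpeed config that defines the optimizer and scheduler.
accelerator = Accelerator()

model = torch.nn.Linear(10, 2)                 # toy stand-in for the real model
optimizer = DummyOptim(model.parameters())     # real optimizer comes from the DeepSpeed config
scheduler = DummyScheduler(optimizer)          # same for the scheduler

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# After prepare() the model is wrapped as a DeepSpeed engine, so DeepSpeed's own
# checkpoint methods are available; they save and restore optimizer/scheduler state too.
model.save_checkpoint("ds_checkpoints")
# ...and on resume:
model.load_checkpoint("ds_checkpoints")
```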