Issues with saving model/optimizer and loading them back

Hello @sgugger ,

I came across several related issues about this: https://github.com/huggingface/accelerate/issues/242 and https://github.com/huggingface/accelerate/issues/154. They were all closed by this PR: https://github.com/huggingface/accelerate/pull/255, but unfortunately the PR doesn’t seem to come with much documentation.

Specifically, I am looking for a way to save a model, its optimizer state, its LR scheduler state, its random seeds/states, the epoch/step count, and other similar state, so that training runs are reproducible and can be resumed correctly.
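To make "resuming correctly" concrete, here is a minimal, framework-agnostic sketch using only Python's standard library. It is not accelerate's API: `save_checkpoint`, `load_checkpoint`, and `train` are hypothetical names chosen for illustration, and the "weights" are a plain list of floats. The point it illustrates is that the RNG state and the step count have to be saved together with the weights, otherwise a resumed run diverges from an uninterrupted one.

```python
import pickle
import random
import tempfile
from pathlib import Path


def save_checkpoint(path, step, weights):
    """Store the step count, toy 'weights', and Python RNG state in one file."""
    state = {"step": step, "weights": weights, "rng": random.getstate()}
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    """Restore the RNG state and return (step, weights)."""
    with open(path, "rb") as f:
        state = pickle.load(f)
    random.setstate(state["rng"])
    return state["step"], state["weights"]


def train(weights, start_step, end_step):
    """Toy 'training': each step adds random noise to every weight."""
    for _ in range(start_step, end_step):
        weights = [w + random.random() for w in weights]
    return weights


# Reference run: 10 uninterrupted steps.
random.seed(0)
full_run = train([0.0, 0.0], 0, 10)

# Interrupted run: 5 steps, checkpoint, then resume for 5 more.
ckpt_path = Path(tempfile.mkdtemp()) / "checkpoint.pkl"
random.seed(0)
partial = train([0.0, 0.0], 0, 5)
save_checkpoint(ckpt_path, 5, partial)

random.seed(1234)  # simulate a fresh process with unrelated RNG state
step, restored = load_checkpoint(ckpt_path)
resumed_run = train(restored, step, 10)

# The resumed run matches the uninterrupted one only because the RNG
# state was restored along with the weights and the step count.
print(resumed_run == full_run)
```

In a PyTorch setting, the analogous pieces would presumably be the `state_dict()` of the model, optimizer, and scheduler, plus `torch.get_rng_state()` (and the CUDA equivalent) for the random state.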

I know there are some very brief docs here and here, but a few grey areas regarding their usage are still undocumented.

a) The official example (link) calls save_pretrained only in the main process. Should I likewise call save_state and load_state only in the main process? And after load_state, do I have to call prepare() to set everything up for multi-GPU training/inference, or does load_state handle that internally?

b) Does the save_state method call the model’s save_pretrained method internally, or do I have to call both?

FWIW, I’m using HF’s BERT and other pretrained models from the transformers library, so if there are any specialized methods for those, please advise. If there’s a simple toy example that already uses these new checkpointing methods, sharing it would be very helpful!

The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

Request: If some more detailed examples could be added to the docs, that’d be really awesome and would help clarify these specifics for users!

Thanks so much in advance! 😃

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 18 (9 by maintainers)

Top GitHub Comments

sgugger commented, Jul 4, 2022 (2 reactions)

cccntu commented, Jul 6, 2022 (1 reaction)

Update: I tried DeepSpeed’s model.save_checkpoint and model.load_checkpoint, and they worked: they restore the optimizer and scheduler states when those are created via DummyOptim and DummyScheduler.

They can also load normal optimizer states, but not LambdaLR schedulers.
