Issues with saving model/optimizer and loading them back
Hello @sgugger ,
I came across multiple related issues regarding this: https://github.com/huggingface/accelerate/issues/242 and https://github.com/huggingface/accelerate/issues/154. They were all closed by https://github.com/huggingface/accelerate/pull/255, but unfortunately that PR doesn't seem to come with much documentation.
I was looking specifically for: saving a model, its optimizer state, LR scheduler state, its random seeds/states, epoch/step count, and other similar state needed to make training runs reproducible and to resume them correctly.
I know there are these very brief docs: here and here, but it looks like there are still a few grey areas regarding their usage that aren't documented yet.
a) In the official example (link), save_pretrained is called only in the main process. Should I be calling save_state/load_state only in the main process too (for both saving and loading)? And after load_state, do I still need to call prepare() to set everything up for multi-GPU training/inference, or does load_state handle all of that internally?
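For concreteness, here is roughly the pattern I have in mind. The ordering of prepare() vs. load_state and the main-process guards are exactly the parts I'm unsure about, so please treat this as a sketch of my assumptions rather than something I know to be correct:

```python
import os
import torch
from accelerate import Accelerator
from transformers import BertForSequenceClassification, get_linear_schedule_with_warmup

accelerator = Accelerator()

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=1000
)
# tiny dummy dataloader just to keep the sketch self-contained
train_loader = torch.utils.data.DataLoader(
    [{"input_ids": torch.ones(16, dtype=torch.long), "labels": torch.tensor(0)}] * 64,
    batch_size=8,
)

# My assumption: call prepare() first, then restore the checkpoint on top of
# the prepared objects. Is that the intended order, or the other way around?
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler
)

if os.path.isdir("checkpoints/latest"):
    # Should this be guarded with `if accelerator.is_main_process:` or run on every rank?
    accelerator.load_state("checkpoints/latest")

for batch in train_loader:
    outputs = model(**batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()

# Same question for saving: every process, or main process only?
accelerator.save_state("checkpoints/latest")
```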
b) Does the save_state method call the model's save_pretrained method internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are any other specialized methods for those, please advise on that as well. If there's a simple toy example that already uses these new checkpointing methods and you can share it, that would be very helpful!
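For reference, this is the save pattern from the official example that I'm using today (continuing the sketch above); my question is whether save_state makes it redundant or whether both are still needed:

```python
# Current saving code, copied from the official no_trainer-style examples:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained("output_dir", save_function=accelerator.save)

# vs. the new checkpointing API: does this also write the model weights
# (in a format from_pretrained can load), or only optimizer/scheduler/RNG state?
accelerator.save_state("checkpoints/latest")
```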
The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?
Request: if some more detailed examples could be added to the docs, that would be really awesome and would help clarify these specifics for users!
Thanks so much in advance! 😃
Top GitHub Comments
cc @muellerzr
Update: I tried deepspeed model.save_checkpoint and model.load_checkpoint. It worked: it can restore optimizer and scheduler states when they are created via DummyOptim and DummyScheduler. It can also load normal optimizer states, but not LambdaLR schedulers.
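Roughly what I ran, as a sketch: I'm assuming a DeepSpeed config file that already defines the optimizer and scheduler (that's what DummyOptim/DummyScheduler stand in for) and sets a concrete micro batch size; the real model and dataloader setup is replaced with a toy stand-in here.

```python
import torch
from accelerate import Accelerator
from accelerate.utils import DummyOptim, DummyScheduler

# Launched with a DeepSpeed config that defines the optimizer and scheduler.
accelerator = Accelerator()

model = torch.nn.Linear(10, 2)                 # toy stand-in for the real model
optimizer = DummyOptim(model.parameters())     # real optimizer comes from the DeepSpeed config
scheduler = DummyScheduler(optimizer)          # same for the scheduler

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# After prepare() the model is wrapped as a DeepSpeed engine, so DeepSpeed's own
# checkpoint methods are available; they save and restore optimizer/scheduler state too.
model.save_checkpoint("ds_checkpoints")
# ...and on resume:
model.load_checkpoint("ds_checkpoints")
```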