
How to save and load the model, optimizer, and scheduler state dictionaries?

See original GitHub issue

How do I save and load the model, optimizer, and scheduler state dictionaries after they have gone through accelerator.prepare()?

For the model

I used the unwrap function as described in the documentation:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path,
                                save_function=accelerator.save,
                                state_dict=accelerator.get_state_dict(model))

However, I get the following error when loading the model with model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config) and passing it through accelerator.prepare():

    model, optimizer, training_loader, dev_loader = accelerator.prepare(
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 227, in _prepare_one
    return self.prepare_model(obj)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 285, in prepare_model
    model = model.to(self.device)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

For the optimizer and scheduler

Currently, torch.save(optimizer.state_dict(), 'exp1/file.opt') works for saving, but loading with optimizer.load_state_dict(torch.load('exp1/file.opt')) gives RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable.

Does accelerator.unwrap_model() work the same way for the optimizer as for the model?

accelerator.wait_for_everyone()
unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), filename)

Using torch.save(scheduler.state_dict(), 'exp1/sch') and loading with scheduler.load_state_dict(torch.load('path')) is working.
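For what it's worth, the RuntimeError above typically means the checkpoint contains CUDA tensors and torch.load tries to restore them onto a GPU that is unavailable at load time; passing map_location="cpu" is a common workaround. A minimal sketch with plain PyTorch (toy model and a temporary directory standing in for the real objects and paths):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy model and optimizer standing in for the real training objects.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

ckpt_dir = tempfile.mkdtemp()  # hypothetical stand-in for 'exp1/'
opt_path = os.path.join(ckpt_dir, "file.opt")

# Saving works the same regardless of device.
torch.save(optimizer.state_dict(), opt_path)

# Loading: map_location="cpu" forces all tensors onto the CPU first,
# avoiding "all CUDA-capable devices are busy or unavailable" when the
# GPU that produced the checkpoint is not accessible at load time.
state = torch.load(opt_path, map_location="cpu")
optimizer.load_state_dict(state)
print(sorted(state.keys()))  # an optimizer state dict has 'state' and 'param_groups'
```

After loading on CPU, the optimizer state is moved to the right device automatically the first time optimizer.step() runs on device-resident parameters.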

EDIT: I updated the original issue with more details and the exact error messages.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Feb 24, 2022

It’s under development in #255; we’re hoping to have it merged next week.

1 reaction
sgugger commented, Sep 2, 2021

You should use accelerator.save everywhere instead of torch.save (though I must say I have never seen that particular error). For reloading, you should be able to load a state dict into the unwrapped model or the optimizer. If you do

model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

You create a brand new model, so you should pass it to the prepare method.

Note that adding checkpointing utility in Accelerate is on the roadmap, to make all of this easier.
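Putting that advice together, here is a hedged sketch of a full save/load round trip. Plain PyTorch is shown for the checkpoint mechanics; in an Accelerate script you would unwrap the model first, call accelerator.save in place of torch.save, and pass the freshly restored model and optimizer through accelerator.prepare again (the toy module and paths are illustrative stand-ins):

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save all three state dicts in one file so they stay in sync.
# In an Accelerate script, use accelerator.save and the unwrapped model.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    },
    ckpt,
)

# Reload: build fresh objects, restore their states on CPU, then
# (with Accelerate) pass model and optimizer through accelerator.prepare.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-3)
new_scheduler = torch.optim.lr_scheduler.StepLR(new_optimizer, step_size=10)

state = torch.load(ckpt, map_location="cpu")
new_model.load_state_dict(state["model"])
new_optimizer.load_state_dict(state["optimizer"])
new_scheduler.load_state_dict(state["scheduler"])
```

Bundling the three state dicts in a single checkpoint avoids ever restoring a model from one training step and an optimizer from another.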


Top Results From Across the Web

  • Saving model AND optimiser AND scheduler - PyTorch Forums
  • Saving optimizer - Accelerate - Hugging Face Forums
  • Save and load model optimizer state - python - Stack Overflow
  • Saving And Loading Models - PyTorch Beginner 17
  • On saving and loading - Stable Baselines3 - Read the Docs
