
How to save and load the model, optimizer, and scheduler state dictionaries?

See original GitHub issue

How do I save and load the model, optimizer, and scheduler state dictionaries after they have gone through accelerator.prepare()?

For the model

I used the unwrap function as described in the documentation:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path,
                                save_function=accelerator.save,
                                state_dict=accelerator.get_state_dict(model))

However, I get the following error when loading the model with model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config) and passing it through accelerator.prepare():

    model, optimizer, training_loader, dev_loader = accelerator.prepare(
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 227, in _prepare_one
    return self.prepare_model(obj)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 285, in prepare_model
    model = model.to(self.device)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

For the optimizer and scheduler

Currently, torch.save(optimizer.state_dict(), 'exp1/file.opt') works for saving, but loading with optimizer.load_state_dict(torch.load('exp1/file.opt')) gives RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable.

Does accelerator.unwrap_model() work the same way for the optimizer as for the model?

accelerator.wait_for_everyone()
unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), filename)

Using torch.save(scheduler.state_dict(), 'exp1/sch') and loading with scheduler.load_state_dict(torch.load('path')) is working.
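For what it's worth, the RuntimeError above typically means the checkpoint contains CUDA tensors and torch.load tries to restore them onto a GPU that is unavailable at load time; passing map_location="cpu" is a common workaround. A minimal sketch with plain PyTorch (toy model and a temporary directory standing in for the real objects and paths):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Toy model and optimizer standing in for the real training objects.
model = nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

ckpt_dir = tempfile.mkdtemp()  # hypothetical stand-in for 'exp1/'
opt_path = os.path.join(ckpt_dir, "file.opt")

# Saving works the same regardless of device.
torch.save(optimizer.state_dict(), opt_path)

# Loading: map_location="cpu" forces all tensors onto the CPU first,
# avoiding "all CUDA-capable devices are busy or unavailable" when the
# GPU that produced the checkpoint is not accessible at load time.
state = torch.load(opt_path, map_location="cpu")
optimizer.load_state_dict(state)
print(sorted(state.keys()))  # an optimizer state dict has 'state' and 'param_groups'
```

After loading on CPU, the optimizer state is moved to the right device automatically the first time optimizer.step() runs on device-resident parameters.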

EDIT: I updated the original issue with more details and the exact error messages.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
sgugger commented, Feb 24, 2022

It’s under development in #255; we’re hoping to have it merged next week.

1 reaction
sgugger commented, Sep 2, 2021

You should use accelerator.save everywhere instead of torch.save (though I must say I have never seen that particular error). For reloading, you should be able to load a state dict into the unwrapped model or the optimizer. If you do

model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

You create a brand new model, so you should pass it to the prepare method.

Note that adding checkpointing utility in Accelerate is on the roadmap, to make all of this easier.
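Putting that advice together, here is a hedged sketch of a full save/load round trip. Plain PyTorch is shown for the checkpoint mechanics; in an Accelerate script you would unwrap the model first, call accelerator.save in place of torch.save, and pass the freshly restored model and optimizer through accelerator.prepare again (the toy module and paths are illustrative stand-ins):

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")

# Save all three state dicts in one file so they stay in sync.
# In an Accelerate script, use accelerator.save and the unwrapped model.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    },
    ckpt,
)

# Reload: build fresh objects, restore their states on CPU, then
# (with Accelerate) pass model and optimizer through accelerator.prepare.
new_model = nn.Linear(4, 2)
new_optimizer = torch.optim.AdamW(new_model.parameters(), lr=1e-3)
new_scheduler = torch.optim.lr_scheduler.StepLR(new_optimizer, step_size=10)

state = torch.load(ckpt, map_location="cpu")
new_model.load_state_dict(state["model"])
new_optimizer.load_state_dict(state["optimizer"])
new_scheduler.load_state_dict(state["scheduler"])
```

Bundling the three state dicts in a single checkpoint avoids ever restoring a model from one training step and an optimizer from another.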


Top Results From Across the Web

  • Saving model AND optimiser AND scheduler - PyTorch Forums
  • Saving optimizer - Accelerate - Hugging Face Forums
  • Save and load model optimizer state - python - Stack Overflow
  • Saving And Loading Models - PyTorch Beginner 17
  • On saving and loading - Stable Baselines3 - Read the Docs
