How to save and load model, optimizer, and scheduler state dictionaries?
How do I save and load the model, optimizer, and scheduler state dictionaries after they have gone through accelerator.prepare()?
For the model
I used the unwrap function as described in the documentation:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path,
                                save_function=accelerator.save,
                                state_dict=accelerator.get_state_dict(model))
However, I get the following error when loading the model:
model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)
model, optimizer, training_loader, dev_loader = accelerator.prepare(
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in prepare
result = tuple(self._prepare_one(obj) for obj in args)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in <genexpr>
result = tuple(self._prepare_one(obj) for obj in args)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 227, in _prepare_one
return self.prepare_model(obj)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 285, in prepare_model
model = model.to(self.device)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
return self._apply(convert)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
module._apply(fn)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
param_applied = fn(param)
File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
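As a debugging aid (not from the thread): this error usually means the GPU is held by another process or is in exclusive-compute mode, rather than the checkpoint itself being broken. A quick environment check before calling prepare() can help narrow that down:

```python
import torch

# "all CUDA-capable devices are busy or unavailable" typically points
# at the runtime environment (another process owning the GPU, or
# exclusive-compute mode), not at a corrupt checkpoint.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device 0:", torch.cuda.get_device_name(0))
```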
For the optimizer and scheduler
Currently I save with torch.save(optimizer.state_dict(), 'exp1/file.opt'), but loading it back with optimizer.load_state_dict(torch.load('exp1/file.opt')) gives the error RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable.
Does accelerator.unwrap_model() work the same way for an optimizer as it does for a model?
accelerator.wait_for_everyone()
unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), filename)
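One device-agnostic pattern (a plain-torch sketch, not from the thread; file names are illustrative) is to move every tensor in the optimizer state dict to CPU before saving, and to load with map_location="cpu", so the checkpoint is not tied to a particular GPU:

```python
import torch

# Illustrative model/optimizer; in the real script these come from
# accelerator.prepare().
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Run one step so the optimizer actually has state to save.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()

def to_cpu(obj):
    """Recursively move all tensors in a (nested) state dict to CPU."""
    if torch.is_tensor(obj):
        return obj.cpu()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_cpu(v) for v in obj]
    return obj

torch.save(to_cpu(optimizer.state_dict()), "file.opt")
# map_location="cpu" keeps the reload device-agnostic.
optimizer.load_state_dict(torch.load("file.opt", map_location="cpu"))
```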
Saving the scheduler with torch.save(scheduler.state_dict(), 'exp1/sch') and loading it with scheduler.load_state_dict(torch.load('path')) is working.
EDIT: I updated the original issue with more details and the exact error messages.
Issue Analytics
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
It’s under development in #255; we’re hoping to have it merged next week.
You should use accelerator.save everywhere and not torch.save (though I must say I have never seen that particular error). For reloading, you should be able to load a state dict into the unwrapped model or the optimizer. If you create a brand-new model, you should pass it to the prepare method again.
Note that adding a checkpointing utility to Accelerate is on the roadmap, which should make all of this easier.