Question about loading checkpoint
I was able to resume training by calling load_state_dict only in the monitor peer before hivemind version 1.0.0. The code looks like this:
# monitor peer
if load_from_pretrained:
    self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
    ...
    self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))
The other peers would then load the monitor's state after startup.
However, in version 1.0.0 or the current master code, calling load_state_dict in the monitor no longer seems to work. Am I using the wrong method, or should I load the checkpoint on the worker peers instead?
Top GitHub Comments
I solved this! In examples/albert, the trainer peer uses a scheduler while the monitor peer does not, which results in some differences between the optimizer state_dicts of the two peers (the scheduler adds an 'initial_lr' entry to the optimizer's param groups). After adding a non-functional scheduler to the monitor peer, it works fine.
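For reference, here is a minimal sketch of that fix, assuming the monitor wraps a plain torch optimizer (the model, optimizer, and scheduler choice below are stand-ins rather than the actual example code):
import torch

# stand-ins for the monitor's real model and optimizer
model = torch.nn.Linear(16, 16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# A "non-functional" scheduler: a constant LambdaLR never changes the learning
# rate, but constructing it still adds 'initial_lr' to every param group, so the
# monitor's optimizer.state_dict() gets the same layout as on the trainer peers.
noop_scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda _: 1.0)

assert all("initial_lr" in group for group in opt.state_dict()["param_groups"])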
By the way, I also changed the 'prefix' of the state_averager in the monitor peer's code so that the trainers could download the state from the monitor.
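And a rough sketch of that prefix alignment (the naming convention below is an assumption, not taken from the hivemind source): the trainers derive the DHT prefix of their state averager from the run id, so the monitor's averager has to announce its state under the same prefix, otherwise the trainers never find it.
# monitor peer (hypothetical sketch; the names and the assumed convention are not verified)
run_id = "albert_experiment"
averager_prefix = f"{run_id}_state_averager"  # assumed to match what the trainer peers look up
# pass this prefix when constructing the monitor's state averager, e.g.
# state_averager = TrainingStateAverager(dht=dht, optimizer=opt, prefix=averager_prefix, ...)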
Hi! Awesome work! Feel free to ping us if you encounter any more oddities 😃
We’ll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.