
Question about loading checkpoint


Before hivemind 1.0.0, I was able to resume training by calling `load_state_dict` only on the monitor peer. The code looked like this:

# monitor peer
if load_from_pretrained:
  self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
  ...
  self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))

The worker peers would then load the monitor's state after starting up.

However, in version 1.0.0 (and on master), `load_state_dict` on the monitor no longer seems to work. Am I using the wrong method, or should I load the checkpoint on the worker peers instead?
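For reference, a minimal sketch of the plain-PyTorch round trip this approach relies on (no hivemind involved; the toy model, in-memory buffers, and names here are illustrative stand-ins for the real model, `pytorch_model.bin`, and `optimizer.pt`):

```python
import io
import torch

# Toy model and optimizer standing in for the real model and
# collaborative optimizer (illustrative, not hivemind API).
model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Checkpointing side: serialize both state_dicts
# (to in-memory buffers here, instead of files on disk).
model_buf, opt_buf = io.BytesIO(), io.BytesIO()
torch.save(model.state_dict(), model_buf)
torch.save(opt.state_dict(), opt_buf)
model_buf.seek(0)
opt_buf.seek(0)

# Resuming side: what the monitor peer does before workers join.
model2 = torch.nn.Linear(4, 2)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
model2.load_state_dict(torch.load(model_buf, map_location="cpu"), strict=False)
opt2.load_state_dict(torch.load(opt_buf, map_location="cpu"))

assert torch.equal(model.weight, model2.weight)
```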

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
finger92 commented on Jan 19, 2022

I solved this! In example/albert, the trainer peer uses an LR scheduler while the monitor peer does not, which leads to differences between the optimizer state_dicts of the two peers (the scheduler adds an `initial_lr` entry to the optimizer's param groups). After adding a non-functional scheduler to the monitor peer, it works fine.
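The mismatch is easy to reproduce in plain PyTorch: attaching any LR scheduler writes an `initial_lr` key into the optimizer's param groups, so a monitor without a scheduler ends up with a state_dict whose keys differ from the trainers'. A minimal sketch, assuming a constant `LambdaLR` is acceptable as the "non-functional" scheduler:

```python
import torch

# Two identical optimizers; only one gets a scheduler attached.
opt_plain = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
opt_sched = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)

# A "non-functional" scheduler: multiplies the LR by 1.0 forever.
# Merely constructing it adds 'initial_lr' to each param group.
torch.optim.lr_scheduler.LambdaLR(opt_sched, lr_lambda=lambda epoch: 1.0)

print("initial_lr" in opt_plain.state_dict()["param_groups"][0])  # False
print("initial_lr" in opt_sched.state_dict()["param_groups"][0])  # True
```

Giving the monitor the same kind of no-op scheduler makes both peers' param-group keys match, so state averaging and downloading no longer disagree on the state_dict layout.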

By the way, I also changed the `prefix` of the state_averager in the monitor peer's code so that trainers could download state from the monitor:

self.state_averager = TrainingStateAverager(
            dht=dht,
            optimizer=opt,
            prefix=f"{experiment_prefix}_state_averager",
            state_compression=hivemind.Float16Compression(),
            bandwidth=optimizer_args.bandwidth,
            client_mode=optimizer_args.client_mode,
            start=True,
            **asdict(averager_args),
        )

0 reactions
justheuristic commented on Jan 19, 2022

Hi! Awesome work! Feel free to ping us if you encounter any more oddities 😃

We’ll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.
