
Question about loading checkpoint


Before hivemind 1.0.0, I was able to resume training by calling `load_state_dict` only on the monitor peer. The code looked like this:

# monitor peer
if load_from_pretrained:
  self.model.load_state_dict(torch.load("pytorch_model.bin", map_location="cpu"), strict=False)
  ...
  self.collaborative_optimizer.load_state_dict(torch.load("optimizer.pt", map_location="cpu"))

The worker peers would then load the monitor's state after starting up.

However, in version 1.0.0 (and on master), `load_state_dict` on the monitor no longer seems to work. Am I using the wrong method, or should I load the checkpoint on the worker peers instead?
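For reference, a minimal sketch of the plain-PyTorch round trip this approach relies on (no hivemind involved; the toy model, in-memory buffers, and names here are illustrative stand-ins for the real model, `pytorch_model.bin`, and `optimizer.pt`):

```python
import io
import torch

# Toy model and optimizer standing in for the real model and
# collaborative optimizer (illustrative, not hivemind API).
model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Checkpointing side: serialize both state_dicts
# (to in-memory buffers here, instead of files on disk).
model_buf, opt_buf = io.BytesIO(), io.BytesIO()
torch.save(model.state_dict(), model_buf)
torch.save(opt.state_dict(), opt_buf)
model_buf.seek(0)
opt_buf.seek(0)

# Resuming side: what the monitor peer does before workers join.
model2 = torch.nn.Linear(4, 2)
opt2 = torch.optim.Adam(model2.parameters(), lr=1e-3)
model2.load_state_dict(torch.load(model_buf, map_location="cpu"), strict=False)
opt2.load_state_dict(torch.load(opt_buf, map_location="cpu"))

assert torch.equal(model.weight, model2.weight)
```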

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

2 reactions
finger92 commented on Jan 19, 2022

I solved this! In example/albert, the trainer peer uses an LR scheduler while the monitor peer does not, which leads to differences between the optimizer state_dicts of the two peers (the scheduler adds an `initial_lr` entry to the optimizer's param groups). After adding a non-functional scheduler to the monitor peer, it works fine.
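The mismatch is easy to reproduce in plain PyTorch: attaching any LR scheduler writes an `initial_lr` key into the optimizer's param groups, so a monitor without a scheduler ends up with a state_dict whose keys differ from the trainers'. A minimal sketch, assuming a constant `LambdaLR` is acceptable as the "non-functional" scheduler:

```python
import torch

# Two identical optimizers; only one gets a scheduler attached.
opt_plain = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)
opt_sched = torch.optim.Adam([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)

# A "non-functional" scheduler: multiplies the LR by 1.0 forever.
# Merely constructing it adds 'initial_lr' to each param group.
torch.optim.lr_scheduler.LambdaLR(opt_sched, lr_lambda=lambda epoch: 1.0)

print("initial_lr" in opt_plain.state_dict()["param_groups"][0])  # False
print("initial_lr" in opt_sched.state_dict()["param_groups"][0])  # True
```

Giving the monitor the same kind of no-op scheduler makes both peers' param-group keys match, so state averaging and downloading no longer disagree on the state_dict layout.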

By the way, I also changed the `prefix` of the state_averager in the monitor peer's code so that trainers could download state from the monitor:

self.state_averager = TrainingStateAverager(
            dht=dht,
            optimizer=opt,
            prefix=f"{experiment_prefix}_state_averager",
            state_compression=hivemind.Float16Compression(),
            bandwidth=optimizer_args.bandwidth,
            client_mode=optimizer_args.client_mode,
            start=True,
            **asdict(averager_args),
        )

0 reactions
justheuristic commented on Jan 19, 2022

Hi! Awesome work! Feel free to ping us if you encounter any more oddities 😃

We’ll incorporate your fixes into the example in the coming days (within a week or two at most) and write back to you with an update.
