Unable to load from checkpoint after migrating model to different machine

Hi,

The training was running on a small GPU, so I saved the latest model at the 450th epoch (model_450.pt) and moved it to another machine. I placed the saved model at models/default/model_450.pt on the new machine. I have the same version of stylegan (1.8.1) on both machines. Now when I run the same command, it gives me the following error:

continuing from previous epoch - 450
loading from version 1.8.1
unable to load save model. please try downgrading the package to the version specified by the saved model
Traceback (most recent call last):
  File "/opt/conda/bin/stylegan2_pytorch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 187, in main
    fire.Fire(train_from_folder)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 178, in train_from_folder
    run_training(0, 1, model_args, data, load_from, new, num_train_steps, name, seed)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 52, in run_training
    model.load(load_from)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1394, in load
    raise e
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1391, in load
    self.GAN.load_state_dict(load_data['GAN'])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for StyleGAN2:
        size mismatch for G.blocks.3.to_noise1.weight: copying a param with shape torch.Size([256, 1]) from checkpoint, the shape in current model is torch.Size([512, 1]).
        size mismatch for G.blocks.3.to_noise1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
        size mismatch for G.blocks.3.conv1.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).

I would appreciate any help. I really do not want to train from scratch.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (1 by maintainers)

Top GitHub Comments

moinedgylabs commented, May 13, 2021 (1 reaction)

Thanks, @MartinKing01. I’ll give it a try.

RobertRankinTR commented, Jun 21, 2021 (0 reactions)

I get the same error, but unlike the OP I didn’t move to a different machine. In my case it is a version issue (I trained on 1.8.1, and now I can’t reload the model even with a downgraded version of the code). Did you see this issue? https://github.com/lucidrains/stylegan2-pytorch/issues/237
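
The size mismatches in the traceback (256 channels in the checkpoint, 512 in the current model) indicate that the model being constructed does not have the same layer shapes as the one that was saved. One possible explanation, which nothing in this thread confirms, is that the model is being built with different settings than the ones used for the original run. A minimal sketch for listing every mismatched parameter before calling load_state_dict is shown below. The helper name find_shape_mismatches is hypothetical, and the stand-in objects only mimic tensors so the example runs without PyTorch installed.

```python
from types import SimpleNamespace


def find_shape_mismatches(checkpoint_state, model_state):
    """Return {name: (checkpoint_shape, model_shape)} for parameters whose shapes differ.

    Both arguments are mappings from parameter name to any object with a
    `.shape` attribute (for example, the tensors in a PyTorch state_dict).
    This helper is hypothetical and not part of stylegan2_pytorch.
    """
    mismatches = {}
    for name, saved in checkpoint_state.items():
        current = model_state.get(name)
        if current is None:
            continue  # key missing from the current model; not a shape mismatch
        saved_shape, current_shape = tuple(saved.shape), tuple(current.shape)
        if saved_shape != current_shape:
            mismatches[name] = (saved_shape, current_shape)
    return mismatches


# Stand-in objects carrying two of the shapes reported in the traceback.
checkpoint_state = {
    "G.blocks.3.to_noise1.weight": SimpleNamespace(shape=(256, 1)),
    "G.blocks.3.to_noise1.bias": SimpleNamespace(shape=(256,)),
}
model_state = {
    "G.blocks.3.to_noise1.weight": SimpleNamespace(shape=(512, 1)),
    "G.blocks.3.to_noise1.bias": SimpleNamespace(shape=(512,)),
}

for name, (saved, current) in find_shape_mismatches(checkpoint_state, model_state).items():
    print(f"{name}: checkpoint {saved}, current model {current}")
```

With PyTorch available, the first argument could be torch.load("models/default/model_450.pt", map_location="cpu")["GAN"] (the traceback shows the weights stored under the 'GAN' key), and the second the state_dict() of the freshly constructed model. If every mismatch differs by the same factor, that points toward a construction-time setting rather than a corrupted file.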
