Unable to load from checkpoint after migrating model to different machine
Hi,
Training was running on a machine with a small GPU, so I saved the latest model at the 450th epoch (the model_450.pt file) and moved it to another machine. I placed the saved model at models/default/model_450.pt
on the new machine. I have the same version of stylegan2_pytorch (1.8.1) on both machines.
Now when I run the same command, it gives me the following error:
continuing from previous epoch - 450
loading from version 1.8.1
unable to load save model. please try downgrading the package to the version specified by the saved model
Traceback (most recent call last):
  File "/opt/conda/bin/stylegan2_pytorch", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 187, in main
    fire.Fire(train_from_folder)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File "/opt/conda/lib/python3.7/site-packages/fire/core.py", line 681, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 178, in train_from_folder
    run_training(0, 1, model_args, data, load_from, new, num_train_steps, name, seed)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/cli.py", line 52, in run_training
    model.load(load_from)
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1394, in load
    raise e
  File "/opt/conda/lib/python3.7/site-packages/stylegan2_pytorch/stylegan2_pytorch.py", line 1391, in load
    self.GAN.load_state_dict(load_data['GAN'])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for StyleGAN2:
    size mismatch for G.blocks.3.to_noise1.weight: copying a param with shape torch.Size([256, 1]) from checkpoint, the shape in current model is torch.Size([512, 1]).
    size mismatch for G.blocks.3.to_noise1.bias: copying a param with shape torch.Size([256]) from checkpoint, the shape in current model is torch.Size([512]).
    size mismatch for G.blocks.3.conv1.weight: copying a param with shape torch.Size([256, 512, 3, 3]) from checkpoint, the shape in current model is torch.Size([512, 512, 3, 3]).
I would appreciate any help. I really do not want to train from scratch.
Issue Analytics
- State:
- Created 2 years ago
- Comments:6 (1 by maintainers)
Top GitHub Comments
Thanks, @MartinKing01. I’ll give it a try.
I get the same error, but unlike the OP I didn’t move to a different machine. In my case it is a version issue (I trained on 1.8.1, and now I can’t reload the model even with a downgraded version of the code). Did you see this issue? https://github.com/lucidrains/stylegan2-pytorch/issues/237
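For anyone hitting the same size-mismatch errors: the shapes in the traceback (256 in the checkpoint vs 512 in the current model) suggest that the model built on the new machine has a different architecture than the one that was saved, which could happen if the training command or its settings differ between the two runs. That is only a guess from the error text, not a confirmed diagnosis. A minimal sketch for listing every mismatched parameter is below. The helper name diff_shapes and the variable gan are hypothetical and not part of stylegan2_pytorch; the 'GAN' key comes from the load_data['GAN'] line in the traceback.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def diff_shapes(checkpoint: Dict[str, Shape], model: Dict[str, Shape]) -> List[str]:
    """Return one readable line per parameter whose shape or presence differs."""
    report = []
    for name in sorted(set(checkpoint) | set(model)):
        if name not in model:
            report.append(f"{name}: only in checkpoint {checkpoint[name]}")
        elif name not in checkpoint:
            report.append(f"{name}: only in model {model[name]}")
        elif checkpoint[name] != model[name]:
            report.append(f"{name}: checkpoint {checkpoint[name]} vs model {model[name]}")
    return report


# With PyTorch, the two dicts could be built roughly like this (assumption, not run here):
#   data = torch.load("models/default/model_450.pt", map_location="cpu")
#   checkpoint_shapes = {k: tuple(v.shape) for k, v in data["GAN"].items()}
#   model_shapes = {k: tuple(v.shape) for k, v in gan.state_dict().items()}
# where `gan` is a hypothetical, freshly constructed model instance.

# Shapes copied from the traceback in the original post.
checkpoint_shapes = {
    "G.blocks.3.to_noise1.weight": (256, 1),
    "G.blocks.3.to_noise1.bias": (256,),
    "G.blocks.3.conv1.weight": (256, 512, 3, 3),
}
model_shapes = {
    "G.blocks.3.to_noise1.weight": (512, 1),
    "G.blocks.3.to_noise1.bias": (512,),
    "G.blocks.3.conv1.weight": (512, 512, 3, 3),
}

for line in diff_shapes(checkpoint_shapes, model_shapes):
    print(line)
```

If the full list shows that whole blocks differ in width, comparing the exact command and options used on both machines would be the first thing to check.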