Trouble resuming training from a checkpoint
See original GitHub issue. I am trying to train DPRNN and DPTNet on wsj0-3mix from a checkpoint, and the val_loss looks like this (the blue line is the run resumed from a checkpoint). I think the curve should be much smoother.
The scripts were copied from the wham folder. To resume from a checkpoint, I added some code to train.py like this:
# Load the checkpoint and restore the model weights manually
state_dict = torch.load(os.path.join(exp_dir, ckpt_name))
system.load_state_dict(state_dict["state_dict"])
...
# Then pass the same checkpoint path to the Trainer
trainer = pl.Trainer(..., resume_from_checkpoint=os.path.join(exp_dir, ckpt_name))
And to train DPRNN & DPTNet on wsj0-3mix, I replaced the wham dataset & dataloader code in train.py with this:
train_loader, val_loader = make_dataloaders(n_src=3, **conf["data"], **conf["training"])
No changes in conf.yml. I also tried DeepClustering on wsj0-3mix, and DPRNN & DPTNet on wham and my own dataset, and did not see this problem.
What did I miss? Any help appreciated.
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I have a pull request that should solve this issue. Feedback needed: https://github.com/mpariente/asteroid/pull/326
Definitely one of the issues, but not the main one. I changed it to the torch one.
Cool! The other question is: is the scheduler resumed when using resume_from_checkpoint?
What about this difference: I'm not sure the optimizer should be in the state_dict…