
Trouble resuming training from a checkpoint (ckpt)

See original GitHub issue

I am trying to train DPRNN and DPTNet on wsj0-3mix from a checkpoint, and the val_loss looks like this (the blue line is the run resumed from a checkpoint). I would expect a much smoother curve. [screenshot: val_loss curves]

The scripts were copied from the wham folder. To resume from a checkpoint, I added some code to train.py like this:

import os
import torch
import pytorch_lightning as pl

# Restore the model weights from the checkpoint.
state_dict = torch.load(os.path.join(exp_dir, ckpt_name))
system.load_state_dict(state_dict['state_dict'])
...
trainer = pl.Trainer(..., resume_from_checkpoint=os.path.join(exp_dir, ckpt_name))
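
As far as I understand, resume_from_checkpoint should itself restore the model weights along with the epoch, optimizer, and scheduler state once fit() is called, so the manual load_state_dict call above may be redundant. A minimal sketch of the resume path (assuming the system, exp_dir, ckpt_name, and dataloaders from this script):

import os
import pytorch_lightning as pl

# Lightning restores weights, epoch counter, optimizer, and scheduler state
# from the checkpoint when fit() starts; no manual loading should be needed.
trainer = pl.Trainer(resume_from_checkpoint=os.path.join(exp_dir, ckpt_name))
trainer.fit(system, train_dataloader=train_loader, val_dataloaders=val_loader)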

And to train DPRNN and DPTNet on wsj0-3mix, I replaced the wham dataset and dataloader code in train.py with this:

train_loader, val_loader = make_dataloaders(n_src=3, **conf["data"], **conf["training"])

There is no change in conf.yml. I also tried deep clustering on wsj0-3mix, and DPRNN and DPTNet on wham and on my own dataset, and did not see this problem.

What did I miss? Any help would be appreciated.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 15 (3 by maintainers)

Top GitHub Comments

2 reactions
popcornell commented, Nov 10, 2020

I have a pull request that should solve this issue and would like feedback: https://github.com/mpariente/asteroid/pull/326

> What about this difference:

Definitely one of the issues, but not the main one. I changed it to the torch one.

0 reactions
mpariente commented, Nov 10, 2020

Cool! The other question is: is the scheduler resumed when using resume_from_checkpoint?
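
One way to check (a sketch; Lightning checkpoints from this era store scheduler state under an 'lr_schedulers' key, alongside 'optimizer_states' and the model 'state_dict'):

import torch

ckpt = torch.load("path/to/checkpoint.ckpt", map_location="cpu")
# If 'lr_schedulers' holds the expected state, the scheduler is being saved
# and can be resumed; an empty or stale entry would explain the loss jump.
print(ckpt.keys())
print(ckpt.get("lr_schedulers"))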

What about this difference:

# In torch: the optimizer is excluded from the scheduler's state_dict
def state_dict(self):
    return {key: value for key, value in self.__dict__.items() if key != 'optimizer'}

# In asteroid: everything in __dict__, including the optimizer, is serialized
def state_dict(self):
    return {key: value for key, value in self.__dict__.items()}

I’m not sure the optimizer should be in the state_dict…
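
A minimal sketch of the fix being discussed, i.e. aligning the scheduler's state_dict with torch's _LRScheduler by excluding the optimizer (the class and attribute names here are illustrative, not asteroid's actual API):

class Scheduler:
    def __init__(self, optimizer):
        self.optimizer = optimizer  # re-supplied at construction, not serialized
        self.step_num = 0           # illustrative scheduler state

    def state_dict(self):
        # Serialize everything except the optimizer reference, as torch does.
        return {key: value for key, value in self.__dict__.items() if key != "optimizer"}

    def load_state_dict(self, state_dict):
        # Restore scheduler state; the optimizer comes from __init__.
        self.__dict__.update(state_dict)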

Read more comments on GitHub.

