Trouble resuming training from a checkpoint
See original GitHub issue. I am trying to train DPRNN and DPTNet on wsj0-3mix from a checkpoint, and the val_loss looks like this (the blue line is the run resumed from a checkpoint). I think the curve should be much smoother.
The scripts were copied from the wham folder. To resume from a checkpoint, I added some code to train.py like this:
# Load the checkpoint and restore the model weights manually
state_dict = torch.load(os.path.join(exp_dir, ckpt_name))
system.load_state_dict(state_dict["state_dict"])
...
# Then pass the same checkpoint path to the Trainer
trainer = pl.Trainer(..., resume_from_checkpoint=os.path.join(exp_dir, ckpt_name))
And to train DPRNN & DPTNet on wsj0-3mix, I replaced the wham dataset & dataloader code in train.py with this:
train_loader, val_loader = make_dataloaders(n_src=3, **conf["data"], **conf["training"])
No changes in conf.yml. I also tried DeepClustering on wsj0-3mix, and DPRNN & DPTNet on wham and my own dataset, and did not see this problem.
What did I miss? Any help appreciated.
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I have a pull request that should solve this issue. Feedback needed: https://github.com/mpariente/asteroid/pull/326
Definitely one of the issues, but not the main one. I changed it to the torch one.
Cool! The other question is: is the scheduler resumed when using resume_from_checkpoint?
What about this difference: I'm not sure the optimizer should be in the state_dict…