Models not saved during training


Question: I ran asteroid/egs/wham/DPRNN/run.sh, but an error occurred at the end of the training process. The output is below:

~~~
sep_clean_8kmin_7101f1a8/checkpoints/_ckpt_epoch_4.ckpt as top 5
Epoch 5: 100%|██████████| 4022/4022 [28:05<00:00,  2.39it/s, loss=-11.728, v_num=0, val_loss=-11.5]
Traceback (most recent call last):
  File "train.py", line 121, in <module>
    main(arg_dic)
  File "train.py", line 92, in main
    best_path = [b for b, v in best_k.items() if v == min(best_k.values())][0]
IndexError: list index out of range
~~~

I added some debugging code to train.py and confirmed that checkpoint.best_k_models is empty (the length of checkpoint.best_k_models.items() is zero). The best_k_models.json file contains only {}.
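
For illustration, here is a minimal sketch of why the line in the traceback raises IndexError on an empty dictionary, together with one defensive way to select the best checkpoint. The helper name `pick_best_checkpoint` is hypothetical, the scores are assumed to be plain floats, and this is not necessarily the fix that was merged upstream. It only avoids the crash; it does not explain why no checkpoints were recorded.

```python
from typing import Dict, Optional


def pick_best_checkpoint(best_k: Dict[str, float]) -> Optional[str]:
    """Return the path with the lowest score, or None if nothing was recorded.

    Hypothetical helper; assumes lower is better (as with val_loss) and
    that the scores are plain floats.
    """
    if not best_k:
        # No checkpoint was recorded, so there is nothing to select.
        return None
    return min(best_k, key=best_k.get)


# The expression from train.py builds an empty list when best_k is empty
# (min() is never evaluated), so indexing [0] raises IndexError.
best_k: Dict[str, float] = {}
try:
    best_path = [b for b, v in best_k.items() if v == min(best_k.values())][0]
except IndexError:
    best_path = pick_best_checkpoint(best_k)
```

With this guard, `best_path` is `None` when no checkpoints were recorded, and the caller can skip loading the best model or report a clearer error.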

Does anyone have an idea how to fix this? Any comments are welcome.

Environment

  • Python 3.7.7

  • torch 1.5.1 (I also tried 1.3.0, with the same result)

  • pytorch-lightning 0.7.6

  • Ubuntu 18.04 on GCP

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

mpariente commented, Jul 27, 2020 (1 reaction)

Let’s keep this open until it’s merged, thanks!

mpariente commented, Aug 23, 2020 (0 reactions)

This should be fixed in master

