
[Bug] --continue-path / resuming training from an existing job does not work


🐛 Description

Resuming a previous or cancelled training session from its last checkpoint does not work as expected.

To Reproduce

Given a previous Tacotron2 training run found in coqui_tts-December-19-2021_10+40PM-0000000/, with the last checkpoint being checkpoint_100000.pth.tar, the following command should be expected to pick up from that last checkpoint and continue training:

CUDA_VISIBLE_DEVICES=1 python ~/TTS/TTS/bin/train_tts.py --continue_path coqui_tts-December-19-2021_10+40PM-0000000/

However, it instead begins a new training job within the directory specified by --continue_path, starting at GLOBAL_STEP 0.

Output:

| > Found 13100 files in /its/home/jr586/datasets/tts/LJSpeech-1.1

Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:2.718281828459045
| > hop_length:256
| > win_length:1024
Using model: tacotron2
Using CUDA: True
Number of GPUs: 1

Model has 52676308 parameters

Number of output frames: 6

EPOCH: 0/3000
--> coqui_tts-December-19-2021_10+40PM-0000000/

DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=150, min=1) seq limits: 747
| > Batch group size: 0.

TRAINING (2021-12-24 11:57:34)
/its/home/jr586/TTS/TTS/tts/models/tacotron2.py:268: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
alignment_lengths = (
/its/home/jr586/.conda/envs/ml/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/TensorShape.cpp:2157.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
--> STEP: 0/381 -- GLOBAL_STEP: 0
| > decoder_loss: 34.07442 (34.07442)
| > postnet_loss: 36.28318 (36.28318)
| > stopnet_loss: 1.43249 (1.43249)
| > decoder_coarse_loss: 34.07126 (34.07126)
| > decoder_ddc_loss: 0.00982 (0.00982)
| > ga_loss: 0.02193 (0.02193)
| > decoder_diff_spec_loss: 0.39038 (0.39038)
| > postnet_diff_spec_loss: 4.94285 (4.94285)
| > decoder_ssim_loss: 0.64225 (0.64225)
| > postnet_ssim_loss: 0.64165 (0.64165)
| > loss: 29.30612 (29.30612)
| > align_error: 0.94021 (0.94021)
| > grad_norm: 6.09056 (6.09056)
| > current_lr: 2.5000000000000002e-08
| > step_time: 0.58450 (0.58452)
| > loader_time: 0.34360 (0.34361)

This is a known bug

I’ve mentioned this on the Coqui Matrix chat, and @WeberJulian has said that this is a known bug: --continue_path is not working as it should. The current workaround is to modify train_tts.py to pass parse_command_line_args=True when creating the Trainer (line 59).
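For reference, a minimal sketch of that workaround, assuming the Trainer is constructed in main() of train_tts.py roughly as below; everything other than the parse_command_line_args flag is illustrative and may not match the exact upstream call:

# TTS/bin/train_tts.py (around line 59) -- sketch of the workaround only.
trainer = Trainer(
    train_args,                    # TrainingArgs parsed earlier in main() (illustrative)
    config,                        # the loaded training config (illustrative)
    config.output_path,
    model=model,
    parse_command_line_args=True,  # was False; True lets the Trainer pick up --continue_path
)
trainer.fit()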

Environment

  • 🐸TTS Version: 0.4.2
  • PyTorch Version: 1.10
  • Python version: 3.9
  • OS (e.g., Linux): ubuntu 20.04
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: NVIDIA Titan V
  • How you installed PyTorch (conda, pip, source): conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
thorstenMueller commented, Jan 24, 2022

I can confirm this. However, it had already been fixed by @WeberJulian in this PR (https://github.com/coqui-ai/TTS/commit/23d789c0722afe88f0abf3b679ee9199d877eb7a#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9) on 20.12.2021. That change was then overwritten by @Edresson in this commit (https://github.com/coqui-ai/TTS/commit/85418ffeaa93cda22ac0be30855f55a33b64ce13#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9), possibly due to a merge conflict.

In my case, setting parse_command_line_args to True in train_tts.py made it work.

1 reaction
r7sa commented, Feb 11, 2022

One more question. As I understand it, continue_path is meant for resuming after a crash (for example, a power failure), while restore_path is for starting a new training run from an existing model. When continue_path is used, the model is restored through the restore_path functionality (the restore_model function), and the code at https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L449-L455 always resets the learning rate to its starting value. To compensate, https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L312-L317 sets the scheduler’s .last_epoch to the last global step, but that is not correct and does not work: not every scheduler derives the LR from last_epoch (torch.optim.lr_scheduler.ExponentialLR, for example, does not), and when scheduler_after_epoch is used the LR does not directly depend on the current global step at all. It seems the right way is to save the scheduler(s) state into the checkpoint (via the state_dict method) and restore it in lines L312-L317 (via load_state_dict). Is that correct?
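For what it’s worth, here is a minimal sketch of that idea using the standard PyTorch scheduler API (state_dict / load_state_dict). The model, optimizer, checkpoint layout, and file name below are illustrative and do not reflect Coqui TTS’s actual checkpoint format:

import torch

# Illustrative setup: a tiny model, an optimizer and an ExponentialLR scheduler.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
checkpoint_path = "checkpoint_100000.pth.tar"  # file name borrowed from the report above
global_step = 100_000

# Saving: persist the scheduler state alongside the model and optimizer.
torch.save(
    {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
        "step": global_step,
    },
    checkpoint_path,
)

# Continuing: restore the scheduler instead of only setting .last_epoch,
# so the learning-rate schedule resumes exactly where it left off.
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])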
