[Bug] --continue_path / resuming training from an existing job does not work
🐛 Description
Attempting to pick up a previous/cancelled training session from the last checkpoint does not work as expected.
To Reproduce
Given a previous training run of tacotron2 found in coqui_tts-December-19-2021_10+40PM-0000000/, with the last checkpoint being checkpoint_100000.pth.tar, the following command should pick up from that last checkpoint and continue training:

CUDA_VISIBLE_DEVICES=1 python ~/TTS/TTS/bin/train_tts.py --continue_path coqui_tts-December-19-2021_10+40PM-0000000/

However, it instead begins a new training job within the directory specified by --continue_path, starting again at GLOBAL_STEP 0.
Output:
| > Found 13100 files in /its/home/jr586/datasets/tts/LJSpeech-1.1
Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:2.718281828459045
| > hop_length:256
| > win_length:1024
Using model: tacotron2
Using CUDA: True
Number of GPUs: 1
Model has 52676308 parameters
Number of output frames: 6
EPOCH: 0/3000
--> coqui_tts-December-19-2021_10+40PM-0000000/
DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=150, min=1) seq limits: 747
| > Batch group size: 0.
TRAINING (2021-12-24 11:57:34)
/its/home/jr586/TTS/TTS/tts/models/tacotron2.py:268: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  alignment_lengths = (
/its/home/jr586/.conda/envs/ml/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
--> STEP: 0/381 -- GLOBAL_STEP: 0
| > decoder_loss: 34.07442 (34.07442)
| > postnet_loss: 36.28318 (36.28318)
| > stopnet_loss: 1.43249 (1.43249)
| > decoder_coarse_loss: 34.07126 (34.07126)
| > decoder_ddc_loss: 0.00982 (0.00982)
| > ga_loss: 0.02193 (0.02193)
| > decoder_diff_spec_loss: 0.39038 (0.39038)
| > postnet_diff_spec_loss: 4.94285 (4.94285)
| > decoder_ssim_loss: 0.64225 (0.64225)
| > postnet_ssim_loss: 0.64165 (0.64165)
| > loss: 29.30612 (29.30612)
| > align_error: 0.94021 (0.94021)
| > grad_norm: 6.09056 (6.09056)
| > current_lr: 2.5000000000000002e-08
| > step_time: 0.58450 (0.58452)
| > loader_time: 0.34360 (0.34361)
This is a known bug
I’ve mentioned this on the Coqui Matrix chat; @WeberJulian said that this is a known bug and that --continue_path is not working as it should.
The current fix is to modify train_tts.py to set parse_command_line_args=True when creating the Trainer (line 59).
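For illustration, here is a minimal sketch of that workaround, assuming the Trainer call in train_tts.py looks roughly like the snippet below; the surrounding argument list differs between TTS versions, so only the parse_command_line_args flag is the point here.

```python
# Hypothetical excerpt from TTS/bin/train_tts.py (around line 59); argument
# names other than parse_command_line_args are illustrative and may differ.
# With parse_command_line_args=True the Trainer re-parses sys.argv itself
# and therefore picks up --continue_path.
trainer = Trainer(
    train_args,
    config,
    config.output_path,
    model=model,
    parse_command_line_args=True,  # reportedly False here, which ignores --continue_path
)
trainer.fit()
```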
Environment
- 🐸TTS Version: 0.4.2
- PyTorch Version: 1.10
- Python version: 3.9
- OS (e.g., Linux): ubuntu 20.04
- CUDA/cuDNN version: 11.3
- GPU models and configuration: NVIDIA Titan V
- How you installed PyTorch (conda, pip, source): conda
Top GitHub Comments
I can confirm this. But this has already been fixed by @WeberJulian in this PR (https://github.com/coqui-ai/TTS/commit/23d789c0722afe88f0abf3b679ee9199d877eb7a#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9) on 20.12.2021. With this commit (https://github.com/coqui-ai/TTS/commit/85418ffeaa93cda22ac0be30855f55a33b64ce13#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9), that change was overwritten by @Edresson. Maybe a merge conflict.
In my case it worked by setting parse_command_line_args to true in train_tts.py.

One more question. As I understand it, continue_path is for continuing after a crash (for example, a power failure), and restore_path is for starting a new training run from an existing model. When continue_path is used, the model is restored via the restore_path functionality (the restore_model function), and the following code always drops the learning rate back to its start value: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L449-L455. To correct this, last_epoch is set to the last global step here: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L312-L317. But that is not correct and does not work: not all schedulers derive the LR from last_epoch (torch.optim.lr_scheduler.ExponentialLR, for example, doesn’t), and if scheduler_after_epoch is used, the LR does not directly depend on the current global step at all. It seems the right way is to save the scheduler state(s) into the checkpoint (via the state_dict method) and restore them in lines L312-L317 (via load_state_dict). Is that correct?
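For reference, the suggested approach in plain PyTorch looks like the sketch below (a minimal example, not TTS code; the checkpoint keys and file name are made up for illustration). PyTorch schedulers expose state_dict()/load_state_dict(), so the LR schedule can be persisted and resumed exactly rather than reconstructed from the global step.

```python
import torch

# Toy model/optimizer/scheduler standing in for the ones the Trainer builds.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

# Saving: store the scheduler state next to the model/optimizer state.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),  # exact position in the LR schedule
    "step": 100000,
}
torch.save(checkpoint, "checkpoint_100000.pth.tar")

# Restoring (the continue_path case): load everything back so the learning
# rate continues where it stopped instead of resetting to the start value.
checkpoint = torch.load("checkpoint_100000.pth.tar", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])
```

This works regardless of whether the scheduler derives the LR from last_epoch, which is what makes it preferable to re-deriving last_epoch from the global step.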