[Bug] --continue_path / resuming training from an existing job does not work
🐛 Description
Attempting to pick up a previous/cancelled training session from the last checkpoint does not work as expected.
To Reproduce
Given a previous training run of tacotron2 found in coqui_tts-December-19-2021_10+40PM-0000000/, with the last checkpoint being checkpoint_100000.pth.tar, the following command should pick up from that last checkpoint and continue training:

CUDA_VISIBLE_DEVICES=1 python ~/TTS/TTS/bin/train_tts.py --continue_path coqui_tts-December-19-2021_10+40PM-0000000/

However, it instead begins a new training job within the directory specified by --continue_path, starting again at GLOBAL_STEP 0.
Output:
| > Found 13100 files in /its/home/jr586/datasets/tts/LJSpeech-1.1
Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:False
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:2.718281828459045
| > hop_length:256
| > win_length:1024
Using model: tacotron2
Using CUDA: True
Number of GPUs: 1
Model has 52676308 parameters
Number of output frames: 6
EPOCH: 0/3000
--> coqui_tts-December-19-2021_10+40PM-0000000/
DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=150, min=1) seq limits: 747
| > Batch group size: 0.
TRAINING (2021-12-24 11:57:34)
/its/home/jr586/TTS/TTS/tts/models/tacotron2.py:268: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  alignment_lengths = (
/its/home/jr586/.conda/envs/ml/lib/python3.9/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1634272204863/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
--> STEP: 0/381 -- GLOBAL_STEP: 0
| > decoder_loss: 34.07442 (34.07442)
| > postnet_loss: 36.28318 (36.28318)
| > stopnet_loss: 1.43249 (1.43249)
| > decoder_coarse_loss: 34.07126 (34.07126)
| > decoder_ddc_loss: 0.00982 (0.00982)
| > ga_loss: 0.02193 (0.02193)
| > decoder_diff_spec_loss: 0.39038 (0.39038)
| > postnet_diff_spec_loss: 4.94285 (4.94285)
| > decoder_ssim_loss: 0.64225 (0.64225)
| > postnet_ssim_loss: 0.64165 (0.64165)
| > loss: 29.30612 (29.30612)
| > align_error: 0.94021 (0.94021)
| > grad_norm: 6.09056 (6.09056)
| > current_lr: 2.5000000000000002e-08
| > step_time: 0.58450 (0.58452)
| > loader_time: 0.34360 (0.34361)
This is a known bug
I’ve mentioned this on the Coqui Matrix chat; @WeberJulian said that this is a known bug and that --continue_path is not working as it should.
The current fix is to modify train_tts.py to set parse_command_line_args=True when creating the Trainer (line 59).
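For illustration, here is a minimal sketch of that workaround, assuming the Trainer call in train_tts.py looks roughly like the snippet below; the surrounding argument list differs between TTS versions, so only the parse_command_line_args flag is the point here.

```python
# Hypothetical excerpt from TTS/bin/train_tts.py (around line 59); argument
# names other than parse_command_line_args are illustrative and may differ.
# With parse_command_line_args=True the Trainer re-parses sys.argv itself
# and therefore picks up --continue_path.
trainer = Trainer(
    train_args,
    config,
    config.output_path,
    model=model,
    parse_command_line_args=True,  # reportedly False here, which ignores --continue_path
)
trainer.fit()
```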
Environment
- 🐸TTS Version: 0.4.2
- PyTorch Version: 1.10
- Python version: 3.9
- OS (e.g., Linux): ubuntu 20.04
- CUDA/cuDNN version: 11.3
- GPU models and configuration: NVIDIA Titan V
- How you installed PyTorch (conda, pip, source): conda
Top GitHub Comments
I can confirm this. But this has already been fixed by @WeberJulian in this PR (https://github.com/coqui-ai/TTS/commit/23d789c0722afe88f0abf3b679ee9199d877eb7a#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9) on 20.12.2021. With this commit (https://github.com/coqui-ai/TTS/commit/85418ffeaa93cda22ac0be30855f55a33b64ce13#diff-18ac0d5b5b29ace6dde3a4dc7a18ef3822992377781e2e32c058dc4270d7a1c9), that change was overwritten by @Edresson. Maybe a merge conflict.
In my case it worked by setting parse_command_line_args to true in train_tts.py.

One more question. As I understand it, continue_path is for continuing after a crash (for example, a power failure), and restore_path is for starting a new training run from an existing model. When continue_path is used, the model is restored via the restore_path functionality (the restore_model function), and the following code always drops the learning rate back to its start value: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L449-L455. To correct this, last_epoch is set to the last global step here: https://github.com/coqui-ai/TTS/blob/main/TTS/trainer.py#L312-L317. But that is not correct and does not work: not all schedulers derive the LR from last_epoch (torch.optim.lr_scheduler.ExponentialLR, for example, doesn’t), and if scheduler_after_epoch is used, the LR does not directly depend on the current global step at all. It seems the right way is to save the scheduler state(s) into the checkpoint (via the state_dict method) and restore them in lines L312-L317 (via load_state_dict). Is that correct?
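For reference, the suggested approach in plain PyTorch looks like the sketch below (a minimal example, not TTS code; the checkpoint keys and file name are made up for illustration). PyTorch schedulers expose state_dict()/load_state_dict(), so the LR schedule can be persisted and resumed exactly rather than reconstructed from the global step.

```python
import torch

# Toy model/optimizer/scheduler standing in for the ones the Trainer builds.
model = torch.nn.Linear(80, 80)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)

# Saving: store the scheduler state next to the model/optimizer state.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),  # exact position in the LR schedule
    "step": 100000,
}
torch.save(checkpoint, "checkpoint_100000.pth.tar")

# Restoring (the continue_path case): load everything back so the learning
# rate continues where it stopped instead of resetting to the start value.
checkpoint = torch.load("checkpoint_100000.pth.tar", map_location="cpu")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])
scheduler.load_state_dict(checkpoint["scheduler"])
```

This works regardless of whether the scheduler derives the LR from last_epoch, which is what makes it preferable to re-deriving last_epoch from the global step.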