
[Bug] The model does not train with the new hyperparameters given on the command line when restarting a training with `restore_path` or `continue_path`


Describe the bug

I am trying to continue the training of a multi-speaker VITS model in Catalan on four 16 GB V100 GPUs.

I want to modify a few hyperparameters (such as the learning rate) to find the optimal configuration. When I launch the new trainings with the --restore_path argument plus the hyperparameter overrides, a new config is created with the updated values. However, during training the model does not use these new hyperparameters; it keeps the ones from the original model config.

In the “To Reproduce” section I attach the config of the original training, as well as the config, the logs, and the command line used to run the new training.

Regarding the --continue_path argument: when continuing the training from the point where it stopped, the model resets the learning rate to the one in the original config.

Since the behavior is the same in both cases (the parameters of the original config are used and the ones passed on the command line are ignored), I thought it appropriate to report both in the same issue.

To Reproduce

Original config: config.txt

New generated config: config.txt
Logs of the new training: trainer_0_log.txt

These logs show current_lr_0: 0.00050 and current_lr_1: 0.00050:

   --> STEP: 24/1620 -- GLOBAL_STEP: 170025
     | > loss_disc: 2.35827  (2.46076)
     | > loss_disc_real_0: 0.14623  (0.14530)
     | > loss_disc_real_1: 0.23082  (0.20939)
     | > loss_disc_real_2: 0.22020  (0.21913)
     | > loss_disc_real_3: 0.19430  (0.22623)
     | > loss_disc_real_4: 0.21045  (0.22390)
     | > loss_disc_real_5: 0.20165  (0.23435)
     | > loss_0: 2.35827  (2.46076)
     | > grad_norm_0: 24.36758  (16.55595)
     | > loss_gen: 2.37695  (2.37794)
     | > loss_kl: 2.56117  (2.30560)
     | > loss_feat: 9.57505  (8.38634)
     | > loss_mel: 22.84378  (22.47223)
     | > loss_duration: 1.59958  (1.55717)
     | > loss_1: 38.95654  (37.09929)
     | > grad_norm_1: 192.16046  (145.46979)
     | > current_lr_0: 0.00050 
     | > current_lr_1: 0.00050 
     | > step_time: 0.96620  (1.22051)
     | > loader_time: 0.00510  (0.00600)
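
For reference, the 0.00050 reported above matches the original config rather than the 0.0002 requested on the command line. As a quick sanity check, here is a minimal sketch (assuming the restarted run writes its config to a config.json that keeps the lr_gen and lr_disc keys; the path is hypothetical) to compare the written config against the log:

import json

# Hypothetical path to the config written by the restarted run.
with open("config.json") as f:
    cfg = json.load(f)

# The issue reports that these reflect the CLI overrides (0.0002)...
print("config lr_gen:", cfg.get("lr_gen"))
print("config lr_disc:", cfg.get("lr_disc"))
# ...while the training log still reports current_lr_0 / current_lr_1 = 0.00050.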

Below I attach the command line used to launch the new training:

export RECIPE="${RUN_DIR}/recipes/multispeaker/vits/experiments/train_vits_ca.py"
export RESTORE="${RUN_DIR}/recipes/multispeaker/vits/experiments/checkpoint_vits_170000.pth"

python -m trainer.distribute --script ${RECIPE} -gpus "0,1,2,3" \
--restore_path ${RESTORE} --coqpit.lr_gen 0.0002 --coqpit.lr_disc 0.0002 \
--coqpit.eval_batch_size 8 --coqpit.epochs 4 --coqpit.batch_size 16
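
One way to see where the old value comes from is to look inside the restored checkpoint itself. The following is only a sketch, assuming the .pth file is a plain torch.save dictionary and that the Trainer stores optimizer state under an "optimizer" entry (the exact keys may differ between versions):

import torch

# Hypothetical local copy of the checkpoint passed to --restore_path.
ckpt = torch.load("checkpoint_vits_170000.pth", map_location="cpu")
print(list(ckpt.keys()))  # inspect which entries were actually saved

# If optimizer state is present, print the learning rate(s) it carries;
# VITS trains two optimizers, so this may be a list of state dicts.
opt_state = ckpt.get("optimizer")
if opt_state is not None:
    states = opt_state if isinstance(opt_state, (list, tuple)) else [opt_state]
    for i, state in enumerate(states):
        for group in state["param_groups"]:
            print(f"optimizer {i} checkpointed lr: {group['lr']}")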

Expected behavior

No response

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB",
            "Tesla V100-SXM2-16GB"
        ],
        "available": true,
        "version": "10.2"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.9.0a0+git3d70ab0",
        "TTS": "0.6.2",
        "numpy": "1.19.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "ppc64le",
        "python": "3.7.4",
        "version": "#1 SMP Tue Sep 25 12:28:39 EDT 2018"
    }
}

Additional context

Trainer was updated to trainer==0.0.13. Please let me know if you need more information and thank you in advance.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 17 (3 by maintainers)

Top GitHub Comments

1 reaction
erogol commented, Sep 26, 2022

When you restore the model you also restore the scheduler, and it probably overrides what you define on the terminal. @loganhart420 can you check if that is the case?
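
For illustration, here is a minimal generic PyTorch sketch of the mechanism suggested above (not the Trainer’s actual restore code): loading a checkpointed optimizer state dict restores its param_groups, learning rate included, so a freshly configured lr is silently overwritten unless it is re-applied afterwards.

import torch

model = torch.nn.Linear(4, 4)

# Original run: optimizer built with lr=5e-4; its state ends up in the checkpoint.
old_opt = torch.optim.AdamW(model.parameters(), lr=5e-4)
saved_state = old_opt.state_dict()

# Restarted run: optimizer built with the new lr=2e-4 from the command line...
new_opt = torch.optim.AdamW(model.parameters(), lr=2e-4)
new_opt.load_state_dict(saved_state)   # ...but the restore brings back 5e-4
print(new_opt.param_groups[0]["lr"])   # prints 0.0005, not 0.0002

# Possible workaround (an assumption, not an official fix): re-apply the
# desired lr after restoring, before any scheduler is rebuilt on top of it.
for group in new_opt.param_groups:
    group["lr"] = 2e-4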

1 reaction
loganhart420 commented, Sep 21, 2022

Hi @loganhart420, I am a colleague of @GerrySant. In the end we restructured our data in the vctk_old format and launched some runs using train_tts.py, and we still have the same problem, i.e. stderr shows that the lr_gen and lr_disc being used are not consistent with the values coming from coqpit. This time we tried it with both v0.6.2 and v0.8.0.

Although the results are the same for all of them (the initially shared configs and the two new ones), I am attaching the input and output configs plus the log for the process launched with TTS v0.8.0.

For the command:

export RUN_DIR=./TTS_v0.8.0
module purge
source $RUN_DIR/use_venv.sh

export RECIPE=${RUN_DIR}/TTS/bin/train_tts.py
export CONFIG=${RUN_DIR}/recipes/multispeaker/config_experiments/config_mixed.json
export RESTORE=${RUN_DIR}/../TTS/recipes/multispeaker/vits/config_experiments/best_model.pth

CUDA_VISIBLE_DEVICES="0" python ${RECIPE} --config_path ${CONFIG} --restore_path ${RESTORE} \
                                          --coqpit.lr_disc 0.0001 --coqpit.lr_gen 0.0001 \
                                          --coqpit.batch_size 32

Files: trainer_0_log.txt, config_input.txt, config_output.txt

Thanks for letting me know, I’ll run the same setup and look into it.
