
[Bug] Different behaviour of training HifiGan depending on number of GPUs used

See original GitHub issue

Describe the bug
Running HiFi-GAN training through distribute.py shows different stats from running it through train_vocoder.py.

To Reproduce
Steps to reproduce the behavior:

  1. Run on a single GPU: CUDA_VISIBLE_DEVICES=0 python …/…/TTS/TTS/bin/distribute.py --script train_hifigan.py
  2. Observe output:
--> STEP: 25/349 -- GLOBAL_STEP: 9475
     | > G_l1_spec_loss: 0.28888  (0.28544)
     | > G_gen_loss: 12.99941  (12.84473)
     | > G_adv_loss: 0.00000  (0.00000)
     | > loss_0: 12.99941  (12.84473)
     | > grad_norm_0: 94.30904  (85.09504)
     | > current_lr_0: 0.00048 
     | > current_lr_1: 0.00100 
     | > step_time: 0.29350  (0.29282)
     | > loader_time: 0.00140  (0.00138)
  3. Run train_vocoder.py directly: CUDA_VISIBLE_DEVICES=0 python …/…/TTS/TTS/bin/train_vocoder.py --config_path config.json
  4. Observe output:
 --> STEP: 150/699 -- GLOBAL_STEP: 150
     | > G_l1_spec_loss: 0.65889  (0.96479)
     | > G_mse_fake_loss: 0.35978  (0.37755)
     | > G_feat_match_loss: 0.02501  (0.01835)
     | > G_gen_loss: 29.65002  (43.41537)
     | > G_adv_loss: 3.06036  (2.35913)
     | > loss_0: 32.71038  (45.77450)
     | > grad_norm_0: 0.00000  (0.00000)
     | > D_mse_gan_loss: 0.46084  (0.54663)
     | > D_mse_gan_real_loss: 0.10781  (0.08349)
     | > D_mse_gan_fake_loss: 0.02431  (0.06644)
     | > loss_1: 0.46084  (0.54663)
     | > grad_norm_1: 0.00000  (0.00000)
     | > current_lr_0: 0.00086 
     | > current_lr_1: 0.00086 
     | > step_time: 1.24380  (1.24340)
     | > loader_time: 0.00150  (0.00167)
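
Note the asymmetry between the two logs: the distribute.py run reports only generator terms, with G_adv_loss pinned at 0.00000 and no discriminator losses at all even at GLOBAL_STEP 9475, while the direct train_vocoder.py run shows the full set of adversarial and discriminator losses from step 150 onward. In GAN vocoder training the adversarial term only contributes once the discriminator is actually training; below is a minimal sketch of that gating pattern, assuming a steps_to_start_discriminator-style config threshold (the field name is borrowed from Coqui's vocoder configs, but the exact handling in this TTS version is an assumption):

# Illustrative sketch only: gate the generator's adversarial loss until the
# discriminator starts training. Not the actual Coqui TTS implementation.
def generator_total_loss(spec_loss, adv_loss, feat_match_loss, global_step, config):
    if global_step < config.steps_to_start_discriminator:
        # The generator trains on spectrogram reconstruction alone, so the
        # logs would show G_adv_loss: 0.00000 and no D_* entries.
        return spec_loss
    return spec_loss + adv_loss + feat_match_loss

An adversarial loss still at zero after 9k+ steps suggests the discriminator never engages in the distributed launch mode, which is the substance of this bug.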


Expected behavior
Both launch methods should train equivalently and report comparable statistics.

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • PyTorch or TensorFlow version (use command below): PyTorch 1.10.0
  • Python version: 3.8.11
  • CUDA/cuDNN version: py3.8_cuda11.3_cudnn8.2.0_0
  • GPU model and memory: 2x RTX 3090
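The template's "use command below" snippet didn't survive the copy; a standard one-liner that reports the same PyTorch/CUDA/cuDNN versions is:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.backends.cudnn.version())"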

Additional context
TTS version 4.0.2-dev

Here’s train_hifigan.py:

import os

from TTS.trainer import Trainer, TrainingArgs
from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN

output_path = os.path.dirname(os.path.abspath(__file__))

config = HifiganConfig(
    batch_size=64,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=5,
    epochs=1000,
    seq_len=8192,
    pad_short=2000,
    use_noise_augment=True,
    eval_split_size=10,
    print_step=25,
    print_eval=False,
    mixed_precision=False,
    lr_gen=1e-3,
    lr_disc=1e-3,
    data_path=os.path.join(output_path, "../datasets/vctk_all_wavs"),
    output_path=output_path,
)

# init audio processor
ap = AudioProcessor(**config.audio.to_dict())

# load training samples
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)


# init model
model = GAN(config)

# init the trainer and 🚀
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)
trainer.fit()
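
For reference, the two launch modes being compared are: running a script like this directly for single-GPU training, and wrapping it with distribute.py (step 1 above), which spawns one process per visible GPU and forwards --use_ddp/--rank flags to each process (visible in the argv echoed in the comments below).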

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 13 (2 by maintainers)

Top GitHub Comments

1 reaction
skol101 commented, Nov 30, 2021

UPDATE: actually, all that matters for getting a different output is the number of GPUs. With this command:

 CUDA_VISIBLE_DEVICES="0" python ../../TTS/TTS/bin/distribute.py --script ../../TTS/TTS/bin/train_vocoder.py --config_path config.json
['../../TTS/TTS/bin/train_vocoder.py', '--continue_path=', '--restore_path=', '--config_path=config.json', '--group_id=group_2021_11_30-141550', '--use_ddp=true', '--rank=0']

the model trains as expected:

 --> STEP: 225/699 -- GLOBAL_STEP: 225
     | > G_l1_spec_loss: 0.51882  (0.84050)
     | > G_mse_fake_loss: 0.31469  (0.38687)
     | > G_feat_match_loss: 0.01629  (0.01684)
     | > G_gen_loss: 23.34689  (37.82264)
     | > G_adv_loss: 2.07442  (2.20507)
     | > loss_0: 25.42131  (40.02770)
     | > grad_norm_0: 0.00000  (0.00000)
     | > D_mse_gan_loss: 0.45587  (0.55319)
     | > D_mse_gan_real_loss: 0.06953  (0.09215)
     | > D_mse_gan_fake_loss: 0.05403  (0.07818)
     | > loss_1: 0.45587  (0.55319)
     | > grad_norm_1: 0.00000  (0.00000)
     | > current_lr_0: 0.00080 
     | > current_lr_1: 0.00080 
     | > step_time: 1.23120  (1.23022)
     | > loader_time: 0.00170  (0.00163)
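
The argv echoed above shows exactly what distribute.py forwards to the wrapped script; Coqui's Trainer consumes these flags internally. Purely as an illustrative sketch (this is not the actual Trainer code), parsing them would look roughly like:

import argparse

# Flags mirrored from the echoed argv; distribute.py assigns one rank per GPU.
parser = argparse.ArgumentParser()
parser.add_argument("--continue_path", type=str, default="")
parser.add_argument("--restore_path", type=str, default="")
parser.add_argument("--config_path", type=str, default="")
parser.add_argument("--group_id", type=str, default="")
parser.add_argument("--use_ddp", type=str, default="false")  # arrives as the string "true"
parser.add_argument("--rank", type=int, default=0)
args, _ = parser.parse_known_args()
use_ddp = args.use_ddp.lower() == "true"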

0 reactions
skol101 commented, Dec 1, 2021

Thank you; indeed, the training started working on 2 GPUs, but there's no improvement for me after 8k steps.

Here’s step 6k:

 --> STEP: 325/349 -- GLOBAL_STEP: 5925
     | > G_l1_spec_loss: 0.35088  (0.33620)
     | > G_mse_fake_loss: 0.39742  (0.38448)
     | > G_feat_match_loss: 0.05728  (0.05410)
     | > G_gen_loss: 15.78973  (15.12918)
     | > G_adv_loss: 6.58330  (6.22701)
     | > loss_0: 22.37303  (21.35619)
     | > grad_norm_0: 0.00000  (0.00000)
     | > D_mse_gan_loss: 0.38689  (0.39288)
     | > D_mse_gan_real_loss: 0.04797  (0.04380)
     | > D_mse_gan_fake_loss: 0.03050  (0.03773)
     | > loss_1: 0.38689  (0.39288)
     | > grad_norm_1: 0.00000  (0.00000)
     | > current_lr_0: 2.6612350037403064e-06 
     | > current_lr_1: 2.6612350037403064e-06 
     | > step_time: 1.21690  (1.21557)
     | > loader_time: 0.00150  (0.00175)


 > EVALUATION 


  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.00036 (+0.00003)
     | > avg_G_l1_spec_loss: 0.32643 (-0.00004)
     | > avg_G_mse_fake_loss: 0.36232 (+0.01273)
     | > avg_G_feat_match_loss: 0.05402 (+0.00055)
     | > avg_G_gen_loss: 14.68919 (-0.00173)
     | > avg_G_adv_loss: 6.19600 (+0.07220)
     | > avg_loss_0: 20.88519 (+0.07047)
     | > avg_D_mse_gan_loss: 0.41815 (+0.00291)
     | > avg_D_mse_gan_real_loss: 0.03527 (+0.00155)
     | > avg_D_mse_gan_fake_loss: 0.04699 (-0.00054)
     | > avg_loss_1: 0.41815 (+0.00291)

Here’s step 9k:

 --> STEP: 325/349 -- GLOBAL_STEP: 9775
     | > G_l1_spec_loss: 0.31023  (0.32699)
     | > G_mse_fake_loss: 0.37565  (0.37749)
     | > G_feat_match_loss: 0.04670  (0.05104)
     | > G_gen_loss: 13.96039  (14.71434)
     | > G_adv_loss: 5.41960  (5.89001)
     | > loss_0: 19.37999  (20.60435)
     | > grad_norm_0: 0.00000  (0.00000)
     | > D_mse_gan_loss: 0.39573  (0.39730)
     | > D_mse_gan_real_loss: 0.04231  (0.04481)
     | > D_mse_gan_fake_loss: 0.03292  (0.03838)
     | > loss_1: 0.39573  (0.39730)
     | > grad_norm_1: 0.00000  (0.00000)
     | > current_lr_0: 5.6521398267571595e-08 
     | > current_lr_1: 5.6521398267571595e-08 
     | > step_time: 1.21540  (1.21563)
     | > loader_time: 0.00150  (0.00164)


 > EVALUATION 


  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.00039 (+0.00005)
     | > avg_G_l1_spec_loss: 0.32615 (+0.00004)
     | > avg_G_mse_fake_loss: 0.36399 (-0.00736)
     | > avg_G_feat_match_loss: 0.05499 (-0.00002)
     | > avg_G_gen_loss: 14.67659 (+0.00172)
     | > avg_G_adv_loss: 6.30315 (-0.00954)
     | > avg_loss_0: 20.97974 (-0.00782)
     | > avg_D_mse_gan_loss: 0.41710 (+0.00061)
     | > avg_D_mse_gan_real_loss: 0.03614 (-0.00112)
     | > avg_D_mse_gan_fake_loss: 0.04358 (+0.00183)
     | > avg_loss_1: 0.41710 (+0.00061)
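
One detail in these logs accounts for the plateau: current_lr has collapsed to 2.66e-6 by global step 5925 and 5.65e-8 by step 9775. Both values match an exponential decay lr = lr0 * gamma**step with lr0 = 1e-3 (the configured lr_gen/lr_disc) and gamma = 0.999, i.e. an ExponentialLR schedule stepped once per training iteration; the per-iteration stepping is inferred from the numbers, not stated in the thread. A quick check:

# Sanity check: exponential decay reproduces the logged learning rates,
# assuming gamma = 0.999 applied once per training step.
lr0, gamma = 1e-3, 0.999
for step in (5925, 9775):
    print(step, lr0 * gamma ** step)
# 5925 -> ~2.66e-06 (logged: 2.6612e-06)
# 9775 -> ~5.66e-08 (logged: 5.6521e-08)

At those magnitudes the optimizer is effectively frozen, which would explain "no improvement after 8k steps" regardless of GPU count.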


