[Bug] Different behaviour of training HifiGan depending on number of GPUs used
Describe the bug
Running HifiGan training through distribute.py shows different stats from running HifiGan training through train_vocoder.py.
To Reproduce
Steps to reproduce the behavior:
- Run on a single GPU: CUDA_VISIBLE_DEVICES=0 python …/…/TTS/TTS/bin/distribute.py --script train_hifigan.py
- Observe output:
--> STEP: 25/349 -- GLOBAL_STEP: 9475
| > G_l1_spec_loss: 0.28888 (0.28544)
| > G_gen_loss: 12.99941 (12.84473)
| > G_adv_loss: 0.00000 (0.00000)
| > loss_0: 12.99941 (12.84473)
| > grad_norm_0: 94.30904 (85.09504)
| > current_lr_0: 0.00048
| > current_lr_1: 0.00100
| > step_time: 0.29350 (0.29282)
| > loader_time: 0.00140 (0.00138)
- Run another command: CUDA_VISIBLE_DEVICES=0 python …/…/TTS/TTS/bin/train_vocoder.py --config_path config.json
- Observe output:
--> STEP: 150/699 -- GLOBAL_STEP: 150
| > G_l1_spec_loss: 0.65889 (0.96479)
| > G_mse_fake_loss: 0.35978 (0.37755)
| > G_feat_match_loss: 0.02501 (0.01835)
| > G_gen_loss: 29.65002 (43.41537)
| > G_adv_loss: 3.06036 (2.35913)
| > loss_0: 32.71038 (45.77450)
| > grad_norm_0: 0.00000 (0.00000)
| > D_mse_gan_loss: 0.46084 (0.54663)
| > D_mse_gan_real_loss: 0.10781 (0.08349)
| > D_mse_gan_fake_loss: 0.02431 (0.06644)
| > loss_1: 0.46084 (0.54663)
| > grad_norm_1: 0.00000 (0.00000)
| > current_lr_0: 0.00086
| > current_lr_1: 0.00086
| > step_time: 1.24380 (1.24340)
| > loader_time: 0.00150 (0.00167)
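The two outputs above differ mainly in which metrics are printed and which are stuck at zero. A minimal sketch for diffing two such step logs follows; the helper names (`parse_step_log`, `missing_metrics`, `zero_metrics`) are hypothetical and not part of TTS, and the sample strings are excerpts of the logs in this report.

```python
import re

# Matches lines like "| > G_gen_loss: 12.99941 (12.84473)".
# The running average in parentheses is optional ("| > current_lr_0: 0.00048").
METRIC_RE = re.compile(r"\|\s*>\s*(\w+):\s*([-+\d.eE]+)(?:\s*\(([-+\d.eE]+)\))?")


def parse_step_log(text):
    """Return {metric_name: current_value} for one printed training step."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_RE.search(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def missing_metrics(reference_log, other_log):
    """Metric names printed in reference_log but absent from other_log."""
    return sorted(set(parse_step_log(reference_log)) - set(parse_step_log(other_log)))


def zero_metrics(log):
    """Metric names whose current value is exactly zero."""
    return sorted(name for name, value in parse_step_log(log).items() if value == 0.0)


# Excerpts copied from the two runs in this report.
DISTRIBUTE_LOG = """
--> STEP: 25/349 -- GLOBAL_STEP: 9475
| > G_l1_spec_loss: 0.28888 (0.28544)
| > G_gen_loss: 12.99941 (12.84473)
| > G_adv_loss: 0.00000 (0.00000)
| > loss_0: 12.99941 (12.84473)
| > grad_norm_0: 94.30904 (85.09504)
"""

TRAIN_VOCODER_LOG = """
--> STEP: 150/699 -- GLOBAL_STEP: 150
| > G_l1_spec_loss: 0.65889 (0.96479)
| > G_mse_fake_loss: 0.35978 (0.37755)
| > G_feat_match_loss: 0.02501 (0.01835)
| > G_gen_loss: 29.65002 (43.41537)
| > G_adv_loss: 3.06036 (2.35913)
| > loss_0: 32.71038 (45.77450)
| > grad_norm_0: 0.00000 (0.00000)
| > D_mse_gan_loss: 0.46084 (0.54663)
| > loss_1: 0.46084 (0.54663)
"""

MISSING = missing_metrics(TRAIN_VOCODER_LOG, DISTRIBUTE_LOG)
print("missing from distribute.py run:", MISSING)
print("zero in distribute.py run:", zero_metrics(DISTRIBUTE_LOG))
print("zero in train_vocoder.py run:", zero_metrics(TRAIN_VOCODER_LOG))
```

On these excerpts the distribute.py run prints no discriminator metrics and reports a zero adversarial loss, while the train_vocoder.py run reports a zero gradient norm.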
Expected behavior
Both ways of launching training should behave the same.
Environment:
- OS Platform and Distribution: Ubuntu 20.04
- PyTorch or TensorFlow version: pytorch 1.10.0
- Python version: 3.8.11
- CUDA/cuDNN version: py3.8_cuda11.3_cudnn8.2.0_0
- GPU model and memory: 2xRTX 3090
Additional context
TTS version 4.0.2.-dev
Here’s train_hifigan.py
import os
from TTS.trainer import Trainer, TrainingArgs
from TTS.utils.audio import AudioProcessor
from TTS.vocoder.configs import HifiganConfig
from TTS.vocoder.datasets.preprocess import load_wav_data
from TTS.vocoder.models.gan import GAN
output_path = os.path.dirname(os.path.abspath(__file__))
config = HifiganConfig(
    batch_size=64,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=5,
    epochs=1000,
    seq_len=8192,
    pad_short=2000,
    use_noise_augment=True,
    eval_split_size=10,
    print_step=25,
    print_eval=False,
    mixed_precision=False,
    lr_gen=1e-3,
    lr_disc=1e-3,
    data_path=os.path.join(output_path, "../datasets/vctk_all_wavs"),
    output_path=output_path,
)
# init audio processor
ap = AudioProcessor(**config.audio.to_dict())
# load training samples
eval_samples, train_samples = load_wav_data(config.data_path, config.eval_split_size)
# init model
model = GAN(config)
# init the trainer and 🚀
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},
)
trainer.fit()
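Both commands in this report are launched with CUDA_VISIBLE_DEVICES set, so it may help to log which devices a run can actually see before comparing results. A minimal sketch using only the standard library follows; `visible_gpu_ids` is a hypothetical helper, not part of TTS, and it only reads the environment variable without querying the CUDA driver.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (no restriction is applied)
    and an empty list when it is set to an empty string (no devices).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset; all GPUs are visible")
    else:
        print(f"{len(ids)} GPU(s) visible: {ids}")
```

Printing this at the top of train_hifigan.py would make it explicit in each log how many GPUs the run was given.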
Issue Analytics
- Created 2 years ago
- Comments: 13 (2 by maintainers)
Top GitHub Comments
UPDATE: actually, all that matters for getting different output is the number of GPUs. With this command
the model trains as expected:
Thank you, the training did indeed start working on 2 GPUs, but I see no improvement after 8k steps:
Here’s step 6k:
Here’s step 9k: