[Bug] "RuntimeError: [!] NaN loss with loss" on GlowTTS introduction example
Describe the bug
I’m attempting to run the introductory tutorial here and getting a RuntimeError: [!] NaN loss with loss
on my system, which runs Linux Mint 19.3 with CUDA 11.5 and PyTorch 1.9.0.
To Reproduce
'''
A simple script for fitting a GlowTTS model to LJSpeech.
Co-opted from the coqui docs:
https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html
2021 Nov 4 ~~ Jonathan Reus
'''
import os

from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))
datasets_dir = os.path.abspath(os.path.join(output_path, "../../../datasets"))

dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(datasets_dir, "LJSpeech-1.1")
)

config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor(**config.audio.to_dict())

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, speaker_manager=None)

trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},  # assets are objects used by the models but not class members
)

trainer.fit()
and then…
$ python train.py
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:45
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
| > Found 13100 files in datasets/LJSpeech-1.1
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> Using CUDA: True
> Number of GPUs: 1
> Model has 28610065 parameters
> EPOCH: 0/1000
--> glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=500, min=3) seq limits: 0
| > Batch group size: 0.
> TRAINING (2021-11-04 17:32:23)
/miniconda3/envs/data/lib/python3.9/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /opt/conda/conda-bld/pytorch_1623448255797/work/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
--> STEP: 0/405 -- GLOBAL_STEP: 0
| > current_lr: 2.5e-07
| > step_time: 1.49580 (1.49579)
| > loader_time: 0.26150 (0.26146)
! Run is removed from /glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000
Traceback (most recent call last):
File "/TTS/TTS/trainer.py", line 1007, in fit
self._fit()
File "/TTS/TTS/trainer.py", line 992, in _fit
self.train_epoch()
File "/TTS/TTS/trainer.py", line 820, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/TTS/TTS/trainer.py", line 690, in train_step
outputs, loss_dict_new, step_time = self._optimize(
File "/TTS/TTS/trainer.py", line 601, in _optimize
outputs, loss_dict = self._model_train_step(batch, model, criterion)
File "/TTS/TTS/trainer.py", line 560, in _model_train_step
return model.train_step(*input_args)
File "/TTS/TTS/tts/models/glow_tts.py", line 381, in train_step
loss_dict = criterion(
File "/miniconda3/envs/data/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/TTS/TTS/tts/layers/losses.py", line 437, in forward
raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError: [!] NaN loss with loss.
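
The traceback shows the exception coming from the NaN guard in TTS/tts/layers/losses.py, i.e. the loss value itself turned NaN on the very first optimization step rather than later in training. One way to rule out bad input features before touching the model is a quick scan of the dataset, roughly as sketched below. This is only a sketch: it assumes the LJSpeech wavs sit under datasets/LJSpeech-1.1/wavs and that AudioProcessor.load_wav / melspectrogram behave as in the tutorial script above, so adjust paths and calls to match your TTS version.

# Hedged sketch: scan the dataset for NaN/Inf waveforms or mel features.
# The wav directory and the AudioProcessor methods are assumptions taken
# from the tutorial setup above, not something confirmed in this issue.
import glob
import os

import numpy as np

from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.utils.audio import AudioProcessor

config = GlowTTSConfig()                      # default audio settings
ap = AudioProcessor(**config.audio.to_dict())

wav_dir = os.path.join("datasets", "LJSpeech-1.1", "wavs")  # adjust to your layout
wav_paths = sorted(glob.glob(os.path.join(wav_dir, "*.wav")))

bad = []
for path in wav_paths:
    wav = ap.load_wav(path)        # float32 waveform
    mel = ap.melspectrogram(wav)   # (num_mels, T) mel spectrogram
    if not np.all(np.isfinite(wav)) or not np.all(np.isfinite(mel)):
        bad.append(path)

print(f"checked {len(wav_paths)} files, {len(bad)} with NaN/Inf features")
for path in bad[:10]:
    print("  ", path)

If that scan comes back clean, the NaN is more likely coming from the training numerics (e.g. mixed precision) than from the data itself.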
Environment (please complete the following information):
- Linux Mint 19.3
- PyTorch 1.9.0
- Python 3.9.6
- CUDA cuda_11.5.r11.5/compiler.30411180_0
- CUDNN 8.3
- GPU: NVIDIA GeForce GTX 1650 with Max-Q Design / Memory 4096 MB / CUDA cores 1024
Top GitHub Comments
@WeberJulian got it, thank you! Although I read on the coqui FAQ that batch sizes < 32 tend not to converge. I wonder if the coqui team would consider providing a set of “recommended hardware specs” for training and running the coqui models? I for one would find this very helpful!
I think that issue is specific to Tacotron; GlowTTS doesn’t have any issues with alignment convergence. It’s hard to pinpoint minimum specs for each model since the config can affect the hardware requirements, but you can do most of the work on Colab.
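
For anyone landing here with the same setup, a minimal variation of the tutorial config along the lines discussed above might look like the sketch below. Lowering batch_size follows the suggestion in the exchange above (with the convergence caveat from the FAQ); setting mixed_precision=False is a general NaN-loss mitigation on small GPUs and is an assumption here, not something confirmed by the maintainers in this thread.

# Hedged sketch: same GlowTTSConfig as the tutorial script, changing only
# batch_size / eval_batch_size (per the thread) and mixed_precision (assumed
# general mitigation, not confirmed in this issue).
config = GlowTTSConfig(
    batch_size=16,              # smaller than 32 to fit a 4 GB GTX 1650
    eval_batch_size=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=False,      # AMP off to avoid fp16 overflow/underflow NaNs
    output_path=output_path,
    datasets=[dataset_config],
)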