
[Bug] "RuntimeError: [!] NaN loss with loss" on GlowTTS introduction example

See original GitHub issue

Describe the bug
I’m attempting to run the introductory tutorial here and getting a RuntimeError: [!] NaN loss with loss on my system, running Linux Mint 19.3 with CUDA 11.5 and PyTorch 1.9.0.

To Reproduce

'''

A simple script for fitting a GlowTTS model to LJSpeech

Coopted from the coqui docs
https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html

2021 Nov 4 ~~ Jonathan Reus

'''
import os
from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))
datasets_dir = os.path.abspath(os.path.join(output_path, "../../../datasets"))
dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(datasets_dir, "LJSpeech-1.1")
)

config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor(**config.audio.to_dict())
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)
model = GlowTTS(config, speaker_manager=None)
trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},  # assets are objetcs used by the models but not class members.
)
trainer.fit()

and then…

$ python train.py 
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:45
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 | > Found 13100 files in datasets/LJSpeech-1.1
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
 > Using CUDA:  True
 > Number of GPUs:  1

 > Model has 28610065 parameters

 > EPOCH: 0/1000
 --> glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000

 > DataLoader initialization
 | > Use phonemes: True
   | > phoneme language: en-us
 | > Number of instances : 12969
 | > Max length sequence: 188
 | > Min length sequence: 13
 | > Avg length sequence: 100.90014650319993
 | > Num. instances discarded by max-min (max=500, min=3) seq limits: 0
 | > Batch group size: 0.

 > TRAINING (2021-11-04 17:32:23) 
/miniconda3/envs/data/lib/python3.9/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448255797/work/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)

   --> STEP: 0/405 -- GLOBAL_STEP: 0
     | > current_lr: 2.5e-07 
     | > step_time: 1.49580  (1.49579)
     | > loader_time: 0.26150  (0.26146)

 ! Run is removed from /glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000
Traceback (most recent call last):
  File "/TTS/TTS/trainer.py", line 1007, in fit
    self._fit()
  File "/TTS/TTS/trainer.py", line 992, in _fit
    self.train_epoch()
  File "/TTS/TTS/trainer.py", line 820, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/TTS/TTS/trainer.py", line 690, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/TTS/TTS/trainer.py", line 601, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/TTS/TTS/trainer.py", line 560, in _model_train_step
    return model.train_step(*input_args)
  File "/TTS/TTS/tts/models/glow_tts.py", line 381, in train_step
    loss_dict = criterion(
  File "/miniconda3/envs/data/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/TTS/TTS/tts/layers/losses.py", line 437, in forward
    raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError:  [!] NaN loss with loss.

Environment (please complete the following information):

  • Linux Mint 19.3
  • PyTorch 1.9.0
  • Python 3.9.6
  • CUDA cuda_11.5.r11.5/compiler.30411180_0
  • CUDNN 8.3
  • GPU: NVIDIA GeForce GTX 1650 with Max-Q Design / Memory 4096 MB / CUDA cores 1024

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
jreus commented, Nov 20, 2021

@WeberJulian got it, thank you! Although I read on the coqui FAQ that batch sizes < 32 tend not to converge. I wonder if the coqui team would consider providing a set of “recommended hardware specs” for training and running the coqui models? I for one would find this very helpful!

0 reactions
WeberJulian commented, Nov 20, 2021

I think that issue is specific to Tacotron; GlowTTS doesn’t have any issues with alignment convergence. It’s hard to pinpoint minimum specs for each model since the config might affect the hardware requirements. But you can do most of the work on Colab.
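
The thread doesn’t spell out the exact fix, but jreus’s reply suggests the advice was to use a smaller batch size on the 4 GB GTX 1650; disabling mixed precision is another common mitigation for NaN losses. A minimal sketch of those tweaks, applied to the config object from the reproduction script above (an assumption based on the discussion, not a confirmed fix from the maintainers):

# Hypothetical tweaks for a 4 GB GPU such as the GTX 1650 above; based on
# the thread, not a confirmed fix from the maintainers.
config.batch_size = 16          # smaller batches to fit into 4 GB of VRAM
config.eval_batch_size = 8
config.mixed_precision = False  # fp16 overflow is a common source of NaN losses

The rest of the script stays the same; whether this resolves the NaN depends on the actual root cause.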

Read more comments on GitHub >

Top Results From Across the Web

Glow TTS Avg Loss Not Decreasing - Spanish LJSpeech ...
Describe the bug When I train Glow TTS on LJSpeech Spanish set (angelina ... "RuntimeError: [!] NaN loss with loss" on GlowTTS introduction...
Read more >
Deep-Learning Nan loss reasons - python - Stack Overflow
You may have an issue with the input data. Try calling assert not np.any(np.isnan(x)) on the input data to make sure you are...
Read more >
Re-training SSD-Mobilenet: gt_locations consist of nan values ...
While training, my Avg Loss is reducing slowly but suddenly I'm getting NaN. I followed the following methods but the issue still persists....
Read more >
1.1.8 PDF - PyTorch Lightning Documentation
In this guide we'll show you how to organize your PyTorch code into Lightning in 2 steps. Organizing your code with PyTorch Lightning...
Read more >
Torch.clamp backward got nan values - PyTorch Forums
In my codes, I have used torch.clamp as follows: epsilon = 1e-6 ypred = torch.clamp(ypred, epsilon, 1-epsilon) and got error message as ...
Read more >
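
The "Deep-Learning Nan loss reasons" Stack Overflow result above suggests ruling out bad input data first. Below is a minimal sketch of that check over the LJSpeech wavs, assuming numpy and soundfile are available; the path and the check itself are illustrative and not part of the Coqui TTS pipeline:

import glob
import os

import numpy as np
import soundfile as sf  # assumption: soundfile is installed; any wav reader works

wav_dir = "datasets/LJSpeech-1.1/wavs"  # adjust to the dataset path used in the script above
for path in glob.glob(os.path.join(wav_dir, "*.wav")):
    data, _sr = sf.read(path)  # float64 samples by default
    assert not np.any(np.isnan(data)), f"NaN samples in {path}"
    assert not np.any(np.isinf(data)), f"Inf samples in {path}"
print("No NaN/Inf values found in the input wavs.")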
