[Bug] "RuntimeError: [!] NaN loss with loss" on GlowTTS introduction example
Describe the bug
I’m attempting to run the introductory tutorial here and getting a RuntimeError: [!] NaN loss with loss
on my system, which runs Linux Mint 19.3 with CUDA 11.5 and PyTorch 1.9.0.
To Reproduce
'''
A simple script for fitting a GlowTTS model to LJSpeech.
Co-opted from the coqui docs:
https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html
2021 Nov 4 ~~ Jonathan Reus
'''
import os

from TTS.trainer import Trainer, TrainingArgs
from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.utils.audio import AudioProcessor

output_path = os.path.dirname(os.path.abspath(__file__))
datasets_dir = os.path.abspath(os.path.join(output_path, "../../../datasets"))

dataset_config = BaseDatasetConfig(
    name="ljspeech", meta_file_train="metadata.csv", path=os.path.join(datasets_dir, "LJSpeech-1.1")
)

config = GlowTTSConfig(
    batch_size=32,
    eval_batch_size=16,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor(**config.audio.to_dict())

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = GlowTTS(config, speaker_manager=None)

trainer = Trainer(
    TrainingArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
    training_assets={"audio_processor": ap},  # assets are objects used by the models but not class members
)

trainer.fit()
and then…
$ python train.py
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > log_func:np.log10
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:None
| > spec_gain:20.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:45
| > do_sound_norm:False
| > do_amp_to_db_linear:True
| > do_amp_to_db_mel:True
| > stats_path:None
| > base:10
| > hop_length:256
| > win_length:1024
| > Found 13100 files in datasets/LJSpeech-1.1
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
fatal: not a git repository (or any parent up to mount point /)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
> Using CUDA: True
> Number of GPUs: 1
> Model has 28610065 parameters
> EPOCH: 0/1000
--> glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: en-us
| > Number of instances : 12969
| > Max length sequence: 188
| > Min length sequence: 13
| > Avg length sequence: 100.90014650319993
| > Num. instances discarded by max-min (max=500, min=3) seq limits: 0
| > Batch group size: 0.
> TRAINING (2021-11-04 17:32:23)
/miniconda3/envs/data/lib/python3.9/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /opt/conda/conda-bld/pytorch_1623448255797/work/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
--> STEP: 0/405 -- GLOBAL_STEP: 0
| > current_lr: 2.5e-07
| > step_time: 1.49580 (1.49579)
| > loader_time: 0.26150 (0.26146)
! Run is removed from /glowSimpleTTS/coqui_tts-November-04-2021_05+32PM-0000000
Traceback (most recent call last):
File "/TTS/TTS/trainer.py", line 1007, in fit
self._fit()
File "/TTS/TTS/trainer.py", line 992, in _fit
self.train_epoch()
File "/TTS/TTS/trainer.py", line 820, in train_epoch
_, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
File "/TTS/TTS/trainer.py", line 690, in train_step
outputs, loss_dict_new, step_time = self._optimize(
File "/TTS/TTS/trainer.py", line 601, in _optimize
outputs, loss_dict = self._model_train_step(batch, model, criterion)
File "/TTS/TTS/trainer.py", line 560, in _model_train_step
return model.train_step(*input_args)
File "/TTS/TTS/tts/models/glow_tts.py", line 381, in train_step
loss_dict = criterion(
File "/miniconda3/envs/data/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/TTS/TTS/tts/layers/losses.py", line 437, in forward
raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError: [!] NaN loss with loss.
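
The traceback shows the exception coming from the NaN guard in TTS/tts/layers/losses.py, i.e. the loss value itself turned NaN on the very first optimization step rather than later in training. One way to rule out bad input features before touching the model is a quick scan of the dataset, roughly as sketched below. This is only a sketch: it assumes the LJSpeech wavs sit under datasets/LJSpeech-1.1/wavs and that AudioProcessor.load_wav / melspectrogram behave as in the tutorial script above, so adjust paths and calls to match your TTS version.

# Hedged sketch: scan the dataset for NaN/Inf waveforms or mel features.
# The wav directory and the AudioProcessor methods are assumptions taken
# from the tutorial setup above, not something confirmed in this issue.
import glob
import os

import numpy as np

from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.utils.audio import AudioProcessor

config = GlowTTSConfig()                      # default audio settings
ap = AudioProcessor(**config.audio.to_dict())

wav_dir = os.path.join("datasets", "LJSpeech-1.1", "wavs")  # adjust to your layout
wav_paths = sorted(glob.glob(os.path.join(wav_dir, "*.wav")))

bad = []
for path in wav_paths:
    wav = ap.load_wav(path)        # float32 waveform
    mel = ap.melspectrogram(wav)   # (num_mels, T) mel spectrogram
    if not np.all(np.isfinite(wav)) or not np.all(np.isfinite(mel)):
        bad.append(path)

print(f"checked {len(wav_paths)} files, {len(bad)} with NaN/Inf features")
for path in bad[:10]:
    print("  ", path)

If that scan comes back clean, the NaN is more likely coming from the training numerics (e.g. mixed precision) than from the data itself.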
Environment (please complete the following information):
- Linux Mint 19.3
- PyTorch 1.9.0
- Python 3.9.6
- CUDA cuda_11.5.r11.5/compiler.30411180_0
- CUDNN 8.3
- GPU: NVIDIA GeForce GTX 1650 with Max-Q Design / Memory 4096 MB / CUDA cores 1024
Top GitHub Comments
@WeberJulian got it, thank you! Although I read on the coqui FAQ that batch sizes < 32 tend not to converge. I wonder if the coqui team would consider providing a set of “recommended hardware specs” for training and running the coqui models? I for one would find this very helpful!
I think that issue is specific to Tacotron; GlowTTS doesn’t have any issues with alignment convergence. It’s hard to pinpoint minimum specs for each model since the config can affect the hardware requirements, but you can do most of the work on Colab.
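
For anyone landing here with the same setup, a minimal variation of the tutorial config along the lines discussed above might look like the sketch below. Lowering batch_size follows the suggestion in the exchange above (with the convergence caveat from the FAQ); setting mixed_precision=False is a general NaN-loss mitigation on small GPUs and is an assumption here, not something confirmed by the maintainers in this thread.

# Hedged sketch: same GlowTTSConfig as the tutorial script, changing only
# batch_size / eval_batch_size (per the thread) and mixed_precision (assumed
# general mitigation, not confirmed in this issue).
config = GlowTTSConfig(
    batch_size=16,              # smaller than 32 to fit a 4 GB GTX 1650
    eval_batch_size=8,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    print_eval=False,
    mixed_precision=False,      # AMP off to avoid fp16 overflow/underflow NaNs
    output_path=output_path,
    datasets=[dataset_config],
)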