"RuntimeError: [!] NaN loss with loss" on GlowTTS introduction example - mailabs dataset
Describe the bug
After running the introduction example on the single-speaker LJSpeech dataset, I switched to the MAILABS format, where the speakers are derived from the folder structure. I now get an exception after 75 steps.
I set mixed_precision=False per this related bug, but still observe this behavior.
To Reproduce
Run the tutorial with the modified config.
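For reference, a minimal sketch of what such a modified training script could look like, following the Coqui TTS 0.7.x Glow TTS recipe layout. The dataset path, batch sizes, and phoneme settings below are placeholders and assumptions, not values taken from the report, and the exact SpeakerManager helper names can differ slightly between TTS versions. The relevant changes are name="mailabs" (speakers inferred from the folder structure), use_speaker_embedding=True, and mixed_precision=False.

import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.glow_tts_config import GlowTTSConfig
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.glow_tts import GlowTTS
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

output_path = "/apps/tts/data/output"  # placeholder

# MAILABS-style dataset: speaker names are derived from the folder structure.
dataset_config = BaseDatasetConfig(
    name="mailabs",                       # built-in formatter for the M-AILABS layout
    meta_file_train=None,
    path="/apps/tts/data/mailabs/en_US",  # placeholder path
)

config = GlowTTSConfig(
    batch_size=32,                        # assumption
    eval_batch_size=16,                   # assumption
    run_eval=True,
    epochs=1000,
    text_cleaner="phoneme_cleaners",
    use_phonemes=True,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    print_step=25,
    mixed_precision=False,                # disabled, as in the report
    use_speaker_embedding=True,           # multi-speaker training
    output_path=output_path,
    datasets=[dataset_config],
)

ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

# Build the speaker ID map from the loaded samples (pattern from the multi-speaker recipes).
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.num_speakers = speaker_manager.num_speakers

model = GlowTTS(config, ap, tokenizer, speaker_manager=speaker_manager)

trainer = Trainer(
    TrainerArgs(), config, output_path,
    model=model, train_samples=train_samples, eval_samples=eval_samples,
)
trainer.fit()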
Expected behavior
No response
Logs
! Run is kept in /apps/tts/data/output/glow_tts_en-June-23-2022_02+23PM-00e67092
Traceback (most recent call last):
  File "/apps/tts/Trainer/trainer/trainer.py", line 1501, in fit
    self._fit()
  File "/apps/tts/Trainer/trainer/trainer.py", line 1485, in _fit
    self.train_epoch()
  File "/apps/tts/Trainer/trainer/trainer.py", line 1259, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/apps/tts/Trainer/trainer/trainer.py", line 1101, in train_step
    num_optimizers=len(self.optimizer) if isinstance(self.optimizer, list) else 1,
  File "/apps/tts/Trainer/trainer/trainer.py", line 979, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/apps/tts/Trainer/trainer/trainer.py", line 935, in _model_train_step
    return model.train_step(*input_args)
  File "/apps/tts/TTS/TTS/tts/models/glow_tts.py", line 425, in train_step
    text_lengths,
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/apps/tts/TTS/TTS/tts/layers/losses.py", line 494, in forward
    raise RuntimeError(f" [!] NaN loss with {key}.")
RuntimeError: [!] NaN loss with loss.
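The last frame shows where the run aborts: the Glow TTS loss layer checks every loss term it returns and raises as soon as one is NaN. A minimal sketch of that guard pattern (assert_finite_losses is a hypothetical name, not the exact Coqui implementation):

import torch

def assert_finite_losses(return_dict):
    # Raise as soon as any loss term in the dict is NaN, mirroring the
    # check that fires in TTS/tts/layers/losses.py during train_step.
    for key, value in return_dict.items():
        if torch.is_tensor(value) and torch.isnan(value).any():
            raise RuntimeError(f" [!] NaN loss with {key}.")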
Environment
{
    "CUDA": {
        "GPU": [
            "A100-SXM4-40GB"
        ],
        "available": true,
        "version": "11.5"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu115",
        "TTS": "0.7.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#77~18.04.1-Ubuntu SMP Thu Apr 7 21:38:47 UTC 2022"
    }
}
Additional context
No response
Top GitHub Comments
I think the two issues are related, but not duplicates. This issue (#1683) is about the MAILABS format causing a Glow TTS exception, while #1750 is about the average loss staying constant when training Glow TTS on the Spanish LJSpeech dataset. The overlap is that once I switched the Spanish dataset run to mixed precision, I observed the same exception as the one described in this issue (#1683).
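Since the error only reports the aggregated loss term, one way to narrow down where non-finite values first appear during a failing step is to hook every module's forward pass. A minimal sketch; register_nan_hooks is a hypothetical helper, not part of Coqui TTS or the Trainer:

import torch

def register_nan_hooks(model):
    # Attach forward hooks that flag the first module whose output
    # contains a NaN/Inf, to localize where the bad values originate.
    def _check(module, inputs, output):
        tensors = output if isinstance(output, (tuple, list)) else (output,)
        for t in tensors:
            if torch.is_tensor(t) and not torch.isfinite(t).all():
                raise RuntimeError(f"Non-finite output from {module.__class__.__name__}")
    return [m.register_forward_hook(_check) for m in model.modules()]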
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.