[Bug] Error in training Capacitron
Describe the bug
I was training a Capacitron model on my own dataset (bn-BD, 12 hours). Training started successfully but stopped after 27 epochs (around 5,500 steps) with the following error message:
...
ValueError: Expected parameter loc (Tensor of shape (48, 128)) of distribution MultivariateNormal(loc: torch.Size([48, 128]), covariance_matrix: torch.Size([48, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)
I believe it’s a PyTorch issue. Can someone guide me in solving this problem?
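A minimal sketch of how this error can be reproduced in isolation, assuming only that `mu` already contains NaNs when it reaches the distribution constructor. The shapes and the `build_posterior` helper are illustrative and not taken from the TTS code base; only the `MVN(mu, torch.diag_embed(sigma))` call mirrors the line shown in the traceback.

```python
import torch
from torch.distributions import MultivariateNormal as MVN


def build_posterior(mu, sigma):
    # Hypothetical helper: same constructor call as in the traceback.
    return MVN(mu, torch.diag_embed(sigma))


# Assumed toy shapes (batch of 4, latent size 8) instead of (48, 128).
mu = torch.full((4, 8), float("nan"))  # mean already poisoned with NaNs
sigma = torch.ones(4, 8)               # valid, strictly positive variances

try:
    build_posterior(mu, sigma)
    raised = False
    message = ""
except ValueError as err:
    # Argument validation rejects the NaN mean before any sampling happens.
    raised = True
    message = str(err)

print(raised)
```

If this matches what happens during training, the constructor is only the first place the NaNs are detected, and their origin would be upstream (for example in the loss or in an earlier layer).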
To Reproduce
I was running this experiment in Colab. Here’s the notebook: link
Here’s the config.json file.
Expected behavior
No response
Logs
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1534, in fit
    self._fit()
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1518, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1283, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1124, in train_step
    num_optimizers=1,
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 998, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 954, in _model_train_step
    return model.train_step(*input_args)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/tacotron2.py", line 327, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/tacotron2.py", line 198, in forward
    speaker_embedding=embedded_speakers if self.capacitron_vae.capacitron_use_speaker_embedding else None,
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/base_tacotron.py", line 257, in compute_capacitron_VAE_embedding
    speaker_embedding, # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/layers/tacotron/capacitron_layers.py", line 66, in forward
    self.approximate_posterior_distribution = MVN(mu, torch.diag_embed(sigma))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter loc (Tensor of shape (48, 128)) of distribution MultivariateNormal(loc: torch.Size([48, 128]), covariance_matrix: torch.Size([48, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)
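One way to narrow down where the NaNs first appear is to scan parameters and gradients for non-finite values after each backward pass. The sketch below is a generic PyTorch check, not part of the Trainer or TTS API; `nonfinite_tensors` is a hypothetical helper and the `nn.Linear` model is a toy stand-in for the Tacotron2 model.

```python
import torch
from torch import nn


def nonfinite_tensors(model: nn.Module):
    """Return names of parameters (or their gradients) that hold NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name + ".grad")
    return bad


# Toy stand-in model (assumption): a single linear layer.
model = nn.Linear(4, 2)
clean = nonfinite_tensors(model)

# Simulate a diverged update by writing a NaN into one weight.
with torch.no_grad():
    model.weight[0, 0] = float("nan")
poisoned = nonfinite_tensors(model)

print(clean, poisoned)
```

Calling such a check once per step and logging the first step at which the list is non-empty would show whether the weights diverge before the `MultivariateNormal` constructor fails. PyTorch's `torch.autograd.set_detect_anomaly(True)` is a heavier alternative that reports the backward operation producing a NaN, at the cost of slower training.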
Environment
{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu113",
        "TTS": "0.7.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#1 SMP Sun Apr 24 10:03:06 PDT 2022"
    }
}
Additional context
No response
Issue Analytics
- Created a year ago
- Comments: 13 (7 by maintainers)
Top Results From Across the Web
Error in training Capacitron · Issue #1922 · coqui-ai/TTS - GitHub
Training capacitron is hard since it's pretty unstable. Try using the latest recipe since it improved stability (at least for alignments), you ...
Top GitHub Comments
@WeberJulian