[Bug] Error in training Capacitron

Describe the bug

I was training a Capacitron model with my own dataset (bn-BD, 12-hour). Training started successfully, but it stopped after 27 epochs (around 5500 steps) with the following error message:

...
ValueError: Expected parameter loc (Tensor of shape (48, 128)) of distribution MultivariateNormal(loc: torch.Size([48, 128]), covariance_matrix: torch.Size([48, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)

I believe it's a PyTorch issue. Can someone guide me in solving this problem?
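A note on reading the error: the traceback posted under Logs shows the exception being raised from the `__init__` in `torch/distributions/distribution.py`, so PyTorch is rejecting a `mu` that already contains NaN by the time `capacitron_layers.py` builds the distribution. The NaNs therefore originate at some earlier step. The following is a minimal, pure-Python sketch of how one might locate the first non-finite value in a logged series. The helper `first_non_finite_step` and the `losses` values are hypothetical and made up for illustration; they are not part of TTS or the trainer.

```python
import math
from typing import Iterable, Optional


def first_non_finite_step(values: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite value, or None if all are finite."""
    for step, value in enumerate(values):
        if not math.isfinite(value):
            return step
    return None


# Hypothetical per-step loss values (assumption: invented numbers, not from the issue).
losses = [2.31, 1.87, 1.52, float("inf"), float("nan"), float("nan")]

# Index of the first step whose value is no longer finite.
bad_step = first_non_finite_step(losses)

# NaN propagates: arithmetic that involves a NaN yields NaN again, which is
# consistent with every printed entry of `loc` in the error being NaN.
running_mean = sum(losses[4:]) / len(losses[4:])
```

In an actual PyTorch training loop the analogous check would be `torch.isfinite(tensor).all()` on the loss or on `mu`. This is only a debugging aid for finding where the values first go bad, not the fix discussed in the issue.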

To Reproduce

I ran this experiment in Colab. Here's the notebook: link

Here’s the config.json file.

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1534, in fit
    self._fit()
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1518, in _fit
    self.train_epoch()
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1283, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 1124, in train_step
    num_optimizers=1,
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 998, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/usr/local/lib/python3.7/dist-packages/trainer/trainer.py", line 954, in _model_train_step
    return model.train_step(*input_args)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/tacotron2.py", line 327, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/tacotron2.py", line 198, in forward
    speaker_embedding=embedded_speakers if self.capacitron_vae.capacitron_use_speaker_embedding else None,
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/models/base_tacotron.py", line 257, in compute_capacitron_VAE_embedding
    speaker_embedding,  # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/TTS/tts/layers/tacotron/capacitron_layers.py", line 66, in forward
    self.approximate_posterior_distribution = MVN(mu, torch.diag_embed(sigma))
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter loc (Tensor of shape (48, 128)) of distribution MultivariateNormal(loc: torch.Size([48, 128]), covariance_matrix: torch.Size([48, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)

Environment

{
    "CUDA": {
        "GPU": [
            "Tesla T4"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.12.0+cu113",
        "TTS": "0.7.1",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#1 SMP Sun Apr 24 10:03:06 PDT 2022"
    }
}

Additional context

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 13 (7 by maintainers)

Top GitHub Comments

erogol commented, Aug 7, 2022

manmay-nakhashi commented, Aug 27, 2022
Traceback (most recent call last):
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1533, in fit
    self._fit()
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1517, in _fit
    self.train_epoch()
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1282, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 1114, in train_step
    outputs, loss_dict_new, step_time = self._optimize(
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 998, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/trainer-0.0.14-py3.8.egg/trainer/trainer.py", line 954, in _model_train_step
    return model.train_step(*input_args)
  File "/home/manmay/TTS/TTS/tts/models/tacotron2.py", line 352, in train_step
    outputs = self.forward(text_input, text_lengths, mel_input, mel_lengths, aux_input)
  File "/home/manmay/TTS/TTS/tts/models/tacotron2.py", line 216, in forward
    encoder_outputs, *capacitron_vae_outputs = self.compute_capacitron_VAE_embedding(
  File "/home/manmay/TTS/TTS/tts/models/base_tacotron.py", line 254, in compute_capacitron_VAE_embedding
    (VAE_outputs, posterior_distribution, prior_distribution, capacitron_beta,) = self.capacitron_vae_layer(
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/manmay/TTS/TTS/tts/layers/tacotron/capacitron_layers.py", line 67, in forward
    self.approximate_posterior_distribution = MVN(mu, torch.diag_embed(sigma))
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/distributions/multivariate_normal.py", line 146, in __init__
    super(MultivariateNormal, self).__init__(batch_shape, event_shape, validate_args=validate_args)
  File "/opt/conda/envs/coqui/lib/python3.8/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 128)) of distribution MultivariateNormal(loc: torch.Size([128, 128]), covariance_matrix: torch.Size([128, 128, 128])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<ExpandBackward0>)

Error in training Capacitron · Issue #1922 · coqui-ai/TTS - GitHub
Training capacitron is hard since it's pretty unstable. Try using the latest recipe since it improved stability (at least for alignments), you ...