
[Bug] IndexError while training new VITS LJSpeech recipe

šŸ› Description

Training on the new VITS LJSpeech recipe crashed; here is the output:

   --> STEP: 395/405 -- GLOBAL_STEP: 246025
     | > loss_disc: nan  (nan)
     | > loss_disc_real_0: nan  (nan)
     | > loss_disc_real_1: nan  (nan)
     | > loss_disc_real_2: nan  (nan)
     | > loss_disc_real_3: nan  (nan)
     | > loss_disc_real_4: nan  (nan)
     | > loss_disc_real_5: nan  (nan)
     | > amp_scaler: 0.00000  (0.00000)
     | > loss_0: nan  (nan)
     | > grad_norm_0: 0.00000  (0.00000)
     | > loss_gen: nan  (nan)
     | > loss_kl: nan  (nan)
     | > loss_feat: nan  (nan)
     | > loss_mel: 21.53698  (21.33673)
     | > loss_duration: nan  (nan)
     | > loss_1: nan  (nan)
     | > grad_norm_1: 0.00000  (0.07407)
     | > current_lr_0: 0.00019
     | > current_lr_1: 0.00019
     | > step_time: 0.75520  (0.56876)
     | > loader_time: 0.05890  (0.04195)

 ! Run is kept in /home/fijipants/repo/coqui-0.6.1/runs/vits_ljspeech-March-07-2022_11+31AM-0cf3265a
Traceback (most recent call last):
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 1403, in fit
    self._fit()
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 1387, in _fit
    self.train_epoch()
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 1167, in train_epoch
    _, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 1031, in train_step
    step_optimizer=step_optimizer,
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 888, in _optimize
    outputs, loss_dict = self._model_train_step(batch, model, criterion, optimizer_idx=optimizer_idx)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/trainer/trainer.py", line 846, in _model_train_step
    return model.train_step(*input_args)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/TTS/tts/models/vits.py", line 1062, in train_step
    aux_input={"d_vectors": d_vectors, "speaker_ids": speaker_ids, "language_ids": language_ids},
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/TTS/tts/models/vits.py", line 875, in forward
    outputs, attn = self.forward_mas(outputs, z_p, m_p, logs_p, x, x_mask, y_mask, g=g, lang_emb=lang_emb)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/TTS/tts/models/vits.py", line 784, in forward_mas
    attn = maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()  # [b, 1, t, t']
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/TTS/tts/utils/helpers.py", line 177, in maximum_path
    return maximum_path_numpy(value, mask)
  File "/home/fijipants/miniconda3/envs/coqui-0.6.1/lib/python3.7/site-packages/TTS/tts/utils/helpers.py", line 234, in maximum_path_numpy
    path[index_range, index, j] = 1
IndexError: index -329 is out of bounds for axis 1 with size 328
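
The IndexError itself looks like a downstream symptom: by this step every loss except loss_mel is already NaN and amp_scaler has collapsed to 0, so the log-probabilities handed to maximum_path very likely contain NaN/Inf and the path backtracking walks out of range. As a debugging aid, here is a minimal sketch (a hypothetical helper, not part of TTS) that fails fast on non-finite tensors at the assumed call site in forward_mas, turning the opaque IndexError into a clear divergence error:

import torch

def assert_finite(name: str, tensor: torch.Tensor) -> None:
    # Hypothetical debugging helper (not part of TTS): raise a readable error
    # instead of letting NaN/Inf propagate into maximum_path's index math.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf; training has likely diverged")

# Assumed call site, mirroring forward_mas in TTS/tts/models/vits.py:
#   assert_finite("logp", logp)
#   attn = maximum_path(logp, attn_mask.squeeze(1)).unsqueeze(1).detach()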

To Reproduce

  • Modify the VITS LJSpeech recipe's dataset_config to point to your local LJSpeech folder (see the sketch after this list).
  • Run the training with CUDA_VISIBLE_DEVICES=0.
  • Wait 246k steps and pray.
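
For the dataset edit in the first step, a minimal sketch of what is meant, assuming the field names and import path used by the TTS 0.6.x LJSpeech recipes (check the recipe shipped with your installed version for the exact signature):

from TTS.tts.configs.shared_configs import BaseDatasetConfig

# Sketch only: field names assumed from the TTS 0.6.x LJSpeech recipes.
dataset_config = BaseDatasetConfig(
    name="ljspeech",                 # dataset formatter
    meta_file_train="metadata.csv",  # LJSpeech transcript file
    path="/path/to/LJSpeech-1.1/",   # point this at your local LJSpeech folder
)

# Then launch the recipe on a single GPU, e.g.:
#   CUDA_VISIBLE_DEVICES=0 python train_vits.py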

Expected behavior

It doesn't crash.

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce RTX 3090",
            "NVIDIA GeForce RTX 3090"
        ],
        "available": true,
        "version": "11.3"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.10.2",
        "TTS": "0.6.1",
        "numpy": "1.21.2"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "x86_64",
        "python": "3.7.11",
        "version": "#202202230823 SMP PREEMPT Wed Feb 23 14:53:24 UTC 2022"
    }
}

Additional context

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
fijipants commented, Mar 12, 2022

> @fijipants Are you running it on the new release (0.6.1)? Or the old one? Did the model train okay before 246k? I'm just curious because I'm considering upgrading TTS to the new release.

It's on the new release (0.6.1), and it trained pretty well up to around 246k, but then it started to get very weird.

Here are some samples:

230k:

https://user-images.githubusercontent.com/88913682/158008781-bd03c25f-439a-4df3-a9db-5b2f8ee5013b.mp4

244k:

https://user-images.githubusercontent.com/88913682/158008786-3f1545cc-c169-4108-ab10-d65f84843cbe.mp4

245k:

https://user-images.githubusercontent.com/88913682/158008789-fe7cb1b0-c5fb-4110-bd2c-f0262a78b9b6.mp4

At least it's much better than the results I had for v0.5.0 (you can see them in #1309).

I tried resuming the current training but it only got worse, and around 260k all the values became NaN and the audio became a blaringly loud noise. I've since started a new training from scratch which hopefully won't run into this issue, and if it does, I'll make another bug report.

0 reactions
e0xextazy commented, Mar 28, 2022

To wrap this up:

How did you get rid of the background noise in the navy-400k-fp32.mp4 example? Because the navy-400k.mp4 example has it
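
The -fp32 suffix in those filenames suggests the later run was trained without mixed precision, which would also line up with the amp_scaler: 0.00000 collapse in the crash log above. A hedged sketch of that change, assuming the mixed_precision flag exposed by TTS's base training config (this is not a confirmed fix for the IndexError):

from TTS.tts.configs.vits_config import VitsConfig

# Sketch only: turn off AMP when retraining and keep the rest of the recipe's
# settings unchanged. The flag name is assumed from TTS's base training config.
config = VitsConfig(
    mixed_precision=False,
    # ... other recipe settings as before ...
)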
