
Add new speakers and resume training from checkpoint in speaker_id

See original GitHub issue

Hi,

I’ve used your speaker_id module to train a model on a custom dataset. Initially, the n_classes parameter was set to 4 in train.yaml. Now I would like to increase this parameter, add new speakers, and resume training from the saved checkpoint. I’ve tried doing this but encountered the following error:

The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
torchvision is not available - cannot save figures
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
The torchaudio backend is switched to 'soundfile'. Note that 'sox_io' is not supported on Windows.
./data\rirs_noises.zip exists. Skipping download
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: ./results/custom_augment
mini_librispeech_prepare - Creating train.json, valid.json, and test.json
mini_librispeech_prepare - train.json successfully created!
mini_librispeech_prepare - valid.json successfully created!
mini_librispeech_prepare - test.json successfully created!
speechbrain.dataio.encoder - Load called, but CategoricalEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Info: ckpt_interval_minutes arg from hparam file is used
speechbrain.core - 4.5M trainable parameters in SpkIdBrain
speechbrain.utils.checkpoints - Loading a checkpoint from results\custom_augment\save\CKPT+2022-09-26+05-27-32+00
speechbrain.core - Exception:
Traceback (most recent call last):
  File "E:\SpeechBrain\speechbrain\templates\speaker_id\train.py", line 328, in <module>
    spk_id_brain.fit(
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\core.py", line 1143, in fit
    self.on_fit_start()
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\core.py", line 797, in on_fit_start
    self.checkpointer.recover_if_possible(
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\utils\checkpoints.py", line 840, in recover_if_possible
    self.load_checkpoint(chosen_ckpt, device)
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\utils\checkpoints.py", line 853, in load_checkpoint
    self._call_load_hooks(checkpoint, device)
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\utils\checkpoints.py", line 988, in _call_load_hooks
    default_hook(obj, loadpath, end_of_epoch, device)
  File "E:\SpeechBrain\long-speech\lib\site-packages\speechbrain\utils\checkpoints.py", line 93, in torch_recovery
    obj.load_state_dict(torch.load(path, map_location=device), strict=True)
  File "E:\SpeechBrain\long-speech\lib\site-packages\torch\nn\modules\module.py", line 1482, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for Classifier:
        size mismatch for out.w.weight: copying a param with shape torch.Size([4, 512]) from checkpoint, the shape in current model is torch.Size([7, 512]).
        size mismatch for out.w.bias: copying a param with shape torch.Size([4]) from checkpoint, the shape in current model is torch.Size([7]).

I’ve tried increasing the n_classes parameter to 7 and making the additions in label_encoder.txt manually. I understand that this addition of new speakers is causing the issue. Is there any solution or workaround for this, so I can continue building on the existing model while taking advantage of the checkpointing feature?

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 8

Top GitHub Comments

1 reaction
eviltypha commented, Oct 6, 2022

Hi @anautsch,

model = torch.load("results/.../save/CKPT.../classifier.ckpt")
model['out.w.weight'] # Edit
model['out.w.bias'] # Edit

out.w.weight has dimensions [n_classes, emb_dim], so you’ll have to append n rows of zeros, where n is the number of speakers you want to add. out.w.bias has dimensions [n_classes]; for this, append n zeros to the end. A sketch of this edit follows below.
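For illustration, here is a minimal sketch of that edit using torch.cat, assuming classifier.ckpt is a plain state_dict (consistent with the torch_recovery call in the traceback above) and reusing the checkpoint folder from the log; n_new is the number of speakers being added (3 for the 4 -> 7 change in this issue):

import torch

n_new = 3
ckpt = "results/custom_augment/save/CKPT+2022-09-26+05-27-32+00/classifier.ckpt"
state = torch.load(ckpt, map_location="cpu")

w = state['out.w.weight']  # shape [4, 512]
b = state['out.w.bias']    # shape [4]

# Append zero rows/entries for the new speakers: [4, 512] -> [7, 512], [4] -> [7]
state['out.w.weight'] = torch.cat([w, torch.zeros(n_new, w.shape[1])], dim=0)
state['out.w.bias'] = torch.cat([b, torch.zeros(n_new)], dim=0)

torch.save(state, ckpt)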

model = torch.load("results/.../save/CKPT.../optimizer.ckpt")
model['state'][28]['exp_avg'] # Edit
model['state'][28]['exp_avg_sq'] # Edit
model['state'][29]['exp_avg'] # Edit
model['state'][29]['exp_avg_sq'] # Edit

model['state'][28]['exp_avg'] and model['state'][28]['exp_avg_sq'] have dimensions [n_classes, emb_dim]; since they are 2D, you’ll have to append n rows of zeros. model['state'][29]['exp_avg'] and model['state'][29]['exp_avg_sq'] have dimensions [n_classes], so these just need n zeros appended at the end.

All of these values are of type Tensor, so I had to convert them to lists, append, then convert back to tensors and assign them to their respective keys. I’m not sure how adding zeros would affect the performance of the model, or even if it’s the right way to go about it. A tensor-only variant is sketched below.
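As that variant, here is a minimal sketch for the optimizer checkpoint that pads the Adam moments in place with torch.cat, under the same assumptions as the classifier sketch above; note that the state indices 28 and 29 come from this particular run and may differ in other setups:

import torch

n_new = 3
ckpt = "results/custom_augment/save/CKPT+2022-09-26+05-27-32+00/optimizer.ckpt"
opt = torch.load(ckpt, map_location="cpu")

# Index 28 holds the classifier weight moments (2D), index 29 the bias moments (1D).
for idx in (28, 29):
    for key in ('exp_avg', 'exp_avg_sq'):
        t = opt['state'][idx][key]
        pad = torch.zeros(n_new, t.shape[1]) if t.dim() == 2 else torch.zeros(n_new)
        opt['state'][idx][key] = torch.cat([t, pad], dim=0)

torch.save(opt, ckpt)

Padding with torch.cat keeps everything as tensors, so no list conversion is needed; whether zero-initialized moments are the right choice is the same open question as for the weights.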

The training rate seemed to decrease every time training resumed; I still need to check that. I trained on a very limited dataset, but the score with which it verified speakers seemed decent, though this verification was on the audio files used for training, not on new ones.

0 reactions
eviltypha commented, Oct 7, 2022

Thanks for your help and support


