Conformer training from scratch not achieving the reported WER
Hello! Thanks for providing such a nice toolkit.
I am currently training Conformer-Small on LibriSpeech 960h from scratch, following the training recipe and instructions in examples/asr/conf/conformer/conformer_ctc_bpe.yaml.
However, I’m not able to reproduce the WER reported on this webpage: the reported values are 3.4/8.8, but our WER is only around 4.5/11 on the clean/other sets after ~300 epochs of training, and it doesn’t seem to decrease any further.
Here is the list of hyperparameters I’m using:
- d_model, n_heads, n_layers: 176, 16, 16
- tokenizer: WPE with vocab_size 128 (generated using the script in NeMo)
- spec_augment: freq_masks: 2, time_masks: 5, freq_width: 27, time_width: 0.05
- lr: 5.0, betas: [0.9, 0.98], weight_decay: 1e-3 (AdamW)
- warmup_steps: 10000 (Noam annealing)
- Batch size: 2048 (8 machines × local batch size 16 × accumulate_grad_batches 16 = 2048)

All the other hyperparameters are the same as the defaults; a rough sketch of how these settings map onto the config is below.
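Specifically, something like the following (the key names are how I read examples/asr/conf/conformer/conformer_ctc_bpe.yaml, so please verify them against your copy; the launch script, trainer setup, and data paths are omitted):

```python
# Rough sketch of the overrides on top of the default Conformer-CTC config.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/asr/conf/conformer/conformer_ctc_bpe.yaml")

cfg.model.encoder.d_model = 176
cfg.model.encoder.n_heads = 16
cfg.model.encoder.n_layers = 16

cfg.model.spec_augment.freq_masks = 2
cfg.model.spec_augment.time_masks = 5
cfg.model.spec_augment.freq_width = 27
cfg.model.spec_augment.time_width = 0.05

cfg.model.optim.lr = 5.0                      # AdamW
cfg.model.optim.betas = [0.9, 0.98]
cfg.model.optim.weight_decay = 1e-3
cfg.model.optim.sched.warmup_steps = 10000    # Noam annealing

cfg.model.train_ds.batch_size = 16            # per-machine batch size
cfg.trainer.accumulate_grad_batches = 16      # 8 machines x 16 x 16 = 2048 effective
```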
Here is one of the learning curves (validation WER on dev-clean) from our runs:
I’m wondering if we should change the hyperparameters to achieve the reported WER values, or if 300 epochs of training is simply not enough for the model to converge, or if there’s something we are missing. Any clues regarding the hyperparameters, the number of training epochs, etc. would be really appreciated!
Thank you.
That examples/asr/conf/conformer/conformer_ctc_bpe.yaml is the config we used to train the Conformer-CTC-Large model. You just need to update the parameters mentioned in the table you see in the file. The number of heads should be 4, not 16. Unfortunately, Conformer-CTC models for LS need a lot of time to converge to SOTA numbers, and among them the small version is the worst. We trained the Conformer-CTC models for 1000 epochs on LS with speed perturbation (SP) of ±10%, which is equivalent to roughly 3000 epochs on regular LS. However, the effect of SP is very limited and in most cases insignificant. If you have trained for just 300 epochs, those accuracies look OK; you need significantly more epochs.
Conformer-CTC vs Conformer-Transducer: Larger models converge faster, so it may be more time-effective to train a larger model to get better accuracy in fewer epochs, or to use a Conformer-Transducer model. Transducer models may need more GPU memory to train, but they converge significantly faster than CTC models. For example, a Conformer-Transducer-Large converges to less than 7% WER on LS test-other in fewer than 40 epochs, while a Conformer-CTC-Large converges to under 8% in around 100 epochs. (All epoch numbers we report for training Conformer on LS are with SP.)
Tokenizer: If I remember correctly, we used a SentencePiece unigram tokenizer with that model, but WPE should give the same results.
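For reference, a rough sketch of building such a tokenizer with the sentencepiece library directly (NeMo also ships a tokenizer-building script; the input file name below is just a placeholder for a text file with one transcript per line):

```python
# Rough sketch: train a unigram SentencePiece tokenizer with a vocab size of 128.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",            # placeholder: one transcript per line
    model_prefix="tokenizer_spe_unigram_v128",
    vocab_size=128,
    model_type="unigram",
)
```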
Some notes on training this model with a limited number of GPUs: We used an effective batch size of 2K because we wanted to be able to run many experiments and fine-tune parameters. You don’t need such a large batch size. Actually, you can reach better accuracy in fewer epochs with smaller batch sizes, since the model gets more updates. I suggest using an effective batch size of 256 (or at most 512), which means just 2 steps of gradient accumulation in your case. The large number of gradient-accumulation steps you used is not efficient in terms of convergence speed. In this case, you may use lr=2 with a weight decay of 0.0 to converge faster.
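In terms of the config, that would look something like this (same assumed key names as in the sketch in your post):

```python
# Sketch: smaller effective batch size of 8 machines x 16 local batch x 2 accumulation = 256.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/asr/conf/conformer/conformer_ctc_bpe.yaml")

cfg.trainer.accumulate_grad_batches = 2
cfg.model.optim.lr = 2.0
cfg.model.optim.weight_decay = 0.0
```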
Reproducibility: I trained these models more than 9 months ago, so I need some time to dig them up. I will check the details to make sure they match the config file we provided. In the meantime, you may load the pretrained checkpoint from NGC and check its config in model.cfg. I am also going to train this model again with batch sizes of 256 and 2K to make sure nothing has broken since I trained it. It may take a week or so to get the results; I will update you here when they are in.
I did some experiments with different tokenizers, batch sizes, learning rates, and weight decays to make sure the results are reproducible. The experiments are not done yet, but here are some of the initial results. Note that you need to multiply your number of epochs by 3 to match our epoch counts if you do not use speed perturbation.
Your numbers look OK to me. You just need more than 1800 epochs without SP to get to 8.8% on LS test-other. I suggest you switch to Conformer-Transducer or a larger Conformer-CTC (maybe the medium size). Even the medium size (~30M parameters) converges significantly faster than the small model. You may try fp16 with the medium model to improve memory efficiency; you may get to under 10% on test-other in 500 epochs without SP. Also, update your code to the main branch of NeMo, as you may be using the buggy spec augment; that bug limited the augmentation to just the beginning of each audio clip. If you decide to start a new training run, I suggest switching to SentencePiece unigram with a vocab size of 128 and using a batch size of 256 or 512 given your limited number of GPUs. You can also initialize the new model with one of the checkpoints you already trained to speed up convergence.
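A rough sketch of that kind of warm start (the checkpoint path is a placeholder, and cfg/trainer are assumed to be the config and Lightning trainer from your training script):

```python
# Rough sketch: initialize a new model's encoder from a previously trained checkpoint.
import nemo.collections.asr as nemo_asr

old_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("old_run/model.nemo")  # placeholder path
new_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# Copy the encoder weights; the decoder can be copied too if the tokenizer is unchanged.
new_model.encoder.load_state_dict(old_model.encoder.state_dict())
```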
By the way, you may store the configs of a pretrained model like this:
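(A sketch; the exact model name should be checked against the NGC catalog.)

```python
# Sketch: download the pretrained checkpoint from NGC and save its config to a YAML file.
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")
OmegaConf.save(model.cfg, "conformer_ctc_small_pretrained.yaml")
```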
The same works with restore_from:
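(Again a sketch; the .nemo path is a placeholder.)

```python
# Same idea, restoring from a locally saved checkpoint instead of NGC.
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

model = nemo_asr.models.ASRModel.restore_from("my_model.nemo")
OmegaConf.save(model.cfg, "my_model_config.yaml")
```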