Conformer training from scratch not achieving the reported WER
Hello! Thanks for providing such a nice toolkit.
I am currently training Conformer-Small on LibriSpeech 960h from scratch, following the training recipe and instructions in examples/asr/conf/conformer/conformer_ctc_bpe.yaml.
However, I’m not able to reproduce the WER reported on this webpage: the reported values are 3.4/8.8, but our WER is only around 4.5/11 on the clean/other sets after ~300 epochs of training, and it doesn’t seem to decrease any further.
Here is the list of hyperparameters I’m using:
- d_model, n_heads, n_layers: 176, 16, 16
- tokenizer: WPE with vocab_size 128 (generated using the script in NeMo)
- spec_augment: freq_masks: 2, time_masks: 5, freq_width: 27, time_width: 0.05
- lr: 5.0, betas: [0.9, 0.98], weight_decay: 1e-3 (AdamW)
- warmup_steps: 10000 (Noam annealing)
- Batch size: 2048 (8 machines × local batch size 16 × accumulate_grad_batches 16 = 2048)

All the other hyperparameters are the same as the defaults; a rough sketch of how these settings map onto the config is below.
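Specifically, something like the following (the key names are how I read examples/asr/conf/conformer/conformer_ctc_bpe.yaml, so please verify them against your copy; the launch script, trainer setup, and data paths are omitted):

```python
# Rough sketch of the overrides on top of the default Conformer-CTC config.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/asr/conf/conformer/conformer_ctc_bpe.yaml")

cfg.model.encoder.d_model = 176
cfg.model.encoder.n_heads = 16
cfg.model.encoder.n_layers = 16

cfg.model.spec_augment.freq_masks = 2
cfg.model.spec_augment.time_masks = 5
cfg.model.spec_augment.freq_width = 27
cfg.model.spec_augment.time_width = 0.05

cfg.model.optim.lr = 5.0                      # AdamW
cfg.model.optim.betas = [0.9, 0.98]
cfg.model.optim.weight_decay = 1e-3
cfg.model.optim.sched.warmup_steps = 10000    # Noam annealing

cfg.model.train_ds.batch_size = 16            # per-machine batch size
cfg.trainer.accumulate_grad_batches = 16      # 8 machines x 16 x 16 = 2048 effective
```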
Here is one of the learning curves (validation WER on dev-clean) from our runs:
I’m wondering if we should change the hyperparameters to achieve the reported WER values, or if 300 epochs of training is simply not enough for the model to converge, or if there’s something we are missing. Any clues regarding the hyperparameters, the number of training epochs, etc. would be really appreciated!
Thank you.
That examples/asr/conf/conformer/conformer_ctc_bpe.yaml is the config we used to train the Conformer-CTC-Large model. You just need to update the parameters mentioned in the table you see in the file. The number of heads should be 4, not 16. Unfortunately, Conformer-CTC models for LS need a lot of time to converge to SOTA numbers, and among them the small version is the worst. We trained the Conformer-CTC models for 1000 epochs on LS with speed perturbation (SP) of ±10%, which is equivalent to roughly 3000 epochs on regular LS. However, the effect of SP is very limited and in most cases insignificant. If you have trained for just 300 epochs, those accuracies look OK; you need significantly more epochs.
Conformer-CTC vs Conformer-Transducer: Larger models converge faster, so it may be more time-effective to train a larger model to get better accuracy in fewer epochs, or to use a Conformer-Transducer model. Transducer models may need more GPU memory to train, but they converge significantly faster than CTC models. For example, a Conformer-Transducer-Large converges to less than 7% WER on LS test-other in fewer than 40 epochs, while a Conformer-CTC-Large converges to under 8% in around 100 epochs. (All epoch numbers we report for training Conformer on LS are with SP.)
Tokenizer: If I remember correctly, we used a SentencePiece unigram tokenizer with that model, but WPE should give the same results.
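For reference, a rough sketch of building such a tokenizer with the sentencepiece library directly (NeMo also ships a tokenizer-building script; the input file name below is just a placeholder for a text file with one transcript per line):

```python
# Rough sketch: train a unigram SentencePiece tokenizer with a vocab size of 128.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",            # placeholder: one transcript per line
    model_prefix="tokenizer_spe_unigram_v128",
    vocab_size=128,
    model_type="unigram",
)
```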
Some notes on training this model with a limited number of GPUs: We used an effective batch size of 2K because we wanted to be able to run many experiments and fine-tune parameters. You don’t need such a large batch size. Actually, you can reach better accuracy in fewer epochs with smaller batch sizes, since the model gets more updates. I suggest using an effective batch size of 256 (or at most 512), which means just 2 steps of gradient accumulation in your case. The large number of gradient-accumulation steps you used is not efficient in terms of convergence speed. In this case, you may use lr=2 with a weight decay of 0.0 to converge faster.
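In terms of the config, that would look something like this (same assumed key names as in the sketch in your post):

```python
# Sketch: smaller effective batch size of 8 machines x 16 local batch x 2 accumulation = 256.
from omegaconf import OmegaConf

cfg = OmegaConf.load("examples/asr/conf/conformer/conformer_ctc_bpe.yaml")

cfg.trainer.accumulate_grad_batches = 2
cfg.model.optim.lr = 2.0
cfg.model.optim.weight_decay = 0.0
```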
Reproducibility: I trained these models more than 9 months ago, so I need some time to dig them up. I will check the details to make sure they match the config file we provided. In the meantime, you may load the pretrained checkpoint from NGC and check its config in model.cfg. I am also going to train this model again with batch sizes of 256 and 2K to make sure nothing has broken since I trained it. It may take a week or so to get the results; I will update you here when they are in.
I did some experiments with different tokenizers, batch sizes, learning rates, and weight decays to make sure the results are reproducible. The experiments are not done yet, but here are some of the initial results. Note that you need to multiply your number of epochs by 3 to match our epoch counts if you do not use speed perturbation.
Your numbers look OK to me. You just need more than 1800 epochs without SP to get to 8.8% on LS test-other. I suggest you switch to Conformer-Transducer or a larger Conformer-CTC (maybe the medium size). Even the medium size (~30M parameters) converges significantly faster than the small model. You may try fp16 with the medium model to improve memory efficiency; you may get to under 10% on test-other in 500 epochs without SP. Also, update your code to the main branch of NeMo, as you may be using the buggy spec augment; that bug limited the augmentation to just the beginning of each audio clip. If you decide to start a new training run, I suggest switching to SentencePiece unigram with a vocab size of 128 and using a batch size of 256 or 512 given your limited number of GPUs. You can also initialize the new model with one of the checkpoints you already trained to speed up convergence.
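A rough sketch of that kind of warm start (the checkpoint path is a placeholder, and cfg/trainer are assumed to be the config and Lightning trainer from your training script):

```python
# Rough sketch: initialize a new model's encoder from a previously trained checkpoint.
import nemo.collections.asr as nemo_asr

old_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("old_run/model.nemo")  # placeholder path
new_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg.model, trainer=trainer)

# Copy the encoder weights; the decoder can be copied too if the tokenizer is unchanged.
new_model.encoder.load_state_dict(old_model.encoder.state_dict())
```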
By the way, you may store the configs of a pretrained model like this:
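(A sketch; the exact model name should be checked against the NGC catalog.)

```python
# Sketch: download the pretrained checkpoint from NGC and save its config to a YAML file.
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")
OmegaConf.save(model.cfg, "conformer_ctc_small_pretrained.yaml")
```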
The same works with restore_from:
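(Again a sketch; the .nemo path is a placeholder.)

```python
# Same idea, restoring from a locally saved checkpoint instead of NGC.
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

model = nemo_asr.models.ASRModel.restore_from("my_model.nemo")
OmegaConf.save(model.cfg, "my_model_config.yaml")
```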