conformer-ctc small does not converge
Describe the bug
Not a bug; I am asking for help and training techniques. conformer-ctc small can’t converge on Librispeech 960h.
Basic environments:
- OS information: Ubuntu 18.04.1 LTS
- python version: [e.g. 3.8.5 (default, Sep 24 2020, 16:55:52) [GCC 7.5.0]]
- espnet version: [e.g. espnet 0.9.6]
- Git hash [e.g. c84da5743b7ef70c0c6212715859bdebdcf873b2]
- Commit date [e.g. Tue Sep 1 09:32:54 2020 -0400]
- pytorch version [e.g. pytorch 1.7.1]
Environments from torch.utils.collect_env:
Collecting environment information…
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.1 LTS
GCC version: (GCC) 7.5.0
CMake version: version 3.10.2
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE
GPU 1: Tesla V100-PCIE
GPU 2: Tesla V100-PCIE
GPU 3: Tesla V100-PCIE
Nvidia driver version: 470.63.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.7.1
[pip3] torch-complex==0.2.1
[conda] Could not collect
Task information:
- Task: ASR
- librispeech 960h
- ESPnet2
To Reproduce
Actually, I want to reproduce this model (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small) in ESPnet, but I can’t get it to converge.
I have tried several sets of parameters; these are my config files. The initial one is the default conformer YAML in ESPnet, with the CTC weight changed to 1.0 and the encoder dimensions changed. It did not converge, so I tried changing the config with reference to (https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml).
train_asr_conformer_ctc.txt train_asr_conformer_ctc_v1.txt train_asr_conformer_ctc_v2.txt
I also find that the loss stops going down at about epoch 2 (maybe it’s due to vanishing gradients? The same happens for v0 and v2; v1 diverges due to the large lr).
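One thing that may be worth checking for the large-lr divergence is how the `lr` value is interpreted by the scheduler. The sketch below compares two common warmup conventions: a Noam-style schedule, where the configured value is a multiplier scaled by `d_model ** -0.5`, and a peak-normalized warmup, where the configured value is the peak learning rate itself (as far as I understand, this is how ESPnet's `warmuplr` behaves). The numbers `d_model = 176`, `warmup = 10000` and `lr = 2.0` are illustrative assumptions, not values read from the attached configs.

```python
def noam_lr(step, factor, d_model, warmup):
    """Noam-style schedule: `factor` is a multiplier, not the peak LR."""
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


def peak_normalized_lr(step, base_lr, warmup):
    """Warmup schedule normalized so the LR equals `base_lr` at step == warmup."""
    return base_lr * warmup ** 0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Illustrative values only (assumptions, not taken from the attached configs).
D_MODEL = 176
WARMUP = 10000
LR = 2.0

# Both schedules reach their maximum at step == WARMUP.
peak_noam = noam_lr(WARMUP, LR, D_MODEL, WARMUP)
peak_normalized = peak_normalized_lr(WARMUP, LR, WARMUP)

print(f"Noam-style peak LR:      {peak_noam:.6f}")
print(f"Peak-normalized peak LR: {peak_normalized:.6f}")
print(f"Ratio:                   {peak_normalized / peak_noam:.0f}x")
```

Under these assumptions the same `lr` setting gives peak learning rates that differ by roughly three orders of magnitude, so an `lr` copied between toolkits may need rescaling before it is comparable.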
The training logs for v0 and v2 are here (there is a warning about no valid stats in the log; does this have any impact on the results?): trainv0.log trainv2.log
I have run the whole process only once; the other experiments were run from stage 10.
Do you have any suggestions? Thanks a lot if you could help.
Issue Analytics
- State:
- Created 2 years ago
- Comments:16 (3 by maintainers)
Top GitHub Comments
train_asr_conformer_ctc_v5.txt It should be this one; you can try it.
Could you upload your config file for this result?