
conformer-ctc small does not converge


Describe the bug

Not a bug; I am asking for help and training tips. conformer-ctc small does not converge on LibriSpeech 960h.

Basic environments:

OS information: Ubuntu 18.04.1 LTS
python version: [e.g. 3.8.5 (default, Sep 24 2020, 16:55:52) [GCC 7.5.0]]
espnet version: [e.g. espnet 0.9.6]
Git hash [e.g. c84da5743b7ef70c0c6212715859bdebdcf873b2]
- Commit date [e.g. Tue Sep 1 09:32:54 2020 -0400]
pytorch version [e.g. pytorch 1.7.1]

Environments from torch.utils.collect_env:

Collecting environment information…
PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.1 LTS
GCC version: (GCC) 7.5.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla V100-PCIE
GPU 1: Tesla V100-PCIE
GPU 2: Tesla V100-PCIE
GPU 3: Tesla V100-PCIE

Nvidia driver version: 470.63.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-wpe==0.0.1
[pip3] torch==1.7.1
[pip3] torch-complex==0.2.1
[conda] Could not collect

Task information:

Task: ASR
librispeech 960h
ESPnet2

To Reproduce

I want to reproduce this model (https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_small) in ESPnet, but I can’t get it to converge.

I have tried several parameter settings; these are my config files. The initial one is the default conformer YAML in ESPnet, with the CTC weight changed to 1.0 and the encoder dimensions changed. It did not converge, so I then tried changing the config with reference to the NeMo one (https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml).

train_asr_conformer_ctc.txt train_asr_conformer_ctc_v1.txt train_asr_conformer_ctc_v2.txt

I also found that the loss stops decreasing at about epoch 2 (maybe due to vanishing gradients?). The same happens for v0 and v2; v1 diverges because of its large learning rate.

Attached plot: loss_ctc
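One thing that may be worth checking when a learning rate is carried over from a NeMo config: as far as I understand, a Noam-style schedule additionally scales the configured value by d_model ** -0.5, while ESPnet2's WarmupLR peaks at exactly the configured lr. The sketch below only illustrates that difference. The two formulas are written from my reading of the schedulers, and the numbers (base lr 2.0, 10000 warmup steps, d_model 176) are assumed illustrative values, not taken from my config files.

```python
def espnet_warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """WarmupLR-style schedule (assumed form); step must be >= 1.

    Rises linearly, peaks at exactly base_lr when step == warmup_steps,
    then decays proportionally to step ** -0.5.
    """
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def noam_lr(base_lr: float, step: int, warmup_steps: int, d_model: int) -> float:
    """Noam-style schedule (assumed form); step must be >= 1.

    Same shape as above, but the peak is
    base_lr * d_model ** -0.5 * warmup_steps ** -0.5, far below base_lr.
    """
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def equivalent_warmup_base_lr(noam_base_lr: float, warmup_steps: int, d_model: int) -> float:
    """Hypothetical helper: base lr for the WarmupLR-style schedule that gives
    the same peak as the Noam-style schedule with the same warmup_steps."""
    return noam_base_lr * d_model ** -0.5 * warmup_steps ** -0.5


if __name__ == "__main__":
    # Illustrative values only (assumptions, not read from any real config).
    noam_base, warmup, d_model = 2.0, 10000, 176

    peak_noam = noam_lr(noam_base, warmup, warmup, d_model)
    peak_if_copied = espnet_warmup_lr(noam_base, warmup, warmup)
    matched = equivalent_warmup_base_lr(noam_base, warmup, d_model)

    print(f"Noam-style peak lr:              {peak_noam:.6f}")
    print(f"WarmupLR peak if lr is copied:   {peak_if_copied:.6f}")
    print(f"WarmupLR base lr matching Noam:  {matched:.6f}")
```

If the two schedulers really behave this way, copying the base value unchanged would give a peak learning rate orders of magnitude larger than intended, which would be consistent with a run that diverges early.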

The training logs for v0 and v2 are here (there is a warning about no valid stats in the log; does this have any impact on the results?): trainv0.log trainv2.log

I ran the whole pipeline only once; the other experiments were started from stage 10.

Do you have any suggestions? Thanks a lot for any help.