
I tried to run the LibriSpeech transformer recipe with 8 GPUs, but the word error rate remains very large.


I tried to run the LibriSpeech transformer recipe with 8 GPUs using DDP (https://github.com/speechbrain/speechbrain/blob/develop/recipes/LibriSpeech/ASR/transformer/train.py), but the word error rate remains very large (around 100%) despite 10 epochs of training.

epoch: 1, lr: 1.00e+00, steps: 4056, optimizer: Adam - train loss: 2.50e+02 - valid loss: 1.32e+02, valid ACC: 1.97e-01
epoch: 2, lr: 1.17e-04, steps: 12845, optimizer: Adam - train loss: 2.11e+02 - valid loss: 1.27e+02, valid ACC: 2.24e-01
epoch: 3, lr: 1.97e-04, steps: 21634, optimizer: Adam - train loss: 2.04e+02 - valid loss: 1.25e+02, valid ACC: 2.43e-01
epoch: 4, lr: 2.07e-04, steps: 30423, optimizer: Adam - train loss: 1.98e+02 - valid loss: 1.23e+02, valid ACC: 2.58e-01
epoch: 5, lr: 1.82e-04, steps: 39212, optimizer: Adam - train loss: 1.93e+02 - valid loss: 1.21e+02, valid ACC: 2.67e-01
epoch: 6, lr: 1.65e-04, steps: 48001, optimizer: Adam - train loss: 1.89e+02 - valid loss: 1.21e+02, valid ACC: 2.71e-01
epoch: 7, lr: 1.51e-04, steps: 56790, optimizer: Adam - train loss: 1.85e+02 - valid loss: 1.21e+02, valid ACC: 2.70e-01
epoch: 8, lr: 1.41e-04, steps: 65579, optimizer: Adam - train loss: 1.82e+02 - valid loss: 1.22e+02, valid ACC: 2.67e-01
epoch: 9, lr: 1.32e-04, steps: 74368, optimizer: Adam - train loss: 1.79e+02 - valid loss: 1.23e+02, valid ACC: 2.64e-01
epoch: 10, lr: 1.25e-04, steps: 83157, optimizer: Adam - train loss: 1.76e+02 - valid loss: 1.24e+02, valid ACC: 2.61e-01, valid WER: 96.31

I ran the following command.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/transformer.yaml --distributed_launch --distributed_backend='nccl'
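
As far as I know, SpeechBrain also accepts hyperparameter overrides directly on the command line, so the reduced settings could be passed without editing the YAML. This is only a hedged example and assumes the flag names match the keys in hparams/transformer.yaml (batch_size and gradient_accumulation):

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/transformer.yaml --distributed_launch --distributed_backend='nccl' --batch_size=4 --gradient_accumulation=1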

I reduced batch_size from 16 to 4 to avoid an out-of-memory error and changed gradient_accumulation from 4 to 1, following https://github.com/speechbrain/speechbrain/issues/899. I also tried training with gradient_accumulation set to 4 and 2, but the results were no different; a quick sketch of the effective batch size under each configuration is shown below.
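
The small helper below is just for illustration (it is not part of SpeechBrain); with DDP, each GPU processes its own batch_size per step and gradients are accumulated over gradient_accumulation steps before each optimizer update, so the effective batch size is the product of the three:

def effective_batch(batch_size, gradient_accumulation, n_gpus):
    # Effective batch size seen by each optimizer update under DDP.
    return batch_size * gradient_accumulation * n_gpus

print(effective_batch(16, 4, 1))  # 64 -> the default settings on one GPU
print(effective_batch(4, 1, 8))   # 32 -> the configuration I ran here
print(effective_batch(4, 2, 8))   # 64 -> matches the default effective batch size

My environment is as follows.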

PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27

Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1063-aws-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB

Nvidia driver version: 460.106.00

The SpeechBrain commit hash is d6bfe13. Could you give me any hints? Thanks for your help.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15

Top GitHub Comments

2 reactions
TParcollet commented, Mar 24, 2022

Hey folks, we updated the whole LibriSpeech recipe. Now the model should be 1. better; 2. much smaller and therefore easier to train with fewer GPUs 😃

1 reaction
ken57 commented, May 17, 2022

Thank you for your very kind support. We were able to train models using the new script, so I will close this issue.
