I tried to run the LibriSpeech transformer recipe with 8 GPUs, but the word error rate remains very large.
I tried to run the LibriSpeech transformer recipe with 8 GPUs using DDP (https://github.com/speechbrain/speechbrain/blob/develop/recipes/LibriSpeech/ASR/transformer/train.py), but the word error rate remains very large (around 100%) despite 10 epochs of training.
epoch: 1, lr: 1.00e+00, steps: 4056, optimizer: Adam - train loss: 2.50e+02 - valid loss: 1.32e+02, valid ACC: 1.97e-01
epoch: 2, lr: 1.17e-04, steps: 12845, optimizer: Adam - train loss: 2.11e+02 - valid loss: 1.27e+02, valid ACC: 2.24e-01
epoch: 3, lr: 1.97e-04, steps: 21634, optimizer: Adam - train loss: 2.04e+02 - valid loss: 1.25e+02, valid ACC: 2.43e-01
epoch: 4, lr: 2.07e-04, steps: 30423, optimizer: Adam - train loss: 1.98e+02 - valid loss: 1.23e+02, valid ACC: 2.58e-01
epoch: 5, lr: 1.82e-04, steps: 39212, optimizer: Adam - train loss: 1.93e+02 - valid loss: 1.21e+02, valid ACC: 2.67e-01
epoch: 6, lr: 1.65e-04, steps: 48001, optimizer: Adam - train loss: 1.89e+02 - valid loss: 1.21e+02, valid ACC: 2.71e-01
epoch: 7, lr: 1.51e-04, steps: 56790, optimizer: Adam - train loss: 1.85e+02 - valid loss: 1.21e+02, valid ACC: 2.70e-01
epoch: 8, lr: 1.41e-04, steps: 65579, optimizer: Adam - train loss: 1.82e+02 - valid loss: 1.22e+02, valid ACC: 2.67e-01
epoch: 9, lr: 1.32e-04, steps: 74368, optimizer: Adam - train loss: 1.79e+02 - valid loss: 1.23e+02, valid ACC: 2.64e-01
epoch: 10, lr: 1.25e-04, steps: 83157, optimizer: Adam - train loss: 1.76e+02 - valid loss: 1.24e+02, valid ACC: 2.61e-01, valid WER: 96.31
I ran the following command.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 train.py hparams/transformer.yaml --distributed_launch --distributed_backend='nccl'
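(For context, here is a minimal sketch of what each of the 8 processes launched by this command roughly does under NCCL DDP. This is not the recipe's code, just an illustration of the mechanism, and it assumes PyTorch 1.10's default behavior of passing --local_rank to every launched process.)

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)   # one GPU per process
dist.init_process_group(backend="nccl")  # launcher provides MASTER_ADDR/PORT, RANK, WORLD_SIZE

model = torch.nn.Linear(80, 80).cuda(args.local_rank)
ddp_model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])
# gradients are all-reduced across the 8 ranks on every backward pass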
I reduced batch_size from 16 to 4 to avoid an out-of-memory error and changed gradient_accumulation from 4 to 1, following https://github.com/speechbrain/speechbrain/issues/899. I also tried training with gradient_accumulation set to 4 and to 2, but the results were no different (a rough effective-batch-size comparison is sketched at the end of this post). My environment is as follows.
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27
Python version: 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:59:51) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-1063-aws-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
GPU 4: Tesla V100-SXM2-16GB
GPU 5: Tesla V100-SXM2-16GB
GPU 6: Tesla V100-SXM2-16GB
GPU 7: Tesla V100-SXM2-16GB
Nvidia driver version: 460.106.00
The SpeechBrain commit hash is d6bfe13. Could you give me any hint? Thanks for your help.
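P.S. For reference, here is the rough effective-batch-size arithmetic for the settings above, assuming batch_size in the yaml is per process (per GPU); the exact semantics in the recipe may differ, so treat this only as an estimate.

# Hypothetical comparison (assumes batch_size is per GPU/process)
def effective_batch(batch_size, n_gpus, grad_accumulation):
    return batch_size * n_gpus * grad_accumulation

reference = effective_batch(batch_size=16, n_gpus=1, grad_accumulation=4)  # 64 (single-GPU defaults)
this_run  = effective_batch(batch_size=4,  n_gpus=8, grad_accumulation=1)  # 32 (the DDP run above)

Under this assumption my run uses roughly half the reference effective batch, which could change convergence behavior, although on its own I would not expect it to leave the WER near 100%.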
Top GitHub Comments
Hey folks, we updated the whole LibriSpeech recipe. The model should now be 1. better; 2. much smaller and therefore easier to train with fewer GPUs 😃
Thank you for your very kind support. We were able to train models using the new script, so I will close this issue.