
Wav2Vec 2.0 pretraining stuck

See original GitHub issue

❓ Questions and Help

What is your question?

Why do the loss and accuracy plots look like this? (Two plots of training loss and accuracy were attached to the original issue.)

Code

Running hydra train as shown in readme.md for the w2v2 base model. Changes to the config:

restore_file: /root/wav2vec_small.pt
reset_dataloader: true
reset_lr_scheduler: true
reset_meters: true
reset_optimizer: true
num_workers: 24
max_tokens: 1500000
ddp_backend: no_c10d

I parsed the hydra_train.log file from my pretraining run and plotted the logged metrics.
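For anyone who wants to reproduce such a plot, here is a minimal sketch of the parsing step. It assumes the run used fairseq's JSON log format (common.log_format: json), so that each train_inner log line ends with a JSON dictionary of stats; the file name and the loss/accuracy keys are assumptions based on that format, not taken from the issue.

```python
# Minimal sketch: parse fairseq's hydra_train.log and plot loss/accuracy.
# Assumes common.log_format=json, so stat lines end with a JSON dict, e.g.
#   ... | train_inner | {"epoch": 1, "loss": "6.658", "accuracy": "0.12", ...}
# The file path and metric keys below are assumptions, not from the issue.
import json
import matplotlib.pyplot as plt

steps, losses, accuracies = [], [], []

with open("hydra_train.log") as f:
    for line in f:
        if "train_inner" not in line:
            continue
        # The JSON payload is the part of the line starting at the first "{".
        brace = line.find("{")
        if brace == -1:
            continue
        try:
            stats = json.loads(line[brace:])
        except json.JSONDecodeError:
            continue
        if "loss" not in stats:
            continue
        steps.append(len(steps))
        losses.append(float(stats["loss"]))
        accuracies.append(float(stats.get("accuracy", "nan")))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(steps, losses)
ax1.set_xlabel("logged step")
ax1.set_ylabel("train loss")
ax2.plot(steps, accuracies)
ax2.set_xlabel("logged step")
ax2.set_ylabel("train accuracy")
fig.tight_layout()
plt.show()
```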

What have you tried?

.

What’s your environment?

  • fairseq Version (e.g., 1.0 or master): master
  • PyTorch Version (e.g., 1.0): 1.9.0a0+c3d40fd
  • OS (e.g., Linux): Ubuntu 20.04 LTS
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: RTX 3090
  • Any other relevant information: the strange shape seems to appear once loss_1 and loss_2 reach 0.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12

Top GitHub Comments

2 reactions
jubick1337 commented, Jun 30, 2021

I went ahead and checked the stable version 0.10.2, and the process seems to be the same: the loss freezes at 6.658 during training. Validation looks quite good, though; the validation loss drops to low values and the accuracy is high (about 0.02 and 0.99 respectively). Can someone explain this strange behaviour? Is it OK to use a checkpoint like this? P.S. The valid set was set to 0.01 of the original dataset.

1 reaction
bmilde commented, Feb 15, 2022

Hi @Etoye,

I managed to solve it. I had to reduce the learning rate and also introduce gradient accumulation (simulating multiple GPUs). Specifically, in the optimization part of the base model config I have:

optimization:
  max_update: 400000
  lr: [0.0002]
  update_freq: [8]

I was able to train a base model successfully with this on just one GPU with fp16. I think the default learning rate was 0.0005.
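To put the update_freq change in perspective, here is a back-of-the-envelope sketch of the effective batch size per optimizer step. The 64-GPU / 1.4M-token reference values for the released base recipe, and the reuse of the 1.5M max_tokens from the original post, are assumptions for illustration only.

```python
# Rough arithmetic for the effective batch size (audio tokens per optimizer
# step). The reference values for the released base recipe (64 GPUs,
# max_tokens=1_400_000, update_freq=[1]) are assumptions, not from the issue.
def tokens_per_update(max_tokens: int, n_gpus: int, update_freq: int) -> int:
    """Audio tokens consumed per optimizer step."""
    return max_tokens * n_gpus * update_freq

reference = tokens_per_update(max_tokens=1_400_000, n_gpus=64, update_freq=1)
single_gpu = tokens_per_update(max_tokens=1_500_000, n_gpus=1, update_freq=8)

print(f"assumed reference recipe : {reference:,} tokens/update")
print(f"1x GPU, update_freq=8    : {single_gpu:,} tokens/update")
print(f"ratio                    : {single_gpu / reference:.3f}")
# With a much smaller effective batch than the multi-GPU recipe, lowering the
# peak LR (here 0.0002 instead of the reported default of 0.0005) is the
# usual adjustment.
```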
