Wav2vec 2.0 fine-tuning failure
See original GitHub issue (fairseq #3451)
🐛 Bug
I'm trying to fine-tune the pre-trained wav2vec 2.0 model using fairseq-hydra-train
(according to the guide) and am getting a RuntimeError saying that DistributedDataParallel wasn't properly initialized.
To Reproduce
- Run the command
fairseq-hydra-train \
distributed_training.distributed_port=12345 \
task.data=/home/ppavlov/wav2vec \
model.w2v_path=/home/ppavlov/wav2vec/wav2vec_small.pt \
distributed_training.distributed_world_size=2 \
+optimization.update_freq='[12]' \
--config-dir fairseq/examples/wav2vec/config/finetuning \
--config-name base_1h
- See error
Traceback (most recent call last):
File "/home/ppavlov/wav2vec/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/home/ppavlov/wav2vec/fairseq/fairseq/distributed/utils.py", line 369, in call_main
main(cfg, **kwargs)
File "/home/ppavlov/wav2vec/fairseq/fairseq_cli/train.py", line 128, in main
trainer = Trainer(cfg, task, model, criterion, quantizer)
File "/home/ppavlov/wav2vec/fairseq/fairseq/trainer.py", line 144, in __init__
if self.data_parallel_rank == 0:
File "/home/ppavlov/wav2vec/fairseq/fairseq/trainer.py", line 177, in data_parallel_rank
return distributed_utils.get_data_parallel_rank()
File "/home/ppavlov/wav2vec/fairseq/fairseq/distributed/utils.py", line 463, in get_data_parallel_rank
return get_rank(get_data_parallel_group())
File "/home/ppavlov/wav2vec/fairseq/fairseq/distributed/utils.py", line 405, in get_rank
return dist.get_rank(group=group)
File "/home/ppavlov/wav2vec/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 688, in get_rank
default_pg = _get_default_group()
File "/home/ppavlov/wav2vec/venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 347, in _get_default_group
raise RuntimeError("Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Environment
- fairseq Version: master
- PyTorch Version: 1.8.1
- OS: Ubuntu 18.04
- How you installed fairseq: pip install --editable .
- Python version: 3.8
- CUDA/cuDNN version: 11.2
- GPU models and configuration: 2xTesla T4
Working directory content
- data/
- dev_other.ltr
- dev_other.tsv
- dev_other.wrd
- dict.ltr.txt
- fairseq/
- outputs/
- train.ltr
- train.tsv
- train.wrd
- venv/
- wav2vec_small.pt
Top GitHub Comments
Sorry, I've found the solution: distributed_training.distributed_port in my case (single node) is redundant.
Yes, it would be really great if the documentation was more comprehensive.
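Following the first comment above, here is a sketch of the adjusted command for a single-node, 2-GPU run: it is the original invocation with only the distributed_training.distributed_port override dropped. Paths and update_freq are kept from the report above and would need to match your own setup.
# single-node fine-tuning; no distributed_training.distributed_port override
fairseq-hydra-train \
task.data=/home/ppavlov/wav2vec \
model.w2v_path=/home/ppavlov/wav2vec/wav2vec_small.pt \
distributed_training.distributed_world_size=2 \
+optimization.update_freq='[12]' \
--config-dir fairseq/examples/wav2vec/config/finetuning \
--config-name base_1h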