Wav2vec doesn't work with DDP
Hi @TParcollet, @mravanelli,
I just noticed that our wav2vec2 code doesn't work with DDP:
(gpu-test2) aheba@koios:~/test_gpu/speechbrain-abdel/recipes/TIMIT/ASR/seq2seq$ python -m torch.distributed.launch --nproc_per_node=2 train_with_wav2vec2.py hparams/train_with_wav2vec2.yaml --distributed_launch --distributed_backend='nccl'
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/augment_noise_CRDNN/1234
timit_prepare - Skipping preparation, completed in previous run.
speechbrain.dataio.encoder - Load called, but CTCTextEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Info: auto_mix_prec arg overridden by command line input
speechbrain.core - 318.8M trainable parameters in ASR
speechbrain.utils.checkpoints - Loading a checkpoint from results/augment_noise_CRDNN/1234/save/CKPT+2021-04-28+17-45-44+00
speechbrain.utils.epoch_loop - Going into epoch 3
0%|▉ | 1/231 [00:00<03:27, 1.11it/s, train_loss=2.58]Traceback (most recent call last):
  File "train_with_wav2vec2.py", line 381, in <module>
0%|▉ | 1/231 [00:00<03:34, 1.07it/s, train_loss=2.58]
    asr_brain.fit(
  File "/home/chollet/test_gpu/speechbrain-abdel/speechbrain/core.py", line 1013, in fit
    loss = self.fit_batch(batch)
  File "train_with_wav2vec2.py", line 199, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "train_with_wav2vec2.py", line 40, in compute_forward
    feats = self.modules.wav2vec2(wavs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train_with_wav2vec2.py", line 381, in <module>
    asr_brain.fit(
  File "/home/chollet/test_gpu/speechbrain-abdel/speechbrain/core.py", line 1013, in fit
    loss = self.fit_batch(batch)
  File "train_with_wav2vec2.py", line 199, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "train_with_wav2vec2.py", line 40, in compute_forward
    feats = self.modules.wav2vec2(wavs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Killing subprocess 336626
Killing subprocess 336627
Traceback (most recent call last):
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chollet/anaconda3/envs/gpu-test2/bin/python', '-u', 'train_with_wav2vec2.py', '--local_rank=1', 'hparams/train_with_wav2vec2.yaml', '--distributed_launch', '--distributed_backend=nccl']' returned non-zero exit status 1.
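The RuntimeError is DDP's reducer complaining that some parameters never contributed to the loss. With wav2vec2 this is a plausible failure mode: parts of the pretrained model (for instance the quantizer branch used only during pretraining) can stay out of the forward pass when only the encoder output is used as features. The error message itself names the first workaround: construct DistributedDataParallel with find_unused_parameters=True. Below is a minimal, generic sketch of such a wrapping, not the actual SpeechBrain code path (SpeechBrain does its own DDP wrapping inside core.py); the helper name and the way the local rank is obtained are assumptions for illustration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(module: torch.nn.Module, local_rank: int) -> DDP:
    """Hypothetical helper: wrap a module in DDP while tolerating unused params.

    `local_rank` is whatever the launcher hands each process (the --local_rank
    argument with torch.distributed.launch, or the LOCAL_RANK env var).
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    module = module.to(local_rank)
    # find_unused_parameters=True lets the reducer skip parameters that did not
    # take part in producing the loss in a given iteration, which is exactly
    # what the RuntimeError above complains about.
    return DDP(module, device_ids=[local_rank], find_unused_parameters=True)
```

Note that find_unused_parameters=True adds an extra traversal of the autograd graph each iteration, so it is a workaround rather than an explanation of why those parameters are unused in the first place.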
Top GitHub Comments
@Choiuijin1125 please use DP instead of DDP; it works well for now. @aheba is fixing this issue. Actually, after discussing with HuggingFace, it appears that they are facing the same error, ahah.
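For context, the DP fallback suggested above means single-process multi-GPU with torch.nn.DataParallel instead of one process per GPU; unused parameters are harmless there because no gradient-bucket reducer is involved. A minimal sketch, using a tiny stand-in module rather than the real ASR model (in the recipes the switch is made through SpeechBrain's own run options rather than by hand):

```python
import torch

# Hypothetical sketch of the DP fallback: replicate the model inside a single
# process instead of running one process per GPU as DDP does.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU())
if torch.cuda.device_count() > 1:
    # Splits each batch across the visible GPUs and gathers outputs on GPU 0.
    model = torch.nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```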
Very strange behavior. I'm tracking the issue and trying to identify the unused params: https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/3
So, I'll keep you updated on #713.
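For anyone who wants to hunt the offending parameters themselves, the approach in the linked thread boils down to running one forward/backward pass on a plain, un-wrapped copy of the model and listing every parameter that never received a gradient. A hypothetical helper along those lines (the function name is illustrative, not part of SpeechBrain or PyTorch):

```python
import torch

def report_unused_parameters(module: torch.nn.Module, loss: torch.Tensor) -> list:
    """Hypothetical debugging helper (same idea as the forum thread above).

    Call it on a non-DDP replica: any parameter whose .grad is still None
    after backward() never took part in producing the loss, and it is exactly
    these parameters that trip DDP's reducer.
    """
    loss.backward()
    unused = [
        name
        for name, param in module.named_parameters()
        if param.requires_grad and param.grad is None
    ]
    for name in unused:
        print("unused parameter:", name)
    return unused
```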