Wav2vec doesn't work with DDP
Hi @TParcollet, @mravanelli,
I just noticed that our wav2vec2 code doesn't work with DDP:
(gpu-test2) aheba@koios:~/test_gpu/speechbrain-abdel/recipes/TIMIT/ASR/seq2seq$ python -m torch.distributed.launch --nproc_per_node=2 train_with_wav2vec2.py hparams/train_with_wav2vec2.yaml --distributed_launch --distributed_backend='nccl'
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/augment_noise_CRDNN/1234
timit_prepare - Skipping preparation, completed in previous run.
speechbrain.dataio.encoder - Load called, but CTCTextEncoder is not empty. Loaded data will overwrite everything. This is normal if there is e.g. an unk label defined at init.
speechbrain.core - Info: auto_mix_prec arg overridden by command line input
speechbrain.core - 318.8M trainable parameters in ASR
speechbrain.utils.checkpoints - Loading a checkpoint from results/augment_noise_CRDNN/1234/save/CKPT+2021-04-28+17-45-44+00
speechbrain.utils.epoch_loop - Going into epoch 3
0%|▉ | 1/231 [00:00<03:27, 1.11it/s, train_loss=2.58]Traceback (most recent call last):
  File "train_with_wav2vec2.py", line 381, in <module>
0%|▉ | 1/231 [00:00<03:34, 1.07it/s, train_loss=2.58]
    asr_brain.fit(
  File "/home/chollet/test_gpu/speechbrain-abdel/speechbrain/core.py", line 1013, in fit
    loss = self.fit_batch(batch)
  File "train_with_wav2vec2.py", line 199, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "train_with_wav2vec2.py", line 40, in compute_forward
    feats = self.modules.wav2vec2(wavs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
speechbrain.core - Exception:
Traceback (most recent call last):
  File "train_with_wav2vec2.py", line 381, in <module>
    asr_brain.fit(
  File "/home/chollet/test_gpu/speechbrain-abdel/speechbrain/core.py", line 1013, in fit
    loss = self.fit_batch(batch)
  File "train_with_wav2vec2.py", line 199, in fit_batch
    outputs = self.compute_forward(batch, sb.Stage.TRAIN)
  File "train_with_wav2vec2.py", line 40, in compute_forward
    feats = self.modules.wav2vec2(wavs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 692, in forward
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
    if self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Killing subprocess 336626
Killing subprocess 336627
Traceback (most recent call last):
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/chollet/anaconda3/envs/gpu-test2/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chollet/anaconda3/envs/gpu-test2/bin/python', '-u', 'train_with_wav2vec2.py', '--local_rank=1', 'hparams/train_with_wav2vec2.yaml', '--distributed_launch', '--distributed_backend=nccl']' returned non-zero exit status 1.
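The RuntimeError is DDP's reducer complaining that some parameters never contributed to the loss. With wav2vec2 this is a plausible failure mode: parts of the pretrained model (for instance the quantizer branch used only during pretraining) can stay out of the forward pass when only the encoder output is used as features. The error message itself names the first workaround: construct DistributedDataParallel with find_unused_parameters=True. Below is a minimal, generic sketch of such a wrapping, not the actual SpeechBrain code path (SpeechBrain does its own DDP wrapping inside core.py); the helper name and the way the local rank is obtained are assumptions for illustration.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_ddp(module: torch.nn.Module, local_rank: int) -> DDP:
    """Hypothetical helper: wrap a module in DDP while tolerating unused params.

    `local_rank` is whatever the launcher hands each process (the --local_rank
    argument with torch.distributed.launch, or the LOCAL_RANK env var).
    """
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    module = module.to(local_rank)
    # find_unused_parameters=True lets the reducer skip parameters that did not
    # take part in producing the loss in a given iteration, which is exactly
    # what the RuntimeError above complains about.
    return DDP(module, device_ids=[local_rank], find_unused_parameters=True)
```

Note that find_unused_parameters=True adds an extra traversal of the autograd graph each iteration, so it is a workaround rather than an explanation of why those parameters are unused in the first place.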
Top GitHub Comments
@Choiuijin1125 please use DP instead of DDP; it works well for now. @aheba is fixing this issue. Actually, after discussing with HuggingFace, it appears that they are facing the same error, ahah.
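For context, the DP fallback suggested above means single-process multi-GPU with torch.nn.DataParallel instead of one process per GPU; unused parameters are harmless there because no gradient-bucket reducer is involved. A minimal sketch, using a tiny stand-in module rather than the real ASR model (in the recipes the switch is made through SpeechBrain's own run options rather than by hand):

```python
import torch

# Hypothetical sketch of the DP fallback: replicate the model inside a single
# process instead of running one process per GPU as DDP does.
model = torch.nn.Sequential(torch.nn.Linear(80, 256), torch.nn.ReLU())
if torch.cuda.device_count() > 1:
    # Splits each batch across the visible GPUs and gathers outputs on GPU 0.
    model = torch.nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```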
Very strange behavior. I'm tracking the issue and trying to identify the unused params: https://discuss.pytorch.org/t/how-to-find-the-unused-parameters-in-network/63948/3
So, I'll keep you updated on #713.
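For anyone who wants to hunt the offending parameters themselves, the approach in the linked thread boils down to running one forward/backward pass on a plain, un-wrapped copy of the model and listing every parameter that never received a gradient. A hypothetical helper along those lines (the function name is illustrative, not part of SpeechBrain or PyTorch):

```python
import torch

def report_unused_parameters(module: torch.nn.Module, loss: torch.Tensor) -> list:
    """Hypothetical debugging helper (same idea as the forum thread above).

    Call it on a non-DDP replica: any parameter whose .grad is still None
    after backward() never took part in producing the loss, and it is exactly
    these parameters that trip DDP's reducer.
    """
    loss.backward()
    unused = [
        name
        for name, param in module.named_parameters()
        if param.requires_grad and param.grad is None
    ]
    for name in unused:
        print("unused parameter:", name)
    return unused
```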