RuntimeError while running run_common_voice.py (XLSR-Wav2Vec2 fine-tuning week)
Environment info
- transformers version: 4.5.0.dev0 (I tried running it on 4.4.0 as well; it gave the same error)
- Platform: Ubuntu (running on a virtual machine)
- Python version: 3.8
- PyTorch version (GPU?): 1.6.0
- Using GPU in script?: yes (running this script on GPUs)
- Using distributed or parallel set-up in script?: Distributed
Who can help
@patrickvonplaten (as per the message on slack group)
Information
Model I am using (Bert, XLNet …): Wav2Vec2 (facebook/wav2vec2-large-xlsr-53)
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
I tried running both the official command and a modified script (the run command changed based on the language).
The task I am working on is:
- the Common Voice dataset (ta)
To reproduce
Steps to reproduce the behavior:
- Run the common voice script from here
- For the multi-GPU setup, I used the following command:
```bash
# use --dataset_config_name to specify the language code
python -m torch.distributed.launch \
    --nproc_per_node 4 run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \
    --output_dir=./wav2vec2-large-xlsr-turkish-demo \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --save_steps="400" \
    --eval_steps="400" \
    --logging_steps="400" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --fp16 \
    --group_by_length \
    --do_train --do_eval
```
Error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument 'find_unused_parameters=True' to 'torch.nn.parallel.DistributedDataParallel'; (2) making sure all 'forward' function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's 'forward' function. Please include the loss function and the structure of the return value of 'forward' of your module when reporting this issue (e.g. list, dict, iterable).
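For reference, the workaround the error message itself points at looks like the minimal sketch below. This is not the Trainer's or the example script's code, just an illustration of find_unused_parameters=True with a placeholder model. A common source of unused parameters with wav2vec2 is LayerDrop (--layerdrop="0.1" above), which randomly skips transformer layers so some weights receive no gradient in a given step. When training through the Trainer rather than wrapping the model yourself, the related switch is the ddp_find_unused_parameters training argument, if your transformers version has it.

```python
# Minimal sketch of option (1) from the error message -- passing
# find_unused_parameters=True to DistributedDataParallel. Placeholder model;
# launch with: python -m torch.distributed.launch --nproc_per_node 2 ddp_sketch.py
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by torch.distributed.launch
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

model = torch.nn.Linear(16, 16).cuda(args.local_rank)  # stand-in for the real Wav2Vec2ForCTC model
ddp_model = DDP(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True,  # tolerate parameters that receive no gradient in a step
)
```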
Expected behavior
The model should train without any error.
Top GitHub Comments
@raja1196 I think I have found the bug. Could you try setting gradient_checkpointing to False in run_common_voice.py, as written below:
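(The snippet this comment refers to did not survive in this copy of the thread. Below is a hedged reconstruction of the kind of change meant, assuming run_common_voice.py defines gradient_checkpointing as a field on its ModelArguments dataclass; the exact field name, default, and help text may differ in your checkout.)

```python
# Hypothetical reconstruction of the referenced edit in run_common_voice.py:
# change the ModelArguments default so gradient checkpointing is disabled.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelArguments:
    gradient_checkpointing: Optional[bool] = field(
        default=False,  # previously True; disabling it avoids the DDP unused-parameter error above
        metadata={"help": "Whether to use gradient checkpointing (saves memory at the cost of a slower backward pass)."},
    )
```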
Then run the script without the --gradient_checkpointing flag, as follows:
```bash
# use --dataset_config_name to specify the language code
python -m torch.distributed.launch \
    --nproc_per_node 4 run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \
    --output_dir=./wav2vec2-large-xlsr-turkish-demo \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --save_steps="400" \
    --eval_steps="400" \
    --logging_steps="400" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --fp16 \
    --group_by_length \
    --do_train --do_eval
```
This solves the problem in my case, and I am now able to run it with two GPUs. If it works for you, I will open a PR.
I am experiencing this error too. CUDA 11.2, 4x T4 (16 GB)
--dataset_config_name="ru"