
RuntimeError while running run_common_voice.py (XLSR wav2vec fine-tuning week)

See original GitHub issue

Environment info

  • transformers version: 4.5.0.dev0 (I tried running it on 4.4.0 as well; it gave the same error)
  • Platform: Ubuntu (running on a virtual machine)
  • Python version: 3.8
  • PyTorch version (GPU?): 1.6.0
  • Using GPU in script?: yes (running run_common_voice.py)
  • Using distributed or parallel set-up in script?: Distributed

Who can help

@patrickvonplaten (as per the message on the Slack group)

Information

Model I am using (Bert, XLNet …): facebook/wav2vec2-large-xlsr-53

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

I tried running both the official command and a modified script (the run command changed based on the language).

The task I am working on is:

  • Common Voice dataset (ta)

To reproduce

Steps to reproduce the behavior:

  1. Run the Common Voice fine-tuning script (run_common_voice.py) from here.
  2. For the multi-GPU setup I used this command:

python -m torch.distributed.launch \
    --nproc_per_node 4 run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \ # use this argument to specify the language code
    --output_dir=./wav2vec2-large-xlsr-turkish-demo \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --save_steps="400" \
    --eval_steps="400" \
    --logging_steps="400" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --gradient_checkpointing \
    --fp16 \
    --group_by_length \
    --do_train --do_eval

Error:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument 'find_unused_parameters=True' to 'torch.nn.parallel.DistributedDataParallel'; (2) making sure all 'forward' function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's 'forward' function. Please include the loss function and the structure of the return value of 'forward' of your module when reporting this issue (e.g. list, dict, iterable).
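
For context on remedy (1) in that message: find_unused_parameters is a keyword argument of torch.nn.parallel.DistributedDataParallel itself (the Trainer used by run_common_voice.py wraps the model for you, so this is not the script's code path). Below is a minimal plain-PyTorch sketch of the option, with nn.Linear as a hypothetical stand-in model:

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch passes --local_rank to every worker process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# The launcher also sets the rendezvous environment variables used here.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

model = nn.Linear(10, 2).cuda(args.local_rank)  # stand-in for the real model

# find_unused_parameters=True tells DDP to tolerate parameters that did not
# take part in producing the loss in a given forward pass (remedy (1) above).
ddp_model = DDP(
    model,
    device_ids=[args.local_rank],
    output_device=args.local_rank,
    find_unused_parameters=True,
)

The comments below take a different route and avoid the error by turning off gradient checkpointing instead.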

Expected behavior

The model would train without any error.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
ivangtorre commented, Mar 25, 2021

@raja1196 I think I have found the bug. Could you try changing the default of gradient_checkpointing in run_common_voice.py to False, as written below:

gradient_checkpointing: Optional[bool] = field(
    default=False,
    metadata={
        "help": "If True, use gradient checkpointing to save memory at the expense of slower backward pass."
    },
)

And then run the script without --gradient_checkpointing, as follows:

python -m torch.distributed.launch \
    --nproc_per_node 4 run_common_voice.py \
    --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
    --dataset_config_name="tr" \ # use this argument to specify the language code
    --output_dir=./wav2vec2-large-xlsr-turkish-demo \
    --overwrite_output_dir \
    --num_train_epochs="5" \
    --per_device_train_batch_size="16" \
    --learning_rate="3e-4" \
    --warmup_steps="500" \
    --evaluation_strategy="steps" \
    --save_steps="400" \
    --eval_steps="400" \
    --logging_steps="400" \
    --save_total_limit="3" \
    --freeze_feature_extractor \
    --feat_proj_dropout="0.0" \
    --layerdrop="0.1" \
    --fp16 \
    --group_by_length \
    --do_train --do_eval

This solves the problem in my case, and now I am able to run it with two GPUs. If it works for you, I will open a PR.
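
For context on why both halves of the fix matter: run_common_voice.py builds its command line from dataclass fields via transformers' HfArgumentParser, so the field's default is what applies whenever the flag is omitted. A minimal, hypothetical sketch of that mechanism (DemoArguments is an illustration, not the script's real argument class):

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser


@dataclass
class DemoArguments:
    # Hypothetical stand-in for the gradient_checkpointing field shown above.
    gradient_checkpointing: Optional[bool] = field(
        default=False,
        metadata={"help": "If True, use gradient checkpointing to save memory."},
    )


parser = HfArgumentParser(DemoArguments)

# With default=False, omitting the flag keeps checkpointing off; passing a bare
# --gradient_checkpointing on the command line would turn it back on.
(demo_args,) = parser.parse_args_into_dataclasses(args=[])
print(demo_args.gradient_checkpointing)  # False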

1 reaction
Gorodecki commented, Mar 25, 2021

I am experiencing this error too. CUDA 11.2, 4x T4 (16 GB), --dataset_config_name="ru".

Read more comments on GitHub.

Top Results From Across the Web

[Open-to-the-community] XLSR-Wav2Vec2 Fine-Tuning Week ...
We organize a community week (Mar 22th to Mar 29th) to fine-tune the cross-lingual speech recognition model XLSR-Wav2Vec2 on all languages ...

transformers/FINE_TUNE_XLSR_WAV2VEC2.md at main
Fine-Tuning week of XLSR-Wav2Vec2 on 60 languages ... We have provided run_common_voice.py script to run fine-tuning on local machine.

XLSR Wav2Vec2 Fine-Tuning Week - HackMD
Try using a delimiter which is not present in the dataset text. Caching bug in HuggingFace datasets: In some cases, the cached version...

Fine-tuning XLSR-Wav2Vec2 for WOLOF ASR with | Kaggle
Audio preprocessing and finetuning using wav2vec2-large-xlsr model on AI4D Baamtu Datamation - automatic speech recognition in WOLOF data.

Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers
This video will explain in detail how to fine-tune a multi-lingual Wav2Vec2 model on any dataset of Common Voice. It is a walkthrough...
