
[wav2vec] deepspeed eval bug in the case of >1 gpus

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.15.0-140-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <2,4>
  • Using distributed or parallel set-up in script?: <distributed>

Who can help

@stas00 @patrickvonplaten @patil-suraj

Information

I’m working on wav2vec 2.0 using the following official Hugging Face script: https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py

I am trying to fine-tune the model on multiple GPUs using DeepSpeed.

deepspeed --num_gpus=1 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

works, but

deepspeed --num_gpus=2 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

freezes at the end of evaluation: the progress bar reaches 100%, but the eval result is never returned.

To reproduce

This Colab notebook reproduces the issue: https://colab.research.google.com/drive/1VRCGcnhBlrMFYQ5aaNebucZuja-WB2I2?usp=sharing

Steps to reproduce the behavior:

  1. Install DeepSpeed
  2. Add with autocast(): after line 481 in run_common_voice.py (see the sketch after these steps)
  3. Set the parameters: --deepspeed ds_config.json --do_train --do_eval
  4. Run run_common_voice.py with the deepspeed launcher on >1 GPUs
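
For context, a hedged sketch of what the modification in step 2 amounts to (toy model and inputs, not the real line 481 of run_common_voice.py); note the maintainer’s closing remark below about not mixing amp-style autocast with DeepSpeed:

import torch
from torch.cuda.amp import autocast

# Stand-ins for the forward call the reporter wrapped; in the real script the
# wrapped code is the forward pass inside the Trainer's training step.
model = torch.nn.Linear(8, 2)
inputs = torch.randn(4, 8)

with autocast(enabled=torch.cuda.is_available()):
    outputs = model(inputs)
print(outputs.dtype)  # torch.float32 for these CPU tensors; autocast only affects CUDA ops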

ds_config.json has the following parameters:

{
  "fp16": {
    "enabled": "true",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": "false"
}
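
As an aside that is not part of the original report: DeepSpeed’s documented examples write "enabled" and "wall_clock_breakdown" as JSON booleans rather than strings, and "opt_level" looks like an apex-style amp option rather than a documented fp16-section key. A hedged sketch of a cleaned-up config, written from Python only to keep the example self-contained:

import json

# Sketch only: the same settings as the reported config, with booleans instead
# of the strings "true"/"false" and without the amp-style "opt_level" key.
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,  # 0 selects dynamic loss scaling
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "steps_per_print": 100,
    "wall_clock_breakdown": False,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)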

Expected behavior

The fine-tuning evaluation should complete without freezing.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, May 8, 2021

You’re welcome to follow my progress on fixing this issue at https://github.com/huggingface/transformers/pull/11638

ZeRO-2 works fully. ZeRO-3 still has one issue, but fp32 works.

Do try and let me know if you run into any problems.

1 reaction
stas00 commented, Apr 27, 2021

OK, this is a new type of model that requires special handling.

The NLP models receive long (integer) inputs, and the embedding lookup produces activations in the same dtype as the embedding weights, which under DeepSpeed/fp16 are float16 (DeepSpeed currently calls model.half()).

This model, however, receives float32 inputs, and nothing checks whether the model weights are fp16 or not; hence the error.
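
A minimal standalone illustration of the difference (toy layers, not the actual wav2vec2 modules):

import torch
import torch.nn as nn

# NLP case: integer token ids go through an embedding lookup, so the
# activations simply come out in the embedding weights' dtype after .half().
emb = nn.Embedding(10, 4).half()
print(emb(torch.tensor([1, 2, 3])).dtype)  # torch.float16

# wav2vec2 case: the raw audio input is itself a float32 tensor, so a
# half-precision conv sees mismatched dtypes and raises a RuntimeError
# (on CUDA: "Input type (torch.cuda.FloatTensor) and weight type
# (torch.cuda.HalfTensor) should be the same").
conv = nn.Conv1d(1, 1, kernel_size=3).half()
try:
    conv(torch.randn(1, 1, 16))  # float32 input
except RuntimeError as e:
    print(e)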

So this is one way to fix it:

diff --git a/src/transformers/models/wav2vec2/modeling_wav2vec2.py b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
index 98123bdd3..639c2bc13 100755
--- a/src/transformers/models/wav2vec2/modeling_wav2vec2.py
+++ b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
@@ -153,7 +153,7 @@ class Wav2Vec2LayerNormConvLayer(nn.Module):
         self.activation = ACT2FN[config.feat_extract_activation]

     def forward(self, hidden_states):
-        hidden_states = self.conv(hidden_states)
+        hidden_states = self.conv(hidden_states.to(dtype=self.conv.weight.dtype))

         hidden_states = hidden_states.transpose(-2, -1)
         hidden_states = self.layer_norm(hidden_states)

The test I was using is:

CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 \
examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --fp16 --group_by_length --do_train --do_eval --deepspeed \
tests/deepspeed/ds_config_zero2.json

The cast could probably be moved to a top-level layer so it would work in all cases, in case this exact code path isn’t always taken.
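
A hedged sketch of that idea (an illustrative module, not the actual transformers code): cast once at the top of forward to whatever dtype the weights happen to be in, so every downstream layer sees matching dtypes.

import torch
import torch.nn as nn

class TopLevelCastModel(nn.Module):
    """Illustrative only: a single top-level cast instead of per-conv-layer casts."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 4, kernel_size=3)

    def forward(self, input_values):
        # One cast here covers every downstream layer, regardless of which
        # feature-extractor variant is taken.
        input_values = input_values.to(dtype=self.conv.weight.dtype)
        return self.conv(input_values)

if torch.cuda.is_available():
    model = TopLevelCastModel().cuda().half()
    out = model(torch.randn(2, 1, 16, device="cuda"))  # float32 in
    print(out.dtype)  # torch.float16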

So this overcomes:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

but now running into:

  File "examples/research_projects/wav2vec2/run_common_voice.py", line 512, in <module>
    main()
  File "examples/research_projects/wav2vec2/run_common_voice.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1240, in train
    tr_loss += self.training_step(model, inputs)
  File "examples/research_projects/wav2vec2/run_common_voice.py", line 232, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1667, in compute_loss
    outputs = model(**inputs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 942, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1076, in forward
    loss = F.ctc_loss(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 2436, in ctc_loss
    return torch.ctc_loss(
RuntimeError: "ctc_loss_cuda" not implemented for 'Half'

So I need to look into what to do there; we probably need to switch to float32 just for that op.
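
A hedged sketch of that workaround (a standalone wrapper, not the actual modeling_wav2vec2.py change): upcast the log-probs to float32 just for the CTC loss while everything around it stays in fp16.

import torch
import torch.nn.functional as F

def ctc_loss_fp32(log_probs, targets, input_lengths, target_lengths, blank=0):
    # ctc_loss has no half-precision kernel, so compute this single op in
    # float32; the surrounding model can remain fp16.
    return F.ctc_loss(
        log_probs.float(),
        targets,
        input_lengths,
        target_lengths,
        blank=blank,
        zero_infinity=True,
    )

# Toy usage with fp16 log-probs of shape (T, N, C).
log_probs = torch.randn(50, 2, 20).log_softmax(-1).half()
targets = torch.randint(1, 20, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
print(ctc_loss_fp32(log_probs, targets, input_lengths, target_lengths))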

However, it appears that maybe this model can’t be trained/eval’ed in fp16/mixed precision at all?

When I run:

CUDA_VISIBLE_DEVICES=0 python examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --fp16 --group_by_length --do_train --do_eval

I see:

{'loss': nan, 'learning_rate': 4.2e-06, 'epoch': 0.05}  

We have multiple models that won’t train under fp16 mixed precision, because they were pretrained in bfloat16, whose values don’t fit well into fp16’s numerical range.
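
For illustration (not from the original comment), the range gap in question:

import torch

# bfloat16 keeps float32's exponent range, while fp16 tops out around 6.5e4,
# so values from a bf16-pretrained model can overflow to inf/nan under fp16.
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float16).max)   # 65504.0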

Deepspeed devs are working on adding the fp32 mode (next release hopefully). https://github.com/microsoft/DeepSpeed/pull/1004

p.s. please don’t mix amp with running modes that don’t use amp (deepspeed is one of them)
