
[wav2vec] deepspeed eval bug in the case of >1 gpus

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-4.15.0-140-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.8.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <2,4>
  • Using distributed or parallel set-up in script?: <distributed>

Who can help

@stas00 @patrickvonplaten @patil-suraj

Information

I’m working on wav2vec 2.0 using the following official Hugging Face script: https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_common_voice.py

I am trying to fine-tune the model on multiple GPUs using DeepSpeed.

deepspeed --num_gpus=1 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

works, but

deepspeed --num_gpus=2 run_common_voice.py --deepspeed ds_config.json --do_train --do_eval

freezes at the end of evaluation: the progress bar reaches 100%, but the eval result is never returned.

To reproduce

This Colab notebook reproduces the issue: https://colab.research.google.com/drive/1VRCGcnhBlrMFYQ5aaNebucZuja-WB2I2?usp=sharing

Steps to reproduce the behavior:

  1. Install DeepSpeed
  2. Add with autocast(): after line 481 in run_common_voice.py (see the sketch after these steps)
  3. Set the parameters: --deepspeed ds_config.json --do_train --do_eval
  4. Run run_common_voice.py with the deepspeed launcher on >1 GPUs
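
For context, a hedged sketch of what the modification in step 2 amounts to (toy model and inputs, not the real line 481 of run_common_voice.py); note the maintainer’s closing remark below about not mixing amp-style autocast with DeepSpeed:

import torch
from torch.cuda.amp import autocast

# Stand-ins for the forward call the reporter wrapped; in the real script the
# wrapped code is the forward pass inside the Trainer's training step.
model = torch.nn.Linear(8, 2)
inputs = torch.randn(4, 8)

with autocast(enabled=torch.cuda.is_available()):
    outputs = model(inputs)
print(outputs.dtype)  # torch.float32 for these CPU tensors; autocast only affects CUDA ops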

ds_config.json has the following parameters:

{
  "fp16": {
    "enabled": "true",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
    "opt_level": "O3"
  },
  "steps_per_print": 100,
  "wall_clock_breakdown": "false"
}
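
As an aside that is not part of the original report: DeepSpeed’s documented examples write "enabled" and "wall_clock_breakdown" as JSON booleans rather than strings, and "opt_level" looks like an apex-style amp option rather than a documented fp16-section key. A hedged sketch of a cleaned-up config, written from Python only to keep the example self-contained:

import json

# Sketch only: the same settings as the reported config, with booleans instead
# of the strings "true"/"false" and without the amp-style "opt_level" key.
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,  # 0 selects dynamic loss scaling
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
    "steps_per_print": 100,
    "wall_clock_breakdown": False,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)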

Expected behavior

The fine-tuning evaluation should complete without freezing.

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, May 8, 2021

You’re welcome to follow my progress on fixing this issue at https://github.com/huggingface/transformers/pull/11638

ZeRO-2 works fully. ZeRO-3 still has one issue, but fp32 works.

Do try and let me know if you run into any problems.

1 reaction
stas00 commented, Apr 27, 2021

OK, this is a new type of model that requires special handling.

The NLP models receive long (integer) inputs, and the embedding lookup produces activations in the same dtype as the embedding weights, which under DeepSpeed/fp16 are float16 (DeepSpeed currently calls model.half()).

This model, however, receives float32 inputs, and nothing checks whether the model weights are fp16 or not; hence the error.
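
A minimal standalone illustration of the difference (toy layers, not the actual wav2vec2 modules):

import torch
import torch.nn as nn

# NLP case: integer token ids go through an embedding lookup, so the
# activations simply come out in the embedding weights' dtype after .half().
emb = nn.Embedding(10, 4).half()
print(emb(torch.tensor([1, 2, 3])).dtype)  # torch.float16

# wav2vec2 case: the raw audio input is itself a float32 tensor, so a
# half-precision conv sees mismatched dtypes and raises a RuntimeError
# (on CUDA: "Input type (torch.cuda.FloatTensor) and weight type
# (torch.cuda.HalfTensor) should be the same").
conv = nn.Conv1d(1, 1, kernel_size=3).half()
try:
    conv(torch.randn(1, 1, 16))  # float32 input
except RuntimeError as e:
    print(e)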

So this is one way to fix it:

diff --git a/src/transformers/models/wav2vec2/modeling_wav2vec2.py b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
index 98123bdd3..639c2bc13 100755
--- a/src/transformers/models/wav2vec2/modeling_wav2vec2.py
+++ b/src/transformers/models/wav2vec2/modeling_wav2vec2.py
@@ -153,7 +153,7 @@ class Wav2Vec2LayerNormConvLayer(nn.Module):
         self.activation = ACT2FN[config.feat_extract_activation]

     def forward(self, hidden_states):
-        hidden_states = self.conv(hidden_states)
+        hidden_states = self.conv(hidden_states.to(dtype=self.conv.weight.dtype))

         hidden_states = hidden_states.transpose(-2, -1)
         hidden_states = self.layer_norm(hidden_states)

The test I was using is:

CUDA_VISIBLE_DEVICES=0 deepspeed --num_gpus=1 \
examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --fp16 --group_by_length --do_train --do_eval --deepspeed \
tests/deepspeed/ds_config_zero2.json

The cast could probably be moved to a top-level layer so it would work in all cases, in case this exact code path isn’t always taken.
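
A hedged sketch of that idea (an illustrative module, not the actual transformers code): cast once at the top of forward to whatever dtype the weights happen to be in, so every downstream layer sees matching dtypes.

import torch
import torch.nn as nn

class TopLevelCastModel(nn.Module):
    """Illustrative only: a single top-level cast instead of per-conv-layer casts."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 4, kernel_size=3)

    def forward(self, input_values):
        # One cast here covers every downstream layer, regardless of which
        # feature-extractor variant is taken.
        input_values = input_values.to(dtype=self.conv.weight.dtype)
        return self.conv(input_values)

if torch.cuda.is_available():
    model = TopLevelCastModel().cuda().half()
    out = model(torch.randn(2, 1, 16, device="cuda"))  # float32 in
    print(out.dtype)  # torch.float16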

So this overcomes:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

but now running into:

  File "examples/research_projects/wav2vec2/run_common_voice.py", line 512, in <module>
    main()
  File "examples/research_projects/wav2vec2/run_common_voice.py", line 484, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1240, in train
    tr_loss += self.training_step(model, inputs)
  File "examples/research_projects/wav2vec2/run_common_voice.py", line 232, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/trainer.py", line 1667, in compute_loss
    outputs = model(**inputs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 942, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1015, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/nvme1/code/huggingface/transformers-master/src/transformers/models/wav2vec2/modeling_wav2vec2.py", line 1076, in forward
    loss = F.ctc_loss(
  File "/home/stas/anaconda3/envs/py38-pt18/lib/python3.8/site-packages/torch/nn/functional.py", line 2436, in ctc_loss
    return torch.ctc_loss(
RuntimeError: "ctc_loss_cuda" not implemented for 'Half'

So I need to look into what to do there; we probably need to switch to float32 just for that op.
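
A hedged sketch of that workaround (a standalone wrapper, not the actual modeling_wav2vec2.py change): upcast the log-probs to float32 just for the CTC loss while everything around it stays in fp16.

import torch
import torch.nn.functional as F

def ctc_loss_fp32(log_probs, targets, input_lengths, target_lengths, blank=0):
    # ctc_loss has no half-precision kernel, so compute this single op in
    # float32; the surrounding model can remain fp16.
    return F.ctc_loss(
        log_probs.float(),
        targets,
        input_lengths,
        target_lengths,
        blank=blank,
        zero_infinity=True,
    )

# Toy usage with fp16 log-probs of shape (T, N, C).
log_probs = torch.randn(50, 2, 20).log_softmax(-1).half()
targets = torch.randint(1, 20, (2, 10))
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)
print(ctc_loss_fp32(log_probs, targets, input_lengths, target_lengths))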

However, it appears that maybe this model can’t be trained/eval’ed in fp16/mixed precision at all?

When I run:

CUDA_VISIBLE_DEVICES=0 python examples/research_projects/wav2vec2/run_common_voice.py \
--model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="tr" \
--output_dir=./wav2vec2-large-xlsr-turkish-demo --overwrite_output_dir --num_train_epochs="5" \
--per_device_train_batch_size="16" --learning_rate="3e-4" --warmup_steps="500" \
--evaluation_strategy="steps" --save_steps="5" --eval_steps="5" --logging_steps="5" \
--save_total_limit="3" --freeze_feature_extractor --feat_proj_dropout="0.0" --layerdrop="0.1" \
--gradient_checkpointing --fp16 --group_by_length --do_train --do_eval

I see:

{'loss': nan, 'learning_rate': 4.2e-06, 'epoch': 0.05}  

We have multiple models that won’t train under fp16 mixed precision, because they were pretrained in bfloat16, whose values don’t fit well into fp16’s numerical range.
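
For illustration (not from the original comment), the range gap in question:

import torch

# bfloat16 keeps float32's exponent range, while fp16 tops out around 6.5e4,
# so values from a bf16-pretrained model can overflow to inf/nan under fp16.
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38
print(torch.finfo(torch.float16).max)   # 65504.0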

Deepspeed devs are working on adding the fp32 mode (next release hopefully). https://github.com/microsoft/DeepSpeed/pull/1004

p.s. please don’t mix amp with running modes that don’t use amp (deepspeed is one of them)
