Data type error while fine-tuning DeBERTa v3 Large using the provided code
Environment info
- transformers version: 4.13.0.dev0
- Platform: Ubuntu 18.04
- Python version: Python 3.6.9
- PyTorch version (GPU?): 1.11.0.dev20211110+cu111
- Tensorflow version (GPU?): 2.6.2
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet …): microsoft/deberta-v3-large
The problem arises when using:
- [x] the official example scripts: https://huggingface.co/microsoft/deberta-v3-large#fine-tuning-with-hf-transformers
- [ ] my own modified scripts
The task I am working on is:
- [x] an official GLUE/SQuAD task: mnli
- [ ] my own task or dataset
To reproduce
Steps to reproduce the behavior:
- Go to transformers/examples/pytorch/text-classification/
- Run:
python3 run_glue.py --model_name_or_path microsoft/deberta-v3-large --task_name mnli --do_train --do_eval --evaluation_strategy steps --max_seq_length 256 --warmup_steps 50 --learning_rate 6e-5 --num_train_epochs 3 --output_dir outputv3 --overwrite_output_dir --logging_steps 10000 --logging_dir outputv3/
or run the script given in the model card: https://huggingface.co/microsoft/deberta-v3-large#fine-tuning-with-hf-transformers
Expected behavior
Training of microsoft/deberta-v3-large on the mnli dataset.
The error I am getting:

Traceback (most recent call last):
  File "run_glue.py", line 568, in <module>
    main()
  File "run_glue.py", line 486, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/nikhil/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1316, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/nikhil/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1867, in training_step
    loss.backward()
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/_tensor.py", line 352, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/home/nikhil/.local/lib/python3.6/site-packages/torch/autograd/function.py", line 199, in apply
    return user_fn(self, *args)
  File "/home/nikhil/.local/lib/python3.6/site-packages/transformers/models/deberta_v2/modeling_deberta_v2.py", line 114, in backward
    inputGrad = _softmax_backward_data(grad_output, output, self.dim, output)
TypeError: _softmax_backward_data(): argument 'input_dtype' (position 4) must be torch.dtype, not Tensor
I am also getting the same error when trying to train DeBERTa-v2.
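For context, the type mismatch can be reproduced outside the Trainer with a few lines. This is a minimal sketch, assuming a PyTorch build (such as the nightly listed above) where the fourth argument of _softmax_backward_data is already expected to be a dtype:

```python
import torch

# Assumption: the installed PyTorch expects 'input_dtype' as the fourth argument.
x = torch.randn(2, 3, requires_grad=True)
y = torch.softmax(x, dim=1)
grad = torch.ones_like(y)

# Old-style call that passes the output tensor as the fourth argument raises:
# TypeError: _softmax_backward_data(): argument 'input_dtype' (position 4)
#            must be torch.dtype, not Tensor
torch._softmax_backward_data(grad, y, 1, y)
```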
The fourth argument of _softmax_backward_data is now a torch.dtype:
https://github.com/pytorch/pytorch/blob/a34d2849cd3d39c2ce912402bfd90aea75162d1f/tools/autograd/derivatives.yaml#L1852
Changing

inputGrad = _softmax_backward_data(grad_output, output, self.dim, output)

to

inputGrad = _softmax_backward_data(grad_output, output, self.dim, output.dtype)

seems to work.
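For anyone who needs the same modeling code to run on both older and newer PyTorch releases, here is a minimal sketch of a version-guarded call. The 1.11 cutoff and the helper name softmax_backward are assumptions based on this thread, not the exact patch that later landed upstream:

```python
import torch
from packaging import version

# Assumption: the dtype-based signature appears in PyTorch 1.11. Using base_version
# strips ".devYYYYMMDD" / "+cu111" suffixes so nightlies are compared correctly.
_TORCH_HAS_DTYPE_ARG = version.parse(
    version.parse(torch.__version__).base_version
) >= version.parse("1.11")


def softmax_backward(grad_output, output, dim):
    """Call the private _softmax_backward_data with whichever fourth
    argument the installed PyTorch expects."""
    if _TORCH_HAS_DTYPE_ARG:
        # Newer PyTorch: the fourth argument is the input dtype.
        return torch._softmax_backward_data(grad_output, output, dim, output.dtype)
    # Older PyTorch: the fourth argument is a tensor.
    return torch._softmax_backward_data(grad_output, output, dim, output)
```

Checking the version once at import time, rather than wrapping the call in try/except, keeps the exception handling out of the backward hot path.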
Hello @NIKHILDUGAR, thanks for opening an issue! I'm trying to reproduce the same error as you but failing to do so: the training runs correctly.

I wonder if it isn't because you're on the bleeding edge with a PyTorch dev version? We recommend using a stable PyTorch release, as those are heavily tested in our CI. Do you get the same error when using PyTorch 1.10?