[BUG] DeBERTa has bad performance when using ZeRO Stage-3, with continuous warnings "A module has unknown inputs or outputs type"
Describe the bug
DeBERTa has bad performance when using ZeRO Stage-3. stdout shows a continuous stream of warnings:
[stage3.py:104:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'torch.nn.parameter.Parameter'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly.
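For context, here is a minimal sketch of why a torch.nn.Parameter can slip past this kind of tensor detection. The function name and structure below are illustrative only, not DeepSpeed's actual implementation; the point is that an exact-type check on torch.Tensor does not match Parameter, even though Parameter is a Tensor subclass:

import torch

def apply_to_tensors_only(fn, value):
    # Illustrative recursion over a module's inputs/outputs: containers
    # are traversed, but the exact-type check below misses Parameter.
    if isinstance(value, (tuple, list)):
        return type(value)(apply_to_tensors_only(fn, v) for v in value)
    if isinstance(value, dict):
        return {k: apply_to_tensors_only(fn, v) for k, v in value.items()}
    if type(value) is torch.Tensor:  # exact match: Parameter falls through
        return fn(value)
    return value  # a Parameter lands here, so the hook never fires

p = torch.nn.Parameter(torch.ones(2))
print(type(p) is torch.Tensor)      # False
print(isinstance(p, torch.Tensor))  # True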
To Reproduce
Steps to reproduce the behavior:
- Official HF Accelerate run_glue_no_trainer.py script
- Setting up DeepSpeed ZeRO-3 through the command accelerate config. The output config yaml:
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false
- bash script to run the fine-tuning of microsoft/deberta-v2-xlarge-mnli on the MRPC dataset using ZeRO Stage-3:
#!/bin/bash
time accelerate launch /home/sourab/deepspeed-test/src/text-classification/run_glue_no_trainer.py \
--task_name "mrpc" \
--max_length 128 \
--model_name_or_path "microsoft/deberta-v2-xlarge-mnli" \
--output_dir "/home/sourab/deepspeed-test/glue/mrpc_deepspeed_stage3_accelerate" \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 1 \
--learning_rate 3.5e-6 \
--weight_decay 0.0 \
--max_grad_norm 1.0 \
--num_train_epochs 6 \
--num_warmup_steps 50 \
--with_tracking
- Relevant output snippets. The first shows the continuous warnings; the second shows the eval metrics being worse than the same setup without DeepSpeed.
Expected behavior
No continuous stream of warnings and no performance degradation when using DeepSpeed ZeRO Stage-3 with DeBERTa.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sourab/dev/lib/python3.8/site-packages/torch']
torch version .................... 1.12.0.dev20220505+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 10.2
deepspeed install path ........... ['/home/sourab/dev/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.4, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3
System info (please complete the following information):
- OS: Ubuntu 20.04.3 LTS (Focal Fossa)
- GPU count and types: 1 machine with 2x NVIDIA TITAN RTX
- Python version: Python 3.8.10
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Accelerate launcher, which just triggers the deepspeed launcher.
Top GitHub Comments
@pacman100, thanks for sharing your update. I am glad that the performance problem is resolved in the latest code. I have created #1974 to suppress the warning noise. The PR probably needs tweaking, such as deciding whether to report this warning a fixed number of times. Right now, it is completely turned off except in debugging mode. Can you please test the PR branch?
Hello @tjruwase, Thank you for the fix 😄! Yes, the above PR is working as expected to suppress the warnings.
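For reference, a rough sketch of the suppression approach described above, assuming a debug-gated, warn-once scheme; the helper name and structure here are hypothetical, not the actual code in #1974:

import logging

logger = logging.getLogger(__name__)
_warned_unknown_io = False

def warn_unknown_io_type(value_type):
    # Hypothetical gating: emit the noisy warning at most once, and only
    # when debug logging is enabled, instead of on every forward/backward.
    global _warned_unknown_io
    if logger.isEnabledFor(logging.DEBUG) and not _warned_unknown_io:
        logger.debug(
            "A module has unknown inputs or outputs type (%s) and the "
            "tensors embedded in it cannot be detected.", value_type)
        _warned_unknown_io = True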