Optimum's DeBERTa-V2 behavior strange when training with ORT (training hangs or takes impossibly long)

System Info

Running with CUDA 11.5, Python 3.8, and torch 1.11. I installed the Python dependencies from requirements.txt in the text-classification example folder. I installed transformers from source and tried Optimum both installed from source and installed with pip; the results were the same for both.

Running in Ubuntu image on a VM with 8 V100 GPUs.

Who can help?

@JingyaHuang @echarlaix

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

After properly setting up the environment, I run the following:

python -m torch.distributed.run --nproc_per_node=8 run_glue.py --model_name_or_path microsoft/deberta-v2-xxlarge --task_name MRPC --do_train --max_seq_length 128 --per_device_train_batch_size 1 --learning_rate 3e-6 --max_steps 8000 --output_dir /tmp/deberta_res --overwrite_output_dir --logging_steps 8000 --fp16 --sharded_ddp simple --num_train_epochs 1

It downloads and tokenizes the dataset. Then, when it begins setting up ONNX (that is, when it reaches the line that starts training with the ORTTrainer), it hangs for around 7 minutes 40 seconds (give or take 5 seconds) with no terminal output and GPU utilization at 0. After that wait it continues, but it trains very slowly and prints a large volume of log messages about the ONNX graph. The output scrolls so fast that I can't read the messages, and no status bar for training progress is visible. I let it train for over 4 days, and it still hadn't finished.
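One way to make output like this readable is to capture it to a file with elapsed-time stamps, so that both the silent gap and the burst of graph messages can be inspected afterwards. The following is a minimal sketch using only the Python standard library; the helper name `run_and_log` and the log file name are hypothetical and are not part of Optimum, Transformers, or the original report.

```python
import subprocess
import time


def run_and_log(cmd, log_path):
    """Run cmd and write its merged stdout/stderr to log_path.

    Each line is prefixed with the seconds elapsed since launch, so a long
    silent period shows up as a jump between two consecutive stamps.
    Returns the exit code of the command.
    """
    start = time.monotonic()
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            bufsize=1,  # line-buffered reads on our side of the pipe
        )
        for line in proc.stdout:
            elapsed = time.monotonic() - start
            log.write(f"[{elapsed:9.1f}s] {line}")
            log.flush()  # keep the file current while the job is running
        proc.stdout.close()
        return proc.wait()
```

To use it, pass the training command as a list of arguments, for example the `python -m torch.distributed.run` invocation above split into its individual arguments, along with a path such as `train.log`. The resulting file can then be searched or paged through at leisure.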

I ran the same arguments on the corresponding examples run_glue.py script from the Transformers repository without adding the Optimum ORTTrainer, and it finished training within an hour – it also did not print out any terminal output beyond the expected status bars and warnings.

Finally, I tried modifying the run_glue.py example script from the Transformers repository to use the Optimum ORTTrainer. It printed so much ONNX graph information to the terminal that the status bar, if one was printed at all, was obscured.

I did not run into any error messages, just strange behavior with the training hanging, the logs, and the unnaturally long training time.

Thanks for your time! Please let me know if I set up my environment incorrectly etc.

Expected behavior

Trains successfully – I ran the corresponding examples run_glue.py script from the Transformers repository with the same arguments and it finished training within the hour.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
JingyaHuang commented, Aug 22, 2022

The fix has been merged in transformers, closing the issue.

1 reaction
zhijxu-MS commented, Aug 9, 2022

@JingyaHuang @askhade this branch runs on my side.
