~7% performance drop observed for the Hugging Face GPT2 model
System Info
platform: ROCm (AMD device)
python version: 3.7.13
There is a ~7% performance drop for the Hugging Face GPT2 model after the IFU (https://github.com/ROCmSoftwarePlatform/transformers/pull/15) on the https://github.com/ROCmSoftwarePlatform/transformers repository.
@patil-suraj, @patrickvonplaten, could you please help me find the change in transformers that is responsible for the performance drop?
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, …)
- My own task or dataset (give details below)
Reproduction
Command used to run the model:
```bash
python3 -m torch.distributed.launch --nproc_per_node=8 \
  transformers/examples/pytorch/language-modeling/run_clm.py \
  --output_dir output --model_name_or_path gpt2 \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --do_eval --label_smoothing 0.1 \
  --logging_steps 1 --logging_dir log --fp16 \
  --dataloader_num_workers 1 --skip_memory_metrics \
  --per_device_train_batch_size=8 --overwrite_output_dir --max_steps 150
```
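To quantify the drop across builds, the throughput metrics that run_clm.py writes can be compared directly. A minimal sketch, assuming the script saves all_results.json into --output_dir (the upstream example scripts do this when --do_train is passed; verify the key names against your checkout):

```bash
# Read throughput from the metrics file written by run_clm.py.
python3 -c "
import json
m = json.load(open('output/all_results.json'))
print('train_runtime:', m.get('train_runtime'))
print('train_samples_per_second:', m.get('train_samples_per_second'))
"
```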
Expected behavior
I was expecting similar or better performance of the model after the IFU on Aug 9, 2022.
I also tried more recent commits (after Aug 9, 2022); those degrade performance even further.
Top GitHub Comments
@rraminen, I don't think upstream HF can help much here; this is on AMD to root-cause. Please close this ticket.
Let’s start by figuring out which commit caused the regression on ROCm, and track it internally.
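One way to root-cause this is to bisect the fork between the last known-good commit and the post-IFU HEAD, using the reproduction command as the test. A rough sketch, where bisect_check.sh is a hypothetical helper script and BASELINE is a placeholder throughput taken from a known-good run:

```bash
#!/usr/bin/env bash
# bisect_check.sh -- hypothetical helper for `git bisect run`.
# Rebuilds the current checkout, reruns the benchmark from the
# Reproduction section, and exits non-zero when throughput regresses.
pip install -e . --quiet || exit 125   # 125 tells bisect to skip commits that don't build
python3 -m torch.distributed.launch --nproc_per_node=8 \
  examples/pytorch/language-modeling/run_clm.py \
  --output_dir output --model_name_or_path gpt2 \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --fp16 --dataloader_num_workers 1 --skip_memory_metrics \
  --per_device_train_batch_size=8 --overwrite_output_dir --max_steps 150 \
  || exit 125                          # also skip commits where the run itself breaks
python3 - <<'EOF'
import json, sys
BASELINE = 100.0  # placeholder: train_samples_per_second from a pre-IFU run
tps = json.load(open("output/all_results.json"))["train_samples_per_second"]
sys.exit(0 if tps >= 0.95 * BASELINE else 1)  # non-zero marks the commit bad
EOF
```

With that in place, git can walk the history automatically (`<last-good-commit>` is a placeholder for the last fast commit before the IFU merge):

```bash
cd transformers
git bisect start
git bisect bad HEAD                 # slow after the IFU
git bisect good <last-good-commit>  # fast before the IFU
git bisect run ./bisect_check.sh
git bisect reset
```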
Just seconding what @LysandreJik said: if we can help in any way to improve support or performance of our software on AMD chips, we’d like to help.
Just ping us.