~7% performance drop observed for the Hugging Face GPT2 model

System Info

  • Platform: ROCm (AMD device)
  • Python version: 3.7.13

A ~7% performance drop is observed for the Hugging Face GPT2 model after the IFU (integrate-from-upstream) merge (https://github.com/ROCmSoftwarePlatform/transformers/pull/15) on the https://github.com/ROCmSoftwarePlatform/transformers repository.

@patil-suraj, @patrickvonplaten, could you please help me find the change in transformers that is responsible for the drop in performance?
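Since an IFU merge pulls in a large batch of upstream commits at once, a binary search over that range is the standard way to isolate the culprit. A minimal sketch with git bisect follows; the two SHAs are placeholders for the pre-IFU merge-base (known fast) and the upstream head brought in by PR #15 (known slow):

cd transformers
git bisect start
git bisect bad <POST_IFU_SHA>     # placeholder: known-slow revision
git bisect good <PRE_IFU_SHA>     # placeholder: known-fast revision

# At each revision git checks out, reinstall and rerun the benchmark
# from the Reproduction section below, then mark the result:
pip install -e . --no-deps
git bisect good                   # or: git bisect bad, if throughput dropped

Each round halves the candidate range, so even a few hundred commits need only eight or nine benchmark runs before git names the first bad commit.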

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Command used to run the model:

python3 -m torch.distributed.launch --nproc_per_node=8 \
    transformers/examples/pytorch/language-modeling/run_clm.py \
    --output_dir output \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --do_eval \
    --label_smoothing 0.1 \
    --logging_steps 1 \
    --logging_dir log \
    --fp16 \
    --dataloader_num_workers 1 \
    --skip_memory_metrics \
    --per_device_train_batch_size=8 \
    --overwrite_output_dir \
    --max_steps 150
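With --do_train, the Trainer writes its summary metrics, including train_samples_per_second and train_runtime, to train_results.json under --output_dir, which gives a concrete number for the drop. A quick way to compare a pre-IFU and a post-IFU run (the two directory names are placeholders for wherever each run wrote its output):

# Compare throughput from two runs; directory names are placeholders.
for run in output_pre_ifu output_post_ifu; do
    printf '%s: ' "$run"
    python3 -c "import json, sys; print(json.load(open(sys.argv[1]))['train_samples_per_second'])" \
        "$run/train_results.json"
done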

Expected behavior

I was expecting similar or better model performance after the IFU on Aug 9, 2022.

I also tried the recent commits after Aug 9, 2022; those seem to worsen performance even further.
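For that kind of before/after check, the fork can be installed at a pinned revision and the same Reproduction command rerun; the commit SHA below is a placeholder:

# Install the fork at a specific commit for an A/B comparison (SHA is a placeholder).
pip install --force-reinstall \
    "git+https://github.com/ROCmSoftwarePlatform/transformers@<COMMIT_SHA>"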

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
amathews-amd commented, Oct 13, 2022

@rraminen, I don't think upstream HF can help much here; this is on AMD to root-cause. Please close this ticket.

Let’s get started on figuring out which commit caused the regression on ROCm, and track it internally.
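That per-commit search can be automated with git bisect run, which marks each revision from a script's exit status. A sketch, assuming a helper script named perf_check.sh and a placeholder throughput threshold picked between the known-fast and known-slow numbers:

#!/bin/sh
# perf_check.sh - for use with: git bisect run ./perf_check.sh
# Exit 0 = good (fast), 1 = bad (slow), 125 = skip an unbuildable revision.
pip install -e . --no-deps >/dev/null 2>&1 || exit 125
python3 -m torch.distributed.launch --nproc_per_node=8 \
    examples/pytorch/language-modeling/run_clm.py \
    --output_dir output --model_name_or_path gpt2 \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train --fp16 --per_device_train_batch_size=8 \
    --overwrite_output_dir --max_steps 150 >/dev/null 2>&1 || exit 125
tps=$(python3 -c "import json; print(json.load(open('output/train_results.json'))['train_samples_per_second'])")
# 100.0 is a placeholder threshold; pick a value between the measured
# fast and slow throughputs.
awk -v t="$tps" 'BEGIN { exit (t >= 100.0 ? 0 : 1) }'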

0 reactions
julien-c commented, Oct 13, 2022

Just seconding what @LysandreJik said: if we can help in any way to improve the support or performance of our software on AMD chips, we’d like to help.

Just ping us.
