80min training time to fine-tune BERT-base on the SQuAD dataset instead of 24min?
I just fine-tuned BERT-base on the SQuAD dataset on an AWS EC2 p3.2xlarge Deep Learning AMI with a single Tesla V100 16GB.
I used the config in your README:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--bert_model bert-base-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
It took 80 min. According to your README:
This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.
How can this difference be explained? Is there any way to accelerate the training to 24 min as well? Thanks

You should use 16-bit training (the --fp16 argument). You can use dynamic loss scaling, or tune the loss scale yourself if the results are not the best.
You can have a look at the README examples, but the result should be a lot higher, around 88-90. Maybe your batch size is too small; look at the README for more information.
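As a hedged sketch of the suggestion above: the same README command can be re-run with mixed precision enabled. In the pytorch-pretrained-BERT example script, --fp16 turns on half-precision training and a --loss_scale argument controls loss scaling (leaving it at 0 selects dynamic loss scaling, a positive power of 2 selects a static scale); this assumes NVIDIA apex is installed for fp16 support. Treat the exact flags as an assumption and confirm them with python run_squad.py --help.
# Assumed fp16 variant of the command above; flags per run_squad.py --help.
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--bert_model bert-base-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--fp16 \
--loss_scale 0 \
--output_dir /tmp/debug_squad/
On a V100, mixed precision uses the tensor cores and typically gives a substantial speedup over fp32 training, which is consistent with the gap between the 80 min run and the 24 min figure quoted in the README.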