80min training time to fine-tune BERT-base on the SQuAD dataset instead of 24min?
I just fine-tuned BERT-base on the SQuAD dataset on an AWS EC2 p3.2xlarge Deep Learning AMI with a single Tesla V100 16GB.
I used the config in your README:
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--bert_model bert-base-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
It took 80 min. According to your README:
This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.
How can this difference be explained? Is there any way to accelerate the training to 24 min as well? Thanks

You should use 16-bit training (the --fp16 argument). You can use dynamic loss scaling, or tune the loss scale yourself if the results are not the best.
You can have a look at the README examples, but the result should be a lot higher, around 88-90. Maybe your batch size is too small; look at the README for more information.
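As a hedged sketch of the suggestion above: the same README command can be re-run with mixed precision enabled. In the pytorch-pretrained-BERT example script, --fp16 turns on half-precision training and a --loss_scale argument controls loss scaling (leaving it at 0 selects dynamic loss scaling, a positive power of 2 selects a static scale); this assumes NVIDIA apex is installed for fp16 support. Treat the exact flags as an assumption and confirm them with python run_squad.py --help.
# Assumed fp16 variant of the command above; flags per run_squad.py --help.
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--bert_model bert-base-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--fp16 \
--loss_scale 0 \
--output_dir /tmp/debug_squad/
On a V100, mixed precision uses the tensor cores and typically gives a substantial speedup over fp32 training, which is consistent with the gap between the 80 min run and the 24 min figure quoted in the README.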