Is my RoBERTa fine-tuning training normal?
Hi, I'm seeing something weird on my custom sentence-pair classification task when I try to fine-tune RoBERTa. I followed the official instructions in finetune_custom_classification.md. The mini-batch accuracy is only about 72% after 4.5 epochs, and the training loss is not changing at all. Below is part of the training log.
| epoch 004: 60%|6| 11710/19494 [9:13:13<6:22:36, 2.95s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.335, bsz=63.999, num_updates=70192, lr=4.39121e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11711/19494 [9:13:16<6:07:54, 2.84s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.324, bsz=63.999, num_updates=70193, lr=4.39117e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11712/19494 [9:13:18<5:57:49, 2.76s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.301, bsz=63.999, num_updates=70194, lr=4.39113e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11713/19494 [9:13:22<6:07:35, 2.83s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.305, bsz=63.999, num_updates=70195, lr=4.3911e-06, gnorm=2.331, clip=0.000, oom=0.000
| epoch 004: 60%|6| 11714/19494 [9:13:24<6:12:05, 2.87s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70196, lr=4.39106e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11715/19494 [9:13:27<6:15:06, 2.89s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70197, lr=4.39102e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11716/19494 [9:13:30<6:01:41, 2.79s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.299, bsz=63.999, num_updates=70198, lr=4.39098e-06, gnorm=2.331, clip=0.000, oom=0.000
, wall=199126, train_wall=195055, accuracy=0.727044]
And the AUC on the test set is only around 56%:
Model | AUC of Test Set |
---|---|
checkpoint1.pt | 0.5563589297270759 |
checkpoint_1_6000.pt | 0.5355381491151726 |
checkpoint_1_12000.pt | 0.55602419048894259 |
checkpoint_1_18000.pt | 0.5745017964339114 |
checkpoint2.pt | 0.5630760304389548 |
checkpoint_2_24000.pt | 0.5613800182990784 |
checkpoint_2_30000.pt | 0.5706188212715628 |
checkpoint_2_36000.pt | 0.5615139139943317 |
checkpoint3.pt | 0.5755729619959384 |
checkpoint_3_42000.pt | 0.555890294793689 |
checkpoint_3_48000.pt | 0.5390417531409699 |
checkpoint_3_54000.pt | 0.559014527682935 |
I tried learning rates from 5e-5 to 6e-5, and the above is the best result I got.
I also found 9 types in the label dictionary. Is that expected? This is just a binary classification task.
loading archive file /home/fecheng/project/fairseq/checkpoints/lr7e-6_mp150
loading archive file data/list_qp_train_en_filter.tsv/
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
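One way to see what actually ended up in that label dictionary is to dump it directly. This is a minimal sketch; the `label/dict.txt` path is an assumption based on the standard `sentence_prediction` preprocessing layout, so adjust it to your own data directory:

```python
from fairseq.data import Dictionary

# Path is an assumption based on the usual sentence_prediction data layout;
# point it at wherever the binarized label dictionary actually lives.
label_dict = Dictionary.load("data/list_qp_train_en_filter.tsv/label/dict.txt")

# fairseq's Dictionary prepends special symbols (<s>, <pad>, </s>, <unk>) and
# may pad with placeholder entries, so the reported type count is larger than
# the number of real class labels; printing the entries shows whether any
# unexpected label values slipped in during preprocessing.
for idx in range(len(label_dict)):
    print(idx, label_dict[idx])
```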
Below are my environment and training command:
python : 3.6.7
pytorch: 1.0
GPU: P40 22G
input_data_dir=data/list_qp_train_en_filter.tsv/
TOTAL_NUM_UPDATES=187500 # after TOTAL_NUM_UPDATES, lr will be 0
WARMUP_UPDATES=500 # 6 percent of the number of updates
LR=1e-5
NUM_CLASSES=2
BATCH_SIZE=16
max_positions=150
save_dir=checkpoints/lr${LR}_mp${max_positions}
train_log=$save_dir/train.log
mkdir -p $save_dir
CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/ \
--max-positions $max_positions \
--max-sentences $BATCH_SIZE \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4 \
--save-dir $save_dir \
--save-interval-updates 6000 \
--keep-interval-updates -1 \
--log-format tqdm \
--find-unused-parameters
Top GitHub Comments
That means you’re not using RoBERTa or pretraining at all – you’re just using a randomly initialized model with the BERT architecture.
It’s dynamic based on whether you specify an absolute path or not: https://github.com/pytorch/fairseq/blob/832491962b30fb2164bed696e1489685a885402f/fairseq/checkpoint_utils.py#L100-L103
I’ll probably modify this code to be a bit more robust to non-absolute paths.
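For reference, the behavior at the linked lines works roughly like this (a paraphrased sketch, not a verbatim quote of that fairseq revision): an absolute `--restore-file` is used as-is, while a relative one is resolved against `--save-dir`.

```python
import os

def resolve_restore_file(restore_file: str, save_dir: str) -> str:
    """Sketch of how --restore-file is resolved (paraphrased): an absolute
    path is used directly, a relative path is joined with --save-dir."""
    if os.path.isabs(restore_file):
        return restore_file
    return os.path.join(save_dir, restore_file)

# With the command above, the relative --restore-file value ends up under the
# save dir, where no such checkpoint exists, so training can silently start
# from a randomly initialized model:
print(resolve_restore_file("models/pretrained/roberta.large/",
                           "checkpoints/lr1e-5_mp150"))
```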
Yes, because you have `--max-positions 150` in your command. The pretrained model expects `--max-positions 512`, so when you try to load the checkpoint it sees extra positional embeddings and can't load them. I can try to add a fallback that trims the unused positional embeddings, but the easiest thing is to change `--max-positions=512`.

I think your command has a typo: `--restore-file` should point to a `.pt` file. So it's probably using a randomly initialized model instead of the RoBERTa model. Can you confirm whether you see the line `loaded checkpoint (...)/model.pt (epoch 0 @ 0 updates)` in your training log?
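As a quick sanity check, the pretrained directory can be loaded through the hub interface first; if this works, the same `model.pt` path is what `--restore-file` should point at. A sketch, assuming the standard roberta.large release layout, which ships a `model.pt` inside the extracted folder:

```python
from fairseq.models.roberta import RobertaModel

# The directory is the poster's own path; "model.pt" is the checkpoint name
# shipped in the roberta.large release tarball.
roberta = RobertaModel.from_pretrained(
    "models/pretrained/roberta.large",
    checkpoint_file="model.pt",
)
roberta.eval()

# If this loads cleanly, pass the full file path to training, e.g.
#   --restore-file models/pretrained/roberta.large/model.pt
# (preferably as an absolute path) and keep --max-positions at 512 so the
# positional embeddings line up with the pretrained checkpoint.
```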