Is my RoBERTa fine-tuning training normal?
Hi, I'm seeing something weird on my custom sentence-pair classification task when I try to fine-tune RoBERTa. I followed the official instructions in finetune_custom_classification.md. The mini-batch accuracy is only about 72% after 4.5 epochs, and the training loss is not changing at all. Below is part of the training log.
| epoch 004: 60%|6| 11710/19494 [9:13:13<6:22:36, 2.95s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.335, bsz=63.999, num_updates=70192, lr=4.39121e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11711/19494 [9:13:16<6:07:54, 2.84s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.324, bsz=63.999, num_updates=70193, lr=4.39117e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11712/19494 [9:13:18<5:57:49, 2.76s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.301, bsz=63.999, num_updates=70194, lr=4.39113e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11713/19494 [9:13:22<6:07:35, 2.83s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.305, bsz=63.999, num_updates=70195, lr=4.3911e-06, gnorm=2.331, clip=0.000, oom=0.000
| epoch 004: 60%|6| 11714/19494 [9:13:24<6:12:05, 2.87s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70196, lr=4.39106e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11715/19494 [9:13:27<6:15:06, 2.89s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.321, bsz=63.999, num_updates=70197, lr=4.39102e-06, gnorm=2.331, clip=0.000, oom=0.00
| epoch 004: 60%|6| 11716/19494 [9:13:30<6:01:41, 2.79s/it, loss=0.777, nll_loss=0.009, ppl=1.01, wps=1987, ups=0, wpb=5633.299, bsz=63.999, num_updates=70198, lr=4.39098e-06, gnorm=2.331, clip=0.000, oom=0.000
, wall=199126, train_wall=195055, accuracy=0.727044]
And the AUC on the test set is only around 56%:
Model | AUC of Test Set |
---|---|
checkpoint1.pt | 0.5563589297270759 |
checkpoint_1_6000.pt | 0.5355381491151726 |
checkpoint_1_12000.pt | 0.55602419048894259 |
checkpoint_1_18000.pt | 0.5745017964339114 |
checkpoint2.pt | 0.5630760304389548 |
checkpoint_2_24000.pt | 0.5613800182990784 |
checkpoint_2_30000.pt | 0.5706188212715628 |
checkpoint_2_36000.pt | 0.5615139139943317 |
checkpoint3.pt | 0.5755729619959384 |
checkpoint_3_42000.pt | 0.555890294793689 |
checkpoint_3_48000.pt | 0.5390417531409699 |
checkpoint_3_54000.pt | 0.559014527682935 |
I tried learning rates from 5e-5 to 6e-5, and the above is the best result I got.
I also found 9 types in the label dictionary. Is that expected? This is just a binary classification task.
loading archive file /home/fecheng/project/fairseq/checkpoints/lr7e-6_mp150
loading archive file data/list_qp_train_en_filter.tsv/
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
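One way to see what actually ended up in that label dictionary is to dump it directly. This is a minimal sketch; the `label/dict.txt` path is an assumption based on the standard `sentence_prediction` preprocessing layout, so adjust it to your own data directory:

```python
from fairseq.data import Dictionary

# Path is an assumption based on the usual sentence_prediction data layout;
# point it at wherever the binarized label dictionary actually lives.
label_dict = Dictionary.load("data/list_qp_train_en_filter.tsv/label/dict.txt")

# fairseq's Dictionary prepends special symbols (<s>, <pad>, </s>, <unk>) and
# may pad with placeholder entries, so the reported type count is larger than
# the number of real class labels; printing the entries shows whether any
# unexpected label values slipped in during preprocessing.
for idx in range(len(label_dict)):
    print(idx, label_dict[idx])
```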
Below are my environment and training command:
python : 3.6.7
pytorch: 1.0
GPU: P40 22G
input_data_dir=data/list_qp_train_en_filter.tsv/
TOTAL_NUM_UPDATES=187500 # after TOTAL_NUM_UPDATES, lr will be 0
WARMUP_UPDATES=500 # 6 percent of the number of updates
LR=1e-5
NUM_CLASSES=2
BATCH_SIZE=16
max_positions=150
save_dir=checkpoints/lr${LR}_mp${max_positions}
train_log=$save_dir/train.log
mkdir -p $save_dir
CUDA_VISIBLE_DEVICES=2 python -u train.py $input_data_dir \
--restore-file models/pretrained/roberta.large/ \
--max-positions $max_positions \
--max-sentences $BATCH_SIZE \
--max-tokens 4400 \
--task sentence_prediction \
--reset-optimizer --reset-dataloader --reset-meters \
--required-batch-size-multiple 1 \
--init-token 0 --separator-token 2 \
--arch roberta_large \
--criterion sentence_prediction \
--num-classes $NUM_CLASSES \
--dropout 0.1 --attention-dropout 0.1 \
--weight-decay 0.1 --optimizer adam --adam-betas "(0.9, 0.98)" --adam-eps 1e-06 \
--clip-norm 0.0 \
--lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
--max-epoch 10 \
--best-checkpoint-metric accuracy --maximize-best-checkpoint-metric \
--truncate-sequence \
--update-freq 4 \
--save-dir $save_dir \
--save-interval-updates 6000 \
--keep-interval-updates -1 \
--log-format tqdm \
--find-unused-parameters
Top GitHub Comments
That means you’re not using RoBERTa or pretraining at all – you’re just using a randomly initialized model with the BERT architecture.
It’s dynamic based on whether you specify an absolute path or not: https://github.com/pytorch/fairseq/blob/832491962b30fb2164bed696e1489685a885402f/fairseq/checkpoint_utils.py#L100-L103
I’ll probably modify this code to be a bit more robust to non-absolute paths.
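For reference, the behavior at the linked lines works roughly like this (a paraphrased sketch, not a verbatim quote of that fairseq revision): an absolute `--restore-file` is used as-is, while a relative one is resolved against `--save-dir`.

```python
import os

def resolve_restore_file(restore_file: str, save_dir: str) -> str:
    """Sketch of how --restore-file is resolved (paraphrased): an absolute
    path is used directly, a relative path is joined with --save-dir."""
    if os.path.isabs(restore_file):
        return restore_file
    return os.path.join(save_dir, restore_file)

# With the command above, the relative --restore-file value ends up under the
# save dir, where no such checkpoint exists, so training can silently start
# from a randomly initialized model:
print(resolve_restore_file("models/pretrained/roberta.large/",
                           "checkpoints/lr1e-5_mp150"))
```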
Yes, because you have `--max-positions 150` in your command. The pretrained model expects `--max-positions 512`, so when you try to load the checkpoint it sees extra positional embeddings and can't load them. I can try to add a fallback that trims the unused positional embeddings, but the easiest thing is to change `--max-positions=512`.

I think your command has a typo: `--restore-file` should point to a `.pt` file. So it's probably using a randomly initialized model instead of the RoBERTa model. Can you confirm whether you see the line `loaded checkpoint (...)/model.pt (epoch 0 @ 0 updates)` in your training log?
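As a quick sanity check, the pretrained directory can be loaded through the hub interface first; if this works, the same `model.pt` path is what `--restore-file` should point at. A sketch, assuming the standard roberta.large release layout, which ships a `model.pt` inside the extracted folder:

```python
from fairseq.models.roberta import RobertaModel

# The directory is the poster's own path; "model.pt" is the checkpoint name
# shipped in the roberta.large release tarball.
roberta = RobertaModel.from_pretrained(
    "models/pretrained/roberta.large",
    checkpoint_file="model.pt",
)
roberta.eval()

# If this loads cleanly, pass the full file path to training, e.g.
#   --restore-file models/pretrained/roberta.large/model.pt
# (preferably as an absolute path) and keep --max-positions at 512 so the
# positional embeddings line up with the pretrained checkpoint.
```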