LR decreases to 0 when fine-tuning on CNN/DM for summarization
What is your question?
Hi, I am using fairseq to fine-tune BART-large on CNN/DM for summarization. Here is my situation: I have three 1080 Ti GPUs with 12 GB of memory each to train this model; however, they don't support fp16. My script is below:
TOTAL_NUM_UPDATES=20000
WARMUP_UPDATES=500
LR=3e-05
MAX_TOKENS=1024
UPDATE_FREQ=1
BART_PATH=./bart.large/model.pt

CUDA_VISIBLE_DEVICES=3,6,7 fairseq-train cnn_dm-bin \
    --restore-file $BART_PATH \
    --max-tokens $MAX_TOKENS \
    --task translation \
    --source-lang source --target-lang target \
    --truncate-source \
    --layernorm-embedding \
    --share-all-embeddings \
    --share-decoder-input-output-embed \
    --reset-optimizer --reset-dataloader --reset-meters \
    --required-batch-size-multiple 1 \
    --arch bart_large \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --dropout 0.1 --attention-dropout 0.1 \
    --weight-decay 0.01 --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08 \
    --clip-norm 0.1 \
    --lr-scheduler polynomial_decay --lr $LR --total-num-update $TOTAL_NUM_UPDATES --warmup-updates $WARMUP_UPDATES \
    --update-freq $UPDATE_FREQ \
    --skip-invalid-size-inputs-valid-test \
    --memory-efficient-fp16 \
    --find-unused-parameters
However, when I train the model, the learning rate decreases steadily and finally reaches 0 within epoch 1 (at update 20008 of 84773). At that point the loss and nll_loss stop decreasing. Something is clearly wrong. What can I do in this situation? I'm looking forward to your kind reply.
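For context on why the LR hits 0 at roughly 20,000 updates: with `--lr-scheduler polynomial_decay`, fairseq warms the LR up over `--warmup-updates` steps and then decays it toward its end value so that it arrives there at `--total-num-update` steps, regardless of how many updates one epoch contains. Below is a minimal sketch of that schedule, assuming the default power of 1.0 and an end learning rate of 0; it is an illustration, not fairseq's actual implementation.

```python
def polynomial_decay_lr(step, peak_lr=3e-05, warmup=500, total=20000,
                        power=1.0, end_lr=0.0):
    """Rough sketch of a polynomial_decay schedule (assumed defaults)."""
    if warmup > 0 and step <= warmup:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / warmup
    if step >= total:
        # schedule exhausted: LR stays at its end value
        return end_lr
    # polynomial (linear when power == 1.0) decay toward end_lr
    pct_remaining = 1 - (step - warmup) / (total - warmup)
    return (peak_lr - end_lr) * pct_remaining ** power + end_lr

print(polynomial_decay_lr(500))    # ~3e-05 at the end of warmup
print(polynomial_decay_lr(10000))  # roughly half of the peak LR
print(polynomial_decay_lr(20008))  # 0.0, matching the behaviour reported above
```

So with TOTAL_NUM_UPDATES=20000 and a dataset that needs about 84,773 updates per epoch, the schedule is expected to run out well inside epoch 1; one common adjustment is to raise `--total-num-update` to the number of updates you actually intend to train for.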
Top GitHub Comments
I followed the instructions here exactly to train the model: https://github.com/pytorch/fairseq/blob/main/examples/bart/README.summarization.md.
When num_updates reaches 20000, the lr becomes 0, but the training process doesn't stop and keeps going. Is this normal? Should I just exit training myself, and does it mean that training is finished?
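A note on this behaviour: fairseq keeps training until a stopping criterion such as `--max-epoch` or `--max-update` is hit; the LR schedule alone does not end the run. If the intent is to stop once the schedule is exhausted rather than continuing at lr 0, adding `--max-update $TOTAL_NUM_UPDATES` to the command above should make the run terminate at that point.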
Thanks a lot! @monologue1107 @myleott
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!