Memory issues when fine-tuning M2M 12B with the right GPUs
See original GitHub issue.

I am trying to fine-tune the largest M2M 12B model on two A100 GPUs (2×40 GB), but I am running into memory issues. I guess I am missing some command-line argument, but I have tried many combinations in vain.
I can run fairseq-generate with no issues, though, and fairseq uses both GPUs during translation.
I run the training process with:
fairseq-train data_bin --finetune-from-model /models/m2m-12B-2GPU/model.pt --save-dir /checkpoint \
  --task translation_multi_simple_epoch --encoder-normalize-before \
  --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' \
  --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature \
  --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 \
  --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 \
  --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big_pipeline_parallel --encoder-layers 24 \
  --decoder-layers 24 --encoder-attention-heads 16 --decoder-attention-heads 16 --encoder-ffn-embed-dim 16384 \
  --decoder-ffn-embed-dim 16384 --decoder-embed-dim 4096 --encoder-embed-dim 4096 --num-embedding-chunks 2 \
  --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]' --fp16 --dataset-impl mmap --pipeline-chunks 1 \
  --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d --clip-norm 1.0
And I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 39.59 GiB total capacity; 15.21 GiB already allocated; 110.19 MiB free; 15.21 GiB reserved in total by PyTorch)
My environment is:
- fairseq 0.10.2
- fairscale 0.3.1
- CUDA compilation tools release 11.1, V11.1.105 (Build cuda_11.1.TC455_06.29190527_0)
- cuDNN 8.0.5
- GPUs: 2 × A100-SXM4-40GB
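For context, a back-of-the-envelope estimate suggests why fairseq-generate fits on these cards while fine-tuning does not: inference only needs the fp16 weights, whereas Adam fine-tuning under fp16 also keeps gradients, an fp32 master copy of the weights, and two fp32 moment buffers, before counting any activations. The per-parameter byte counts below are standard mixed-precision assumptions, not numbers reported by fairseq:

```python
# Rough memory estimate for fine-tuning a ~12B-parameter model with Adam under
# fp16 mixed precision. Byte counts per parameter are assumptions (fp16 weights
# and gradients, fp32 master weights, fp32 Adam exp_avg and exp_avg_sq);
# activations, framework buffers, and fragmentation are ignored.

PARAMS = 12e9        # ~12B parameters
GIB = 1024 ** 3

bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam exp_avg": 4,
    "fp32 Adam exp_avg_sq": 4,
}

total_gib = 0.0
for name, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / GIB
    total_gib += gib
    print(f"{name:<22} ~{gib:6.1f} GiB")

print(f"{'training total':<22} ~{total_gib:6.1f} GiB")                 # ~179 GiB, excluding activations
print(f"{'inference (fp16 only)':<22} ~{PARAMS * 2 / GIB:6.1f} GiB")   # ~22 GiB
```

Under these assumptions, weights plus optimizer state alone exceed the combined 80 GB of the two A100s, so the OOM is not purely a batch-size problem; at this scale the optimizer state usually has to be sharded or offloaded (the kind of thing fairscale provides) rather than only shrinking --max-tokens.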
Issue Analytics
- State: closed
- Created: 3 years ago
- Comments: 9
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
No. I simply gave up. If you decide to use a model with fewer parameters, I would go for MM100 instead (search for flores101_mm100_615M here), which was trained on mostly the same languages, also by Facebook developers. I remember reading somewhere about potential issues with M2M for some language pairs; since MM100 was released later (a few weeks ago, actually), maybe those issues have been fixed. However, note that the largest MM100 available is about 20 times smaller than the largest M2M (0.61B vs. 12B parameters). The carbon footprint will be smaller as well 😉

Table 4 in the FLORES-101 paper describes the languages involved. The model you download probably matches the one described in Section 6 of the paper, and Figure 8 shows BLEU scores for all language pairs before any fine-tuning.
Sadly, I have not been able to fine-tune even the 615M-parameter MM100 model on a 16 GB GPU unless I use an extremely small batch size, so I guess you need at least a 24 GB GPU, or use the even smaller 175M-parameter MM100 model.
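A note on the "extremely small batch size" workaround mentioned above: in fairseq, the tokens contributing to one optimizer step are roughly --max-tokens (per data-parallel GPU) times --update-freq times the number of data-parallel GPUs, and it is --max-tokens that drives activation memory. A smaller --max-tokens can therefore usually be compensated with gradient accumulation via --update-freq. A minimal sketch with illustrative numbers (the concrete values are assumptions, not taken from the issue):

```python
# Sketch of the batch-size / memory trade-off: the effective batch (in tokens)
# per optimizer step is roughly max_tokens * data_parallel_gpus * update_freq,
# while peak activation memory tracks max_tokens.

def effective_tokens_per_update(max_tokens: int, data_parallel_gpus: int, update_freq: int) -> int:
    """Approximate tokens seen per optimizer step, gradient accumulation included."""
    return max_tokens * data_parallel_gpus * update_freq

# Hypothetical single-GPU fine-tune of the 615M model:
print(effective_tokens_per_update(max_tokens=4096, data_parallel_gpus=1, update_freq=1))  # 4096
# Same effective batch with roughly a quarter of the per-step activation memory:
print(effective_tokens_per_update(max_tokens=1024, data_parallel_gpus=1, update_freq=4))  # 4096
```

This keeps the effective batch at the cost of slower updates, but it does nothing about the memory taken by the weights and optimizer state, which is what dominates for the 12B model.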
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!