Memory issues when fine-tuning M2M 12B with the right GPUs
See original GitHub issue.

I am trying to fine-tune the largest M2M 12B model on two A100 GPUs (2×40 GB), but I am running into memory issues. I guess I am missing some command-line argument, but I have tried many combinations in vain.
I can run fairseq-generate with no issues, though, and fairseq uses both GPUs during translation.
I run the training process with:
fairseq-train data_bin --finetune-from-model /models/m2m-12B-2GPU/model.pt --save-dir /checkpoint \
  --task translation_multi_simple_epoch --encoder-normalize-before \
  --langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' \
  --lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature \
  --sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 \
  --warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 \
  --log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big_pipeline_parallel --encoder-layers 24 \
  --decoder-layers 24 --encoder-attention-heads 16 --decoder-attention-heads 16 --encoder-ffn-embed-dim 16384 \
  --decoder-ffn-embed-dim 16384 --decoder-embed-dim 4096 --encoder-embed-dim 4096 --num-embedding-chunks 2 \
  --pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]' --fp16 --dataset-impl mmap --pipeline-chunks 1 \
  --share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d --clip-norm 1.0
And I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB
(GPU 0; 39.59 GiB total capacity; 15.21 GiB already allocated; 110.19 MiB free; 15.21 GiB reserved in total by PyTorch)
My environment is:
- fairseq 0.10.2
- fairscale 0.3.1
- CUDA compilation tools release 11.1, V11.1.105 (Build cuda_11.1.TC455_06.29190527_0)
- cuDNN 8.0.5
- GPUs: 2 × A100-SXM4-40GB
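For context, a back-of-the-envelope estimate suggests why fairseq-generate fits on these cards while fine-tuning does not: inference only needs the fp16 weights, whereas Adam fine-tuning under fp16 also keeps gradients, an fp32 master copy of the weights, and two fp32 moment buffers, before counting any activations. The per-parameter byte counts below are standard mixed-precision assumptions, not numbers reported by fairseq:

```python
# Rough memory estimate for fine-tuning a ~12B-parameter model with Adam under
# fp16 mixed precision. Byte counts per parameter are assumptions (fp16 weights
# and gradients, fp32 master weights, fp32 Adam exp_avg and exp_avg_sq);
# activations, framework buffers, and fragmentation are ignored.

PARAMS = 12e9        # ~12B parameters
GIB = 1024 ** 3

bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam exp_avg": 4,
    "fp32 Adam exp_avg_sq": 4,
}

total_gib = 0.0
for name, nbytes in bytes_per_param.items():
    gib = PARAMS * nbytes / GIB
    total_gib += gib
    print(f"{name:<22} ~{gib:6.1f} GiB")

print(f"{'training total':<22} ~{total_gib:6.1f} GiB")                 # ~179 GiB, excluding activations
print(f"{'inference (fp16 only)':<22} ~{PARAMS * 2 / GIB:6.1f} GiB")   # ~22 GiB
```

Under these assumptions, weights plus optimizer state alone exceed the combined 80 GB of the two A100s, so the OOM is not purely a batch-size problem; at this scale the optimizer state usually has to be sharded or offloaded (the kind of thing fairscale provides) rather than only shrinking --max-tokens.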
Issue Analytics
- State: closed
- Created: 3 years ago
- Comments: 9
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
No. I simply gave up. If you decide to use a model with fewer parameters, I would go for MM100 instead (search for flores101_mm100_615M here), which was trained on mostly the same languages, also by Facebook developers. I remember reading somewhere about potential issues with M2M for some language pairs; since MM100 was released later (a few weeks ago, actually), maybe those issues have been fixed. However, note that the largest MM100 available is about 20 times smaller than the largest M2M (0.61B vs. 12B parameters). The carbon footprint will be smaller as well 😉

Table 4 in the FLORES-101 paper describes the languages involved. The model you download probably matches the one described in Section 6 of the paper, and Figure 8 shows BLEU scores for all language pairs before any fine-tuning.
Sadly, I have not been able to fine-tune even the 615M-parameter MM100 model on a 16 GB GPU unless I use an extremely small batch size, so I guess you need at least a 24 GB GPU, or use the even smaller 175M-parameter MM100 model.
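A note on the "extremely small batch size" workaround mentioned above: in fairseq, the tokens contributing to one optimizer step are roughly --max-tokens (per data-parallel GPU) times --update-freq times the number of data-parallel GPUs, and it is --max-tokens that drives activation memory. A smaller --max-tokens can therefore usually be compensated with gradient accumulation via --update-freq. A minimal sketch with illustrative numbers (the concrete values are assumptions, not taken from the issue):

```python
# Sketch of the batch-size / memory trade-off: the effective batch (in tokens)
# per optimizer step is roughly max_tokens * data_parallel_gpus * update_freq,
# while peak activation memory tracks max_tokens.

def effective_tokens_per_update(max_tokens: int, data_parallel_gpus: int, update_freq: int) -> int:
    """Approximate tokens seen per optimizer step, gradient accumulation included."""
    return max_tokens * data_parallel_gpus * update_freq

# Hypothetical single-GPU fine-tune of the 615M model:
print(effective_tokens_per_update(max_tokens=4096, data_parallel_gpus=1, update_freq=1))  # 4096
# Same effective batch with roughly a quarter of the per-step activation memory:
print(effective_tokens_per_update(max_tokens=1024, data_parallel_gpus=1, update_freq=4))  # 4096
```

This keeps the effective batch at the cost of slower updates, but it does nothing about the memory taken by the weights and optimizer state, which is what dominates for the 12B model.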
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!