
Memory issues when fine-tuning M2M 12B with the right GPUs


I am trying to fine-tune the largest M2M 12B model on two A100 GPUs (2x40GB), but I am running into memory issues. I guess I am missing some command-line argument, but I have tried many combinations in vain.

I can run fairseq-generate with no issues, though, and fairseq uses both GPUs during translation.

I run the training process with:

fairseq-train data_bin --finetune-from-model /models/m2m-12B-2GPU/model.pt --save-dir /checkpoint \
--task translation_multi_simple_epoch --encoder-normalize-before \
--langs 'af,am,ar,ast,az,ba,be,bg,bn,br,bs,ca,ceb,cs,cy,da,de,el,en,es,et,fa,ff,fi,fr,fy,ga,gd,gl,gu,ha,he,hi,hr,ht,hu,hy,id,ig,ilo,is,it,ja,jv,ka,kk,km,kn,ko,lb,lg,ln,lo,lt,lv,mg,mk,ml,mn,mr,ms,my,ne,nl,no,ns,oc,or,pa,pl,ps,pt,ro,ru,sd,si,sk,sl,so,sq,sr,ss,su,sv,sw,ta,th,tl,tn,tr,uk,ur,uz,vi,wo,xh,yi,yo,zh,zu' \
--lang-pairs 'en-es,es-en' --max-tokens 1200 --decoder-normalize-before --sampling-method temperature \
--sampling-temperature 1.5 --encoder-langtok src --decoder-langtok --criterion label_smoothed_cross_entropy \
--label-smoothing 0.2 --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --lr 3e-05 \
--warmup-updates 2500 --max-update 40000 --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 --update-freq 2 \
--save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints --seed 222 \
--log-format simple --log-interval 2 --patience 10 --arch transformer_wmt_en_de_big_pipeline_parallel --encoder-layers 24 \
--decoder-layers 24 --encoder-attention-heads 16 --decoder-attention-heads 16 --encoder-ffn-embed-dim 16384 \
--decoder-ffn-embed-dim 16384 --decoder-embed-dim 4096 --encoder-embed-dim 4096 --num-embedding-chunks 2 \
--pipeline-balance '[29,22,1]' --pipeline-devices '[0,1,0]' --fp16 --dataset-impl mmap --pipeline-chunks 1 \
--share-decoder-input-output-embed --share-all-embeddings --ddp-backend no_c10d --clip-norm 1.0
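For context, my understanding of the --pipeline-balance / --pipeline-devices pair is that each entry in the balance list is a number of consecutive model partitions and the matching entry in the devices list is the GPU they are placed on. A small standalone sketch of that mapping (plain Python, not fairseq code; the numbers are just the ones from my command):

balance = [29, 22, 1]   # consecutive module chunks per pipeline partition
devices = [0, 1, 0]     # CUDA device each partition is placed on

start = 0
for part, (n_modules, dev) in enumerate(zip(balance, devices)):
    print(f"partition {part}: modules {start}..{start + n_modules - 1} -> cuda:{dev}")
    start += n_modules

# partition 0: modules 0..28 -> cuda:0
# partition 1: modules 29..50 -> cuda:1
# partition 2: modules 51..51 -> cuda:0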

And I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB 
(GPU 0; 39.59 GiB total capacity; 15.21 GiB already allocated; 110.19 MiB free; 15.21 GiB reserved in total by PyTorch)

My environment is: fairseq 0.10.2; fairscale 0.3.1; CUDA compilation tools release 11.1, V11.1.105 (build cuda_11.1.TC455_06.29190527_0); cuDNN 8.0.5; GPUs: 2 x A100-SXM4-40GB.
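For reference, this is the back-of-the-envelope arithmetic I did, which suggests the problem may not be just a missing flag: with plain --fp16 and Adam, my understanding is that fairseq keeps fp16 weights and gradients plus an fp32 master copy and fp32 Adam moments (activation memory is ignored here, and the byte counts are my assumptions, not an exact accounting of fairseq's allocator):

params = 12e9                    # ~12B parameters

fp16_weights = params * 2        # half-precision model weights
fp16_grads   = params * 2        # half-precision gradients
fp32_master  = params * 4        # fp32 master copy kept by the fp16 optimizer
adam_moments = params * 4 * 2    # Adam exp_avg and exp_avg_sq in fp32

total = fp16_weights + fp16_grads + fp32_master + adam_moments
print(f"~{total / 2**30:.0f} GiB before activations")   # ~179 GiB

That is far above the 80 GiB available across the two cards, so unless the optimizer state is sharded or offloaded somehow, it seems this cannot fit no matter how the layers are split across the pipeline.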

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
jaspock commented on Jul 8, 2021

No. I simply gave up. If you decide to use one of the models with fewer parameters, I would go for MM100 instead (search for flores101_mm100_615M here), which has been trained on mostly the same languages, also by Facebook developers. I remember reading somewhere about potential issues with M2M for some language pairs; since MM100 was released later (a few weeks ago, actually), maybe these issues have been fixed. However, note that the largest MM100 available is about 20 times smaller than the largest M2M (0.615B vs 12B parameters). The carbon footprint will be smaller as well 😉

Table 4 in the FLORES-101 paper describes the languages involved. The model you download probably matches the one described in section 6 of the paper. Figure 8 of the paper shows BLEU scores for all language pairs before any fine-tuning.

Sadly, I have not been able to fine-tune the 615M-parameter MM100 model on a 16GB GPU except with an extremely small batch size, so I guess you need a 24GB GPU at minimum, or you can use the even smaller 175M-parameter MM100 model.
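For what it's worth, the same rough per-parameter arithmetic as in the question (fp16 weights and gradients plus fp32 master weights and Adam moments, activations ignored) gives a lower bound for the MM100 checkpoints; the parameter counts are the published 175M / 615M sizes:

# Same assumptions as the sketch in the question: 2 + 2 + 4 + 8 bytes per parameter.
for name, params in [("flores101_mm100_175M", 175e6), ("flores101_mm100_615M", 615e6)]:
    total = params * (2 + 2 + 4 + 8)
    print(f"{name}: ~{total / 2**30:.1f} GiB of weights + optimizer state")

# flores101_mm100_175M: ~2.6 GiB of weights + optimizer state
# flores101_mm100_615M: ~9.2 GiB of weights + optimizer state

Activation memory grows with --max-tokens on top of that, which matches my experience that the 615M model only trains on a 16GB card with a very small batch.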

0 reactions
stale[bot] commented on Apr 17, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
