Can't run 11 billion model on A100 with 80GB
Hi @craffel @muqeeth @HaokunLiu,
We’re trying to reproduce T-Few results for a paper, but we’re getting ‘CUDA out of memory’ on an A100 with 80GB (your recommended setup).
This is what we’re running:
python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42
We installed according to the README instructions and are using the default settings in the config files. We are able to run the 3 billion model with the equivalent command, just not the 11 billion one. Is there anything we are doing wrong?
This is the exception:
Thank you
Top GitHub Comments
Sorry, I think the config might be slightly off, as it was meant for the 3B and not the 11B version. For the 11B variant, to fit into memory, we used a smaller batch size but still kept an effective batch size of 8. Our hyperparameters were batch_size=1 grad_accum_factor=8 eval_batch_size=2. Let us know if it still runs out of memory.
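Concretely, the invocation from above with those values added might look like the following. This assumes batch_size, grad_accum_factor, and eval_batch_size can be overridden through -k like the other keys in the command; if not, the same values can be set directly in the config JSON files instead.

    python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42 batch_size=1 grad_accum_factor=8 eval_batch_size=2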
Thanks for your interest in our work! It’s hard to tell from the surface. Could you share the full log with me? And if you are familiar with PyTorch Lightning, would you mind adding something like
print("Memory usage at line [add something here]", torch.cuda.memory_allocated(device=None))
at the start and end of training_step in EncoderDecoder.py?
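For concreteness, a minimal sketch of where those prints could go, assuming a LightningModule shaped roughly like the one in EncoderDecoder.py; the layer, loss, and optimizer below are placeholders, not the real T-Few model, and only the memory-logging calls are the point.

    import torch
    import pytorch_lightning as pl

    class EncoderDecoder(pl.LightningModule):
        # Illustrative stand-in for the real EncoderDecoder module.
        def __init__(self):
            super().__init__()
            self.model = torch.nn.Linear(8, 8)  # placeholder for the T0 backbone

        def training_step(self, batch, batch_idx):
            # Allocated CUDA memory before the work of this step
            print("Memory usage at start of training_step:",
                  torch.cuda.memory_allocated(device=None))

            loss = self.model(batch).pow(2).mean()  # placeholder loss

            # Allocated CUDA memory after the forward pass
            print("Memory usage at end of training_step:",
                  torch.cuda.memory_allocated(device=None))
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

Comparing the two printed values across steps (and between the 3B and 11B runs) should show whether memory grows during the step or is already exhausted before it starts.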