Can't run 11 billion model on A100 with 80GB
Hi @craffel @muqeeth @HaokunLiu,
We’re trying to reproduce T-Few results for a paper, but we’re getting ‘CUDA out of memory’ on an A100 with 80GB (your recommended setup).
This is what we’re running:
python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42
We installed according to the README instructions and are using the default settings in the config files. We are able to run the 3 billion model with the equivalent command, just not the 11 billion one. Is there anything we are doing wrong?
This is the exception:
Thank you
Top GitHub Comments
Sorry, I think the config might be slightly off, as it was meant for the 3B and not the 11B version. For the 11B variant, to fit into memory, we used a smaller batch size but still kept an effective batch size of 8. Our hyperparameters were batch_size=1 grad_accum_factor=8 eval_batch_size=2. Let us know if it still runs out of memory.
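Concretely, the invocation from above with those values added might look like the following. This assumes batch_size, grad_accum_factor, and eval_batch_size can be overridden through -k like the other keys in the command; if not, the same values can be set directly in the config JSON files instead.

    python -m src.pl_train -c t011b.json+ia3.json+rte.json -k load_weight="pretrained_checkpoints/t011b_ia3_finish.pt" exp_name=t011b_rte_seed42_ia3_pretrained few_shot_random_seed=42 seed=42 batch_size=1 grad_accum_factor=8 eval_batch_size=2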
Thanks for your interest in our work! It’s hard to tell from the surface. Could you share the full log with me? And if you are familiar with PyTorch Lightning, would you mind adding something like
print("Memory usage at line [add something here]", torch.cuda.memory_allocated(device=None))
at the start and end of training_step in EncoderDecoder.py?
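For concreteness, a minimal sketch of where those prints could go, assuming a LightningModule shaped roughly like the one in EncoderDecoder.py; the layer, loss, and optimizer below are placeholders, not the real T-Few model, and only the memory-logging calls are the point.

    import torch
    import pytorch_lightning as pl

    class EncoderDecoder(pl.LightningModule):
        # Illustrative stand-in for the real EncoderDecoder module.
        def __init__(self):
            super().__init__()
            self.model = torch.nn.Linear(8, 8)  # placeholder for the T0 backbone

        def training_step(self, batch, batch_idx):
            # Allocated CUDA memory before the work of this step
            print("Memory usage at start of training_step:",
                  torch.cuda.memory_allocated(device=None))

            loss = self.model(batch).pow(2).mean()  # placeholder loss

            # Allocated CUDA memory after the forward pass
            print("Memory usage at end of training_step:",
                  torch.cuda.memory_allocated(device=None))
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

Comparing the two printed values across steps (and between the 3B and 11B runs) should show whether memory grows during the step or is already exhausted before it starts.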