GPT2 large trains on 1 GPU but does not fit in two.

Hi all,

I am training GPT2 from scratch with the following command:

torchrun --nproc_per_node=2 --nnodes=1 ./5.run_clm-post.py --model_name_or_path gpt2-large --train_file datasets/sample.txt --tokenizer_name myembeddings --do_train --do_eval --output_dir ./sample --evaluation_strategy epoch --num_train_epochs 100 --per_device_train_batch_size 24 --cache_dir .cache/

When I train on a single A100, the model trains perfectly. When running on 2 GPUs (both A100s), I get a CUDA out-of-memory error. I tried decreasing the batch size to 16, but it still happens. Does this mean that I have to go down to batch size 8? Why does batch size 24 fit on a single GPU but not on two?
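One thing worth checking here is what the batch-size flag actually controls. As its name says, `--per_device_train_batch_size` is applied per GPU, so launching two processes with `--nproc_per_node=2` does not split 24 samples across the cards; each GPU still processes 24. A minimal sketch of that arithmetic follows (the helper name `global_batch_size` is made up for illustration and is not part of any library):

```python
def global_batch_size(per_device_batch, num_gpus, grad_accum_steps=1):
    """Samples consumed per optimizer step under data parallelism.

    Hypothetical helper for illustration only: each GPU processes
    `per_device_batch` samples per forward/backward pass.
    """
    return per_device_batch * num_gpus * grad_accum_steps


single_gpu = global_batch_size(24, num_gpus=1)
two_gpus = global_batch_size(24, num_gpus=2)

# Per-GPU activation memory depends on the per-device batch (24 in both
# runs), not on the global batch, so adding a GPU does not free memory.
print(single_gpu, two_gpus)
```

Under this reading, adding a second GPU doubles the global batch while the per-GPU workload stays the same, so any extra per-process overhead in the distributed run can tip a card that was already nearly full into an out-of-memory error.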

Below are the errors:

With batch size 16:

File "/path/to/miniconda3/lib/python3.6/site-packages/transformers/activations.py", line 42, in gelu_new
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
RuntimeError: CUDA out of memory. Tried to allocate 320.00 MiB (GPU 0; 39.59 GiB total capacity; 36.81 GiB already allocated; 205.69 MiB free; 37.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

With batch size 24:

File "/path/to/miniconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1169, in dropout
    return _VF.dropout_(input, p, training) if inplace else _VF.dropout(input, p, training)
RuntimeError: CUDA out of memory. Tried to allocate 1.88 GiB (GPU 1; 39.59 GiB total capacity; 36.11 GiB already allocated; 909.69 MiB free; 36.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
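Both traces show roughly 36 GiB already allocated out of 39.59 GiB, so the single-GPU run was probably close to the limit as well. A rough back-of-envelope estimate of the memory that does not depend on batch size may help explain the difference. The sketch below assumes fp32 weights, an Adam-style optimizer with two state tensors per parameter, about 774M parameters for gpt2-large, and that the distributed wrapper keeps an extra gradient-sized set of communication buckets (PyTorch's DistributedDataParallel documents a `gradient_as_bucket_view` option that saves memory equal to the total gradient size, which suggests this overhead exists by default). All of these are assumptions, and activations, which usually dominate, are not counted:

```python
GIB = 1024 ** 3


def static_training_bytes(num_params, bytes_per_param=4, ddp_bucket_copy=False):
    """Rough size of weights, gradients and Adam state (activations excluded).

    Hypothetical helper: `ddp_bucket_copy` models one extra gradient-sized
    buffer for all-reduce buckets, which is an assumption about the setup.
    """
    weights = num_params * bytes_per_param
    grads = num_params * bytes_per_param
    adam_states = 2 * num_params * bytes_per_param  # first and second moments
    buckets = grads if ddp_bucket_copy else 0
    return weights + grads + adam_states + buckets


NUM_PARAMS = 774_000_000  # assumed approximate size of gpt2-large

single = static_training_bytes(NUM_PARAMS) / GIB
ddp = static_training_bytes(NUM_PARAMS, ddp_bucket_copy=True) / GIB
print(f"single process: {single:.1f} GiB, with bucket copy: {ddp:.1f} GiB")
```

If these assumptions hold, the distributed run needs a few extra GiB per GPU before any activations are stored, which is more than the free memory reported in either trace.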

Any help would be appreciated. Also, any advice to make the model train faster would be great to follow. Thanks for this great repository.
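One common way to lower per-GPU memory without changing the number of samples per optimizer step is gradient accumulation. The sketch below picks an accumulation count for a target global batch; the helper `accumulation_plan` is hypothetical, and it assumes the training script accepts the standard `transformers` `TrainingArguments` flags `--per_device_train_batch_size` and `--gradient_accumulation_steps`, as the stock `run_clm.py` example does:

```python
def accumulation_plan(target_global_batch, num_gpus, max_per_device_batch):
    """Pick gradient accumulation steps to reach a target global batch.

    Hypothetical helper: rounds the step count up, so the resulting global
    batch may slightly exceed the target when it does not divide evenly.
    """
    per_step = num_gpus * max_per_device_batch
    accum_steps = -(-target_global_batch // per_step)  # ceiling division
    return {
        "per_device_train_batch_size": max_per_device_batch,
        "gradient_accumulation_steps": accum_steps,
        "global_batch": per_step * accum_steps,
    }


# Keep the 2 x 24 = 48 global batch while fitting only 8 samples per GPU.
plan = accumulation_plan(target_global_batch=48, num_gpus=2, max_per_device_batch=8)
flags = (
    f"--per_device_train_batch_size {plan['per_device_train_batch_size']} "
    f"--gradient_accumulation_steps {plan['gradient_accumulation_steps']}"
)
print(flags)
```

`TrainingArguments` also exposes `--fp16` and, in sufficiently recent versions, `--gradient_checkpointing`, both of which may reduce activation memory further. Whether they help in this particular setup is untested here.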

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

LysandreJik commented, Dec 3, 2021 (1 reaction)

This really great document, written by @stas00, may be of help 😃

github-actions[bot] commented, Dec 31, 2021 (0 reactions)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
