
BlenderBot 3 30B. CUDA out of memory.

See original GitHub issue

Bug description

Hello, I have been trying to run BlenderBot 3 30B and I keep getting the same problem. When I run metaseq-api-local, CUDA runs out of memory, and when I try to reduce the number of batches the same error keeps popping up regardless of the batch size. These are my parameters in the constants.py file:

    MAX_SEQ_LEN = 1024
    BATCH_SIZE = 64  # silly high bc we dynamically batch by MAX_BATCH_TOKENS
    MAX_BATCH_TOKENS = 1024
    DEFAULT_PORT = 6010
    MODEL_PARALLEL = 4
    TOTAL_WORLD_SIZE = 4
    MAX_BEAM = 8

I am currently using 4 T4 GPUs on GCP.

Reproduction steps

  1. Modify the constants.py file as shown above
  2. Run metaseq-api-local

Expected behavior

The metaseq-api-local server loads the 30B model across the 4 GPUs and serves requests without running out of CUDA memory.

Logs

Command line output:


RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 0; 14.62 GiB total capacity; 13.69 GiB already allocated; 237.00 MiB free; 13.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
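
As a rough back-of-the-envelope check (an illustrative sketch assuming fp16 weights, counting parameters only, and an even split across the 4 model-parallel shards), the 30B model's weights alone come close to filling a 16 GB T4, which would explain why lowering the batch-related settings never changes the error:

    # Sketch: approximate parameter memory per GPU for a ~30B-parameter model in
    # fp16, split across MODEL_PARALLEL = 4 shards (weights only; activations and
    # the generation cache come on top of this).
    params = 30e9           # ~30B parameters
    bytes_per_param = 2     # fp16
    model_parallel = 4      # one shard per T4

    per_gpu_gib = params * bytes_per_param / model_parallel / 2**30
    print(f"~{per_gpu_gib:.1f} GiB of weights per GPU")  # ~14.0 GiB

That is consistent with the log above: ~13.7 GiB already allocated against 14.62 GiB of capacity before a 296 MiB allocation fails, leaving essentially no headroom for activations regardless of BATCH_SIZE or MAX_BATCH_TOKENS.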

Additional context

I would really appreciate it if someone could indicate how to change these parameters so that I can run the model with the resources I have. Thank you.
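
The error message itself points at PYTORCH_CUDA_ALLOC_CONF. As a sketch only (the 128 MiB split size is an arbitrary illustrative value, and given the estimate above fragmentation is unlikely to be the real bottleneck on a T4), the option has to be in the environment before the first CUDA allocation:

    import os

    # Set before torch touches the GPU, e.g. exported in the shell that launches
    # metaseq-api-local or placed at the very top of the entry script.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"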

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
klshuster commented, Sep 1, 2022

With the way that metaseq is set up, each sharded checkpoint file will be loaded onto a single GPU, so I'm not totally sure the situations you listed above are possible in the current setup. Model parallel does not necessarily need to be the same as total world size - there is a way to run the model with, e.g., model parallel plus fully sharded data parallel (FSDP) shards, but that is not recommended (FSDP during inference is quite slow due to node communication latency).
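
As a purely hypothetical illustration of that distinction (example values, not a configuration tested in this thread), constants.py for 4-way model parallelism combined with FSDP across 8 GPUs might look like:

    # Hypothetical values: total world size = model-parallel size x data-parallel size.
    MODEL_PARALLEL = 4      # each copy of the model is split across 4 GPUs
    TOTAL_WORLD_SIZE = 8    # 8 GPUs total -> 2-way fully sharded data parallel

In the reporter's setup both values are 4, so there is exactly one model-parallel replica and no FSDP.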

1 reaction
klshuster commented, Aug 26, 2022

Assuming you used the reshard_model_parallel script, pull in these changes to your metaseq checkout: https://github.com/facebookresearch/metaseq/pull/170. That should fix the problem.

Read more comments on GitHub >
