BlenderBot 3 30B. CUDA out of memory.
Bug description
Hello, I have been trying to run BlenderBot 3 30B and I keep getting the same problem. When I run `metaseq-api-local`, CUDA runs out of memory, and when I try to reduce the batch size, the same error keeps popping up regardless of the value I set.
These are my parameters in the `constants.py` file:

```python
MAX_SEQ_LEN = 1024
BATCH_SIZE = 64  # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 1024
DEFAULT_PORT = 6010
MODEL_PARALLEL = 4
TOTAL_WORLD_SIZE = 4
MAX_BEAM = 8
```
I am currently using 4 T4 GPUs on GCP.
Reproduction steps
1. Modify the `constants.py` file as shown above
2. Run `metaseq-api-local`
Expected behavior
The model loads and the API serves requests without running out of CUDA memory.
Logs Please paste the command line output:
```
RuntimeError: CUDA out of memory. Tried to allocate 296.00 MiB (GPU 0; 14.62 GiB total capacity; 13.69 GiB already allocated; 237.00 MiB free; 13.82 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
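As the error message itself suggests, one cheap thing to try is the allocator's `max_split_size_mb` setting via `PYTORCH_CUDA_ALLOC_CONF` (a real PyTorch environment variable). A minimal sketch, where 128 MiB is an arbitrary example value rather than a tuned recommendation; note this only reduces fragmentation and cannot help if the weight shard itself nearly fills the GPU:

```shell
# Ask PyTorch's caching allocator to avoid keeping large free blocks split
# into unusable fragments. Must be set before the process starts.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
echo "$PYTORCH_CUDA_ALLOC_CONF"
```

Launch `metaseq-api-local` from the same shell so the setting is inherited by the server process.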
Additional context
I would really appreciate it if someone could indicate how to change these parameters so that I can run the model with the resources I have. Thank you!
Issue Analytics
- Created a year ago
- Comments: 7 (4 by maintainers)

With the way metaseq is set up, each sharded checkpoint file is loaded onto a single GPU, so I'm not totally sure the situations you listed above are possible in the current setup. Model parallel does not necessarily need to equal total world size: there is a way to run the model with, e.g., model-parallel plus fully-sharded data-parallel (FSDP) shards, but that is not recommended, since FSDP during inference is quite slow due to node communication latency.
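The comment above is consistent with simple back-of-envelope arithmetic. A sketch, assuming the 30B parameters are stored in fp16 and split evenly across the 4 model-parallel ranks (metaseq's exact memory layout may differ):

```python
params = 30e9        # BlenderBot 3 30B parameter count
bytes_per_param = 2  # fp16
model_parallel = 4   # MODEL_PARALLEL from constants.py

# Weight shard each GPU must hold, before any activations or KV cache.
shard_gib = params * bytes_per_param / model_parallel / 2**30
print(f"per-GPU weight shard: {shard_gib:.2f} GiB")

# T4 capacity as reported in the OOM log.
t4_capacity_gib = 14.62
print(f"headroom: {t4_capacity_gib - shard_gib:.2f} GiB")
```

The weights alone come to roughly 14 GiB per GPU against the 14.62 GiB a T4 reports, which is why shrinking the batch size does not make the error go away: almost all of the memory is consumed before a single token is batched.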
Assuming you used the `reshard_model_parallel` script, pull in these changes to your metaseq checkout: https://github.com/facebookresearch/metaseq/pull/170. That should fix the problem.