Batch size < GPU number when training with Trainer and deepspeed.
🚀 Feature request
Hi, I am fine-tuning T5-11b with Trainer using the DeepSpeed integration. I use ZeRO stage 3 to shard T5-11b and its gradients across GPUs. However, when I try to use multiple GPUs, I found that the argument per_device_train_batch_size
must be an integer, so its minimum value is 1. This means that when I use more GPUs, the global batch size has to grow with them, which costs much more GPU memory. As a result, I cannot fine-tune T5-11b with 2, 4, or 8 A100 (40GB) GPUs. So, in general, the DeepSpeed feature does not solve the memory issue when the model is about as large as, or larger than, the memory of a single one of the GPUs.
So, I request support for a global batch size smaller than the number of GPUs, e.g. a train batch size of 2 on 4 GPUs (see the sketch after the link below for the arithmetic).
Here is the link to the argument per_device_train_batch_size:
https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L122
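To make the constraint concrete, here is a minimal sketch of the batch-size arithmetic, assuming the usual data-parallel relationship; the numbers and variable names are hypothetical:

```python
# Illustrative arithmetic only; the numbers below are hypothetical.
# Under data parallelism (including DeepSpeed ZeRO-3), every GPU processes at least
# one sample per step, so the global batch cannot be smaller than the GPU count.

per_device_train_batch_size = 1   # minimum value TrainingArguments accepts
gradient_accumulation_steps = 1
world_size = 8                    # e.g. 8 x A100 40GB

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(global_batch_size)  # 8 -> grows with the number of GPUs even at the minimum setting
```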
Motivation
Fine-tune an extremely large model on a few GPUs with limited memory.
Your contribution
I think the DeepSpeed package already supports this feature, so adding it to Trainer should not be hard.
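For reference, a sketch of the DeepSpeed config keys that govern batching, assuming the documented relationship train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size; the values are illustrative, not taken from this issue:

```python
# Hypothetical ZeRO-3 config fragment, written as a Python dict for illustration.
# DeepSpeed requires
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps * world_size,
# and the Trainer integration fills "auto" values from TrainingArguments
# (per_device_train_batch_size, gradient_accumulation_steps).
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",  # mirrors per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

# This dict (or an equivalent JSON file path) is passed to the Trainer via
# TrainingArguments(deepspeed=ds_config, ...).
```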
Issue Analytics
- Created a year ago
- Comments: 6 (4 by maintainers)
cc @stas00 for DeepSpeed 😉
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.