Batch size < GPU number when training with Trainer and deepspeed.
🚀 Feature request
Hi, I am fine-tuning T5-11b with Trainer using the DeepSpeed integration. I use ZeRO stage 3 to shard T5-11b and its gradients across GPUs. However, when I try to use multiple GPUs, I found that the argument per_device_train_batch_size
must be an integer, so its minimum value is 1. This means that when I use more GPUs, the global batch size has to grow with them, which costs much more GPU memory. As a result, I cannot fine-tune T5-11b with 2, 4, or 8 A100 (40GB) GPUs. So, in general, the DeepSpeed feature does not solve the memory issue when the model is about as large as, or larger than, the memory of a single one of the GPUs.
So, I request support for a global batch size smaller than the number of GPUs, e.g. a train batch size of 2 on 4 GPUs (see the sketch after the link below for the arithmetic).
Here is the link to the argument per_device_train_batch_size:
https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L122
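To make the constraint concrete, here is a minimal sketch of the batch-size arithmetic, assuming the usual data-parallel relationship; the numbers and variable names are hypothetical:

```python
# Illustrative arithmetic only; the numbers below are hypothetical.
# Under data parallelism (including DeepSpeed ZeRO-3), every GPU processes at least
# one sample per step, so the global batch cannot be smaller than the GPU count.

per_device_train_batch_size = 1   # minimum value TrainingArguments accepts
gradient_accumulation_steps = 1
world_size = 8                    # e.g. 8 x A100 40GB

global_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(global_batch_size)  # 8 -> grows with the number of GPUs even at the minimum setting
```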
Motivation
Fine-tune an extremely large model on a few GPUs with limited memory.
Your contribution
I think the DeepSpeed package already supports this feature, so adding it to Trainer should not be hard.
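For reference, a sketch of the DeepSpeed config keys that govern batching, assuming the documented relationship train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size; the values are illustrative, not taken from this issue:

```python
# Hypothetical ZeRO-3 config fragment, written as a Python dict for illustration.
# DeepSpeed requires
#   train_batch_size == train_micro_batch_size_per_gpu
#                       * gradient_accumulation_steps * world_size,
# and the Trainer integration fills "auto" values from TrainingArguments
# (per_device_train_batch_size, gradient_accumulation_steps).
ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",  # mirrors per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}

# This dict (or an equivalent JSON file path) is passed to the Trainer via
# TrainingArguments(deepspeed=ds_config, ...).
```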
Issue Analytics
- Created a year ago
- Comments: 6 (4 by maintainers)
cc @stas00 for DeepSpeed 😉
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.