
Batch size < GPU number when training with Trainer and deepspeed.

See original GitHub issue

🚀 Feature request

Hi, I am fine-tuning T5-11b using Trainer with the DeepSpeed integration. I use DeepSpeed ZeRO stage 3 to split T5-11b and its gradients across different GPUs. However, when I try to use multiple GPUs, I found that the argument per_device_train_batch_size must be an integer, so it is at least 1. This means that when I use more GPUs, the global batch size must increase at the same time, which costs much more GPU memory. As a result, I can't fine-tune T5-11b with 2, 4 or 8 A100 (40GB) GPUs. In general, the DeepSpeed integration doesn't solve the memory issue if the model is similar in size to, or larger than, the memory of a single GPU. So, I request a feature that allows the batch size to be smaller than the number of GPUs, e.g. a train batch size of 2 on 4 GPUs. Here is the link to the argument per_device_train_batch_size: https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py#L122
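To make the constraint concrete, here is a minimal sketch (my own illustration, not code from the issue) of how the Trainer-style global batch size follows from its arguments; since per_device_train_batch_size and gradient_accumulation_steps are both integers of at least 1, the smallest reachable global batch equals the number of GPUs:

# Minimal sketch (assumption, not from the issue): deriving the global batch size.
# Both knobs below must be integers >= 1, so the smallest global batch size
# equals the number of GPUs participating in training.
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
for num_gpus in (2, 4, 8):
    global_batch = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
    print(f"{num_gpus} GPUs -> global batch size {global_batch}")
# Prints 2 -> 2, 4 -> 4, 8 -> 8; a global batch of 2 on 4 GPUs cannot be expressed.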

Motivation

Fine-tune an extremely large model on a few GPUs with limited memory.

Your contribution

I think the DeepSpeed package already supports this, so adding the feature to Trainer should not be hard.
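For context, a hedged sketch (not from the issue) of the ZeRO stage-3 config that the Hugging Face Trainer integration consumes; the batch-size keys are the ones this request touches, the "auto" values are resolved by the Trainer from its own arguments such as per_device_train_batch_size, and the file name ds_config_zero3.json is illustrative:

# Illustrative ZeRO stage-3 DeepSpeed config written out as JSON for the Trainer.
# "auto" placeholders are filled in by the Hugging Face integration at runtime.
import json

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": "auto",  # resolved from per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",                # micro batch * grad accum steps * world size
}

with open("ds_config_zero3.json", "w") as f:   # hypothetical file name
    json.dump(ds_config, f, indent=2)

The resulting file would then be passed to the Trainer via TrainingArguments(deepspeed="ds_config_zero3.json") or the --deepspeed launcher flag.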

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Apr 20, 2022

cc @stas00 for DeepSpeed 😉

0 reactions
github-actions[bot] commented, May 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.


Top Results From Across the Web

Batch size in trainer eval loop - Hugging Face Forums
The evaluation will use all GPUs like the training, so the effective batch size will be the per_device_batch_size multiplied by the number of ...

Training your large model with DeepSpeed
For example, training a GPT-3 model on 4K GPUs, and with a batch size limit of 2K will result in a batch on...

GPU training (FAQ) - PyTorch Lightning - Read the Docs
In DDP, DDP_SPAWN, Deepspeed, DDP_SHARDED, or Horovod your effective batch size will be 7 * devices * num_nodes. # effective batch size =...

Efficient Large-Scale Language Model Training on GPU ...
We integrated ZeRO into our codebase using the DeepSpeed Python library [3]. We keep the global batch size the same as we...

LongFormer Training with DeepSpeed and HF-Trainer - Kaggle
The DeepSpeed library allows one to train large models with bigger batch sizes on smaller GPUs. This notebook integrates DeepSpeed with HF ...
