Trainer batch size auto scaling
🚀 Feature request
Since Trainer handles both batch_size and gradient_accumulation_steps, it seems like it could detect some out-of-memory situations and handle those scenarios automatically.
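For context, the trade-off this relies on is that the effective batch size per optimizer step is roughly per_device_train_batch_size × gradient_accumulation_steps (times the number of devices), so the two can be traded against each other without changing the optimization schedule. A rough sketch with illustrative values:

import transformers

# Both configurations process ~32 samples per optimizer step on a single
# device; the second one just needs roughly half the peak activation memory.
big_batches = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)
small_batches = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
)
assert (
    big_batches.per_device_train_batch_size * big_batches.gradient_accumulation_steps
    == small_batches.per_device_train_batch_size * small_batches.gradient_accumulation_steps
)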
Motivation
I’ve been experimenting with model search (model_type, vocab_size, num_hidden_layers, hidden_size), and it’s been somewhat difficult to manage the correct batch size for each variant. To avoid trial & error and maintaining per-variant configuration tables, what I’ve been doing is detecting memory exhaustion and adapting the training arguments on the fly. It’s imperfect, but I wonder if there’s an official way to achieve this kind of behavior.
Your contribution
This is just a PoC; I’m sure there are several environments where this approach is problematic. In particular, CPU training on Linux is quite likely to trigger the OOM killer, in which case the entire process is simply wiped from memory and there is nothing left to catch. Nevertheless, this strategy seems helpful at least some of the time.
import gc

import torch
import transformers


class BatchAutoScaleTrainer(transformers.Trainer):
    '''Try to detect training crashes due to CUDA/CPU OOMs and rescale the
    batch size, trading it for gradient accumulation steps so that the
    effective batch size stays constant. An antiprime (highly composite)
    batch_size gives the best results, since it offers the most divisors
    to fall back to.
    Inspired by PyTorchLightning/pytorch-lightning#1638
    '''

    def _shrink_bs(self):
        # gradient_accumulation_steps is used by both .train() and
        # .evaluate(), so we need to find a setting that suits both.
        tbs = self.args.per_device_train_batch_size
        ebs = self.args.per_device_eval_batch_size
        gas = self.args.gradient_accumulation_steps
        for i in range(gas + 1, min(tbs, ebs) + 1):
            if tbs % i or ebs % i:
                continue
            # Keep tbs * gas (the effective batch size) constant while
            # lowering the per-device batch sizes.
            self.args.per_device_train_batch_size = (tbs * gas) // i
            self.args.per_device_eval_batch_size = (ebs * gas) // i
            self.args.gradient_accumulation_steps = i
            return True
        return False

    def _is_oom(self, err):
        # Shamelessly stolen from https://github.com/PyTorchLightning/pytorch-lightning/pull/1638/files#diff-5200c11792b86d6a07ea64820e126897aa2e3b7d3d295c92c19b141de6950afeR29-R32
        return len(err.args) == 1 and (
            "CUDA out of memory." in err.args[0]
            or "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED." in err.args[0]
            or "DefaultCPUAllocator: can't allocate memory" in err.args[0]
            or "CUDA error: CUBLAS_STATUS_ALLOC_FAILED " in err.args[0]
        )

    def _auto_scale_batch_size(self, code):
        while True:
            try:
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                return code()
            except RuntimeError as err:
                # Retry with a smaller batch size only for recognizable OOM
                # errors; anything else is re-raised unchanged.
                if self._is_oom(err) and self._shrink_bs():
                    continue
                raise
        assert False  # bug in _shrink_bs() most likely

    def train(self, *args, **kwds):
        train = super().train
        return self._auto_scale_batch_size(lambda: train(*args, **kwds))

    def evaluate(self, *args, **kwds):
        evaluate = super().evaluate
        return self._auto_scale_batch_size(lambda: evaluate(*args, **kwds))
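A minimal usage sketch (the model and dataset names below are placeholders; an antiprime per-device batch size such as 48 leaves plenty of divisors to fall back to):

args = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=48,
    per_device_eval_batch_size=48,
    gradient_accumulation_steps=1,
)
trainer = BatchAutoScaleTrainer(
    model=my_model,              # placeholder
    args=args,
    train_dataset=my_train_set,  # placeholder
    eval_dataset=my_eval_set,    # placeholder
)
trainer.train()      # retries with 24x2, 16x3, 12x4, ... on OOM
metrics = trainer.evaluate()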
Any chance something like this might be integrated with the Trainer?
I am very nervous about adding that kind of auto-scaling feature to the Trainer. Note that the _is_oom test, for instance, will catch far more CUDA errors than just OOMs: having the wrong number of labels in your model will trigger an error with CUBLAS_STATUS_ALLOC_FAILED on most environments.
In a notebook, the kernel is in an unrecoverable state after the try/except (and torch.cuda.empty_cache() does not help), so this wouldn’t work there either.
So for now, my sense is that such a feature would be more painful for the user than beneficial, and I would leave the tuning of the batch size to the user.
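As an aside on the first point, newer PyTorch releases (1.13 and later) expose torch.cuda.OutOfMemoryError as a dedicated RuntimeError subclass, which would allow a stricter check than string matching; a sketch, assuming such a PyTorch version is available:

import torch

def _is_oom(self, err):
    # The CUDA caching allocator raises a dedicated exception type on
    # PyTorch 1.13+, so there is no need to pattern-match error strings
    # (and no risk of swallowing unrelated CUDA errors such as
    # CUBLAS_STATUS_ALLOC_FAILED).
    return isinstance(err, torch.cuda.OutOfMemoryError)

This would not address the second point about notebooks, though.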
@LysandreJik Indeed, thanks for the note. rentruewang/koila#12 is a hopeful sign.