
Trainer batch size auto scaling


🚀 Feature request

Since Trainer handles both batch_size and gradient_accumulation_steps, it seems like it could detect some out-of-memory situations and handle them automatically.
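
For context, the two settings combine into an effective batch size; a minimal sketch (the field names are the actual TrainingArguments ones, the numbers are only illustrative):

import transformers

# The effective per-device batch is the product of these two arguments, so the
# same effective batch can be kept while shrinking the per-device batch and
# raising the accumulation steps when memory runs out.
args = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)
effective = args.per_device_train_batch_size * args.gradient_accumulation_steps  # 32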

Motivation

I’ve been experimenting with model search (model_type, vocab_size, num_hidden_layers, hidden_size), and it’s been difficult to manage the correct batch size for each variant. To avoid trial & error and maintaining configuration tables, I’ve been detecting memory exhaustion and adapting the training arguments on the fly. It’s imperfect, but I wonder if there’s an official way to achieve this kind of behavior.
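
To make the setup concrete, a rough sketch of the kind of sweep I mean (BertConfig is just one possible choice here, and the sizes are made up):

import transformers

# Hypothetical sweep: each variant has a very different memory footprint,
# so no single per-device batch size fits all of them.
variants = [
    transformers.BertConfig(vocab_size=8_000, num_hidden_layers=4, hidden_size=384),
    transformers.BertConfig(vocab_size=32_000, num_hidden_layers=12, hidden_size=768),
]
models = [transformers.BertForMaskedLM(config) for config in variants]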

Your contribution

This is just a PoC; I’m sure there are several environments where it might be problematic. In particular, CPU training on Linux is quite likely to trigger the OOM killer, where the entire process is simply wiped from memory. Nevertheless, this strategy seems helpful at least some of the time.

import gc

import torch
import transformers


class BatchAutoScaleTrainer(transformers.Trainer):
    ''' Try to detect application crashes due to CUDA/CPU OOMs and
        rescale the batch size.  An antiprime (highly composite) batch_size
        gives the best results.
        Inspired by PyTorchLightning/pytorch-lightning#1638
    '''

    def _shrink_bs(self):
        # gradient_accumulation_steps is shared by .train() and .evaluate(),
        # so we need a new value that divides both batch sizes.
        tbs = self.args.per_device_train_batch_size
        ebs = self.args.per_device_eval_batch_size
        gas = self.args.gradient_accumulation_steps
        for i in range(gas + 1, min(tbs, ebs) + 1):
            if tbs % i or ebs % i:
                continue
            # Keep the effective batch size (per-device batch * accumulation
            # steps) constant while lowering the per-device batch size.
            self.args.per_device_train_batch_size = (tbs * gas) // i
            self.args.per_device_eval_batch_size = (ebs * gas) // i
            self.args.gradient_accumulation_steps = i
            return True
        return False

    def _is_oom(self, err):
        # shamelessly stolen from https://github.com/PyTorchLightning/pytorch-lightning/pull/1638/files#diff-5200c11792b86d6a07ea64820e126897aa2e3b7d3d295c92c19b141de6950afeR29-R32
        return len(err.args) == 1 and (
            "CUDA out of memory." in err.args[0]
            or "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED." in err.args[0]
            or "DefaultCPUAllocator: can't allocate memory" in err.args[0]
            or "CUDA error: CUBLAS_STATUS_ALLOC_FAILED " in err.args[0]
        )

    def _auto_scale_batch_size(self, code):
        while True:
            try:
                gc.collect()
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                return code()
            except RuntimeError as err:
                # Retry with a smaller per-device batch size only when the
                # error looks like an OOM and a smaller setting exists;
                # anything else is re-raised unchanged.
                if self._is_oom(err) and self._shrink_bs():
                    continue
                raise

    def train(self, *args, **kwds):
        train = super().train
        return self._auto_scale_batch_size(
            lambda: train(*args, **kwds))

    def evaluate(self, *args, **kwds):
        evaluate = super().evaluate
        return self._auto_scale_batch_size(
            lambda: evaluate(*args, **kwds))
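
Used as a drop-in replacement for transformers.Trainer, usage would look roughly like this (model, train_ds and eval_ds are placeholders for whatever is actually being trained):

# Hypothetical usage: `model`, `train_ds` and `eval_ds` stand in for the real
# model and datasets.
args = transformers.TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=48,   # highly composite, so it shrinks gracefully
    per_device_eval_batch_size=48,
    gradient_accumulation_steps=1,
)
trainer = BatchAutoScaleTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()              # per-device batch size drops automatically on OOM
metrics = trainer.evaluate()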

Any chance something like this might be integrated with the Trainer?

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented on Oct 28, 2021

I am very nervous about adding that kind of auto-scaling feature to the Trainer. Note that the _is_oom test, for instance, will catch way more CUDA errors than just OOMs: having the wrong number of labels in your model will trigger an error with CUBLAS_STATUS_ALLOC_FAILED on most environments.

In a notebook, the kernel is in an unrecoverable state after the try/except (and torch.cuda.empty_cache() does not help), so this wouldn’t work either.

So for now, my sense is that such a feature would be more painful for the user than beneficial and I would leave the tuning of the batch size to the user.
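
(As an aside, recent PyTorch releases expose torch.cuda.OutOfMemoryError as a dedicated RuntimeError subclass, which would allow a much narrower check than the string matching in the snippet above; a rough sketch, not what the issue proposes:)

import torch

def _is_cuda_oom(err):
    # Narrower variant: rely on the dedicated exception type instead of
    # matching error-message strings, so unrelated CUDA errors (e.g. from a
    # label mismatch) are not swallowed.
    return isinstance(err, torch.cuda.OutOfMemoryError)

This only covers the CUDA case, so a separate check would still be needed for CPU allocator failures.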

0 reactions
tlby commented on Dec 8, 2021

@LysandreJik Indeed, thanks for the note. rentruewang/koila#12 is a hopeful sign.
