Gradient accumulation doesn't work with Accelerate's `clip_grad_norm_`
See original GitHub issue
System Info
- `Accelerate` version: 0.13.0.dev0
- Platform: Linux-5.10.133+-x86_64-with-debian-bullseye-sid
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
Not found
Information
- The official example scripts
- My own modified scripts
Tasks
- One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behaviour: You can directly run this colab notebook to get the error.
The main training method in the `Trainer` class is `train_one_epoch`:
```python
for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
        self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)
```
This will result in the following error:
```
Traceback (most recent call last)
<ipython-input-21-5a5fa8902df5>:2 in <module>

/usr/local/lib/python3.7/dist-packages/accelerate/launchers.py:83 in notebook_launcher
     80                 print("Launching training on one GPU.")
     81             else:
     82                 print("Launching training on one CPU.")
❱    83             function(*args)
     84
     85     else:
     86         if num_processes is None:

<ipython-input-20-cd919093f91a>:16 in main
<ipython-input-19-44ed46a0baca>:265 in fit
<ipython-input-19-44ed46a0baca>:215 in train_one_epoch

/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:920 in clip_grad_norm_
    917         elif self.distributed_type == DistributedType.DEEPSPEED:
    918             # `accelerator.backward(loss)` is doing that automatically. Therefore, it's
    919             return
❱   920         self.unscale_gradients()
    921         torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
    922
    923     def clip_grad_value_(self, parameters, clip_value):

/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:904 in unscale_gradients
    901             for opt in optimizer:
    902                 while isinstance(opt, AcceleratedOptimizer):
    903                     opt = opt.optimizer
❱   904                 self.scaler.unscale_(opt)
    905
    906     def clip_grad_norm_(self, parameters, max_norm, norm_type=2):
    907         """

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:270 in unscale_
    267         optimizer_state = self._per_optimizer_states[id(optimizer)]
    268
    269         if optimizer_state["stage"] is OptState.UNSCALED:
❱   270             raise RuntimeError("unscale_() has already been called on this optimizer sin
    271         elif optimizer_state["stage"] is OptState.STEPPED:
    272             raise RuntimeError("unscale_() is being called after step().")
    273

RuntimeError: unscale_() has already been called on this optimizer since the last update().
```
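The traceback shows that `Accelerator.clip_grad_norm_` calls `unscale_gradients()` on every invocation, while under `accumulate()` the real optimizer step (and with it the `GradScaler`'s `update()`) presumably only happens on sync steps. On the intermediate micro-batches the scaler is therefore asked to `unscale_()` a second time since the last update, which PyTorch forbids. A minimal sketch of that underlying mechanism in plain PyTorch AMP (not Accelerate code; assumes a CUDA device, and the tiny model and optimizer are placeholders):

```python
# Minimal sketch: calling GradScaler.unscale_() twice on the same optimizer
# without an intervening scaler.step()/scaler.update() raises the RuntimeError
# above. This mirrors what clip_grad_norm_ does on accumulation micro-steps,
# where the real optimizer update is skipped.
import torch

model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for micro_step in range(2):  # two micro-batches, no optimizer update in between
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 4, device="cuda")).mean()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # 2nd iteration: "unscale_() has already been called ..."
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```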
Expected behavior
`clip_grad_norm_` works fine with `gradient_accumulation_steps=1`, but results in an error when `gradient_accumulation_steps` is set greater than 1.
Top GitHub Comments
Thanks, @muellerzr, that did work. However, `unscale_gradients` is not required as accelerate does it in `clip_grad_norm_` (source code here). So, the final loop looks like this:
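(A sketch of that final loop, reconstructed rather than quoted from the comment, assuming the fix is to gate clipping on `self._accelerator.sync_gradients` so gradients are only unscaled and clipped on the step where the optimizer actually updates:)

```python
# Sketch of the corrected train_one_epoch loop: clip_grad_norm_ now only runs
# when gradients are being synchronized, i.e. when the optimizer really steps.
for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
        if self._accelerator.sync_gradients:
            self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)
```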
Thanks again. Closing this issue. I love this library!
Sure, I will be happy to do it!