
Gradient accumulation doesn't work with Accelerate's `clip_grad_norm_`


System Info

- `Accelerate` version: 0.13.0.dev0
- Platform: Linux-5.10.133+-x86_64-with-debian-bullseye-sid
- Python version: 3.7.12
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0 (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behaviour: you can run this Colab notebook directly to get the error.

The main training method in the `Trainer` class is `train_one_epoch`:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
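        # clip_grad_norm_ (which calls GradScaler.unscale_ under fp16) runs on every
        # micro-batch here, even while gradients are still accumulating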
        self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

This will result in the following error:

──────────────────────────────── Traceback (most recent call last) ────────────────────────────────
<ipython-input-21-5a5fa8902df5>:2 in <module>

/usr/local/lib/python3.7/dist-packages/accelerate/launchers.py:83 in notebook_launcher

     80 │   │   │   │   print("Launching training on one GPU.")
     81 │   │   │   else:
     82 │   │   │   │   print("Launching training on one CPU.")
 ❱   83 │   │   │   function(*args)
     84 │
     85 │   else:
     86 │   │   if num_processes is None:

<ipython-input-20-cd919093f91a>:16 in main
<ipython-input-19-44ed46a0baca>:265 in fit
<ipython-input-19-44ed46a0baca>:215 in train_one_epoch

/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:920 in clip_grad_norm_

    917 │   │   elif self.distributed_type == DistributedType.DEEPSPEED:
    918 │   │   │   # `accelerator.backward(loss)` is doing that automatically. Therefore, it's
    919 │   │   │   return
 ❱  920 │   │   self.unscale_gradients()
    921 │   │   torch.nn.utils.clip_grad_norm_(parameters, max_norm, norm_type=norm_type)
    922 │
    923 │   def clip_grad_value_(self, parameters, clip_value):

/usr/local/lib/python3.7/dist-packages/accelerate/accelerator.py:904 in unscale_gradients

    901 │   │   │   for opt in optimizer:
    902 │   │   │   │   while isinstance(opt, AcceleratedOptimizer):
    903 │   │   │   │   │   opt = opt.optimizer
 ❱  904 │   │   │   │   self.scaler.unscale_(opt)
    905 │
    906 │   def clip_grad_norm_(self, parameters, max_norm, norm_type=2):
    907 │   │   """

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py:270 in unscale_

    267 │   │   optimizer_state = self._per_optimizer_states[id(optimizer)]
    268 │   │
    269 │   │   if optimizer_state["stage"] is OptState.UNSCALED:
 ❱  270 │   │   │   raise RuntimeError("unscale_() has already been called on this optimizer sin
    271 │   │   elif optimizer_state["stage"] is OptState.STEPPED:
    272 │   │   │   raise RuntimeError("unscale_() is being called after step().")
    273
────────────────────────────────────────────────────────────────────────────────────────────────────
RuntimeError: unscale_() has already been called on this optimizer since the last update().

Expected behavior

`clip_grad_norm_` works fine with `gradient_accumulation_steps=1`, but raises the error above whenever `gradient_accumulation_steps` is greater than 1.
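
The root cause is visible in the traceback: under mixed precision, `clip_grad_norm_` calls `GradScaler.unscale_()`, and AMP allows `unscale_()` only once per optimizer between updates. With gradient accumulation, several backward passes happen before each real `optimizer.step()`, so the second micro-step trips the check. A minimal sketch of that underlying behaviour, independent of Accelerate (assumes a CUDA GPU; the toy model and values are illustrative):

import torch

model = torch.nn.Linear(4, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for micro_step in range(2):  # two accumulation micro-steps, no step()/update() in between
    with torch.cuda.amp.autocast():
        loss = model(torch.randn(8, 4, device="cuda")).mean()
    scaler.scale(loss).backward()
    # second iteration raises: "unscale_() has already been called on this optimizer
    # since the last update()."
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)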

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 12 (9 by maintainers)

Top GitHub Comments

2 reactions
Gladiator07 commented, Aug 18, 2022

Thanks, @muellerzr, that did work. However, `unscale_gradients` is not required, since Accelerate already does it inside `clip_grad_norm_` (source code here).

So the final loop looks like this:

for step, batch in enumerate(dataloader):
    with self._accelerator.accumulate(self.model):
        self.optimizer.zero_grad()
        _, loss = self.model(**batch)
        self._accelerator.backward(loss)
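        # clip only once per effective optimizer step, i.e. when gradients have just been synced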
        if self._accelerator.sync_gradients:
            self._accelerator.clip_grad_norm_(self.model.parameters(), self.args.max_grad_norm)
        self.optimizer.step()
        self.lr_scheduler.step()
        # assuming dataset has label as key
        self._trn_loss_meter.update(
            loss.item() * self.args.gradient_accumulation_steps, batch["label"].size(0)
        )
        if self._accelerator.sync_gradients:
            self.global_prog_bar.set_postfix(loss=self._trn_loss_meter.avg)
            self.global_prog_bar.update(1)

Thanks again. Closing this issue. I love this library 😃
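
For context, both loops above assume an `Accelerator` created with gradient accumulation and mixed precision enabled, and the model, optimizer, dataloader, and scheduler passed through `prepare`. A minimal, self-contained setup sketch (illustrative values and a toy model; fp16 assumes a GPU):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(
    gradient_accumulation_steps=4,  # > 1 is what triggered the original error
    mixed_precision="fp16",         # enables the GradScaler involved in the traceback
)

model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=8
)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda _: 1.0)

model, optimizer, dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, dataloader, lr_scheduler
)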

1 reaction
Gladiator07 commented, Aug 18, 2022

Sure, I will be happy to do it!

