GradsScalarHandler logs 0 gradients if default update function is used

🐛 Bug description

Logging the gradients per epoch / iteration is a useful way to debug an under-performing model. Ignite provides an easy-to-use handler for this in its tensorboard_logger module, ignite.contrib.handlers.tensorboard_logger.GradsScalarHandler. However, the default update function used by Engines generated by create_supervised_trainer zeroes the gradients before it returns, causing the handler to log zeroed-out gradients all the time.

Steps to reproduce: My code is too complicated at the moment to offer clear insight, and I lack the time to distill a minimal (not-)working example, so I will provide abstracted steps (a rough sketch of the setup follows the list).

  1. Generate an Engine / DeterministicEngine on an arbitrary problem by the create_supervised_trainer method.
  2. Establish a TensorboardLogger.
  3. Attach a GradsScalarHandler / your choice of a gradient logger. Also log the training loss or some other metric.
  4. Start the training, check TensorBoard and see the constant-zero gradient norms / gradients, even though the losses/metrics imply that learning is taking place.
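
For concreteness, here is a minimal sketch of such a setup. The model, data and log directory are toy placeholders; the logger attachment follows the documented ignite 0.4 tensorboard_logger API:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    from ignite.engine import Events, create_supervised_trainer
    from ignite.contrib.handlers.tensorboard_logger import TensorboardLogger, GradsScalarHandler

    # Toy model and data -- any supervised setup reproduces the symptom.
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

    # 1. Engine from the factory method.
    trainer = create_supervised_trainer(model, optimizer, loss_fn)

    # 2.-3. TensorboardLogger with a gradient-norm handler attached per iteration.
    tb_logger = TensorboardLogger(log_dir="./tb_logs")
    tb_logger.attach(
        trainer,
        log_handler=GradsScalarHandler(model, reduction=torch.norm),
        event_name=Events.ITERATION_COMPLETED,
    )

    # 4. Train, then inspect TensorBoard: the logged grad norms stay at 0.
    trainer.run(loader, max_epochs=2)
    tb_logger.close()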

Solution proposal

Preserving the gradients until epoch end is tricky, but not required for my purposes. If we are OK with using Events.ITERATION_COMPLETED as a cue to log gradients, then we can simply modify the default update functions as follows:

(assuming engine.state.iteration counts from 1).

    # Closure variables (model, optimizer, loss_fn, prepare_batch, device, non_blocking,
    # gradient_accumulation_steps, output_transform) come from the enclosing
    # create_supervised_trainer factory.
    def update(engine: Engine, batch: Sequence[torch.Tensor]) -> Union[Any, Tuple[torch.Tensor]]:
        # Note: as corrected in the comments below, this condition should be
        # (engine.state.iteration - 1) % gradient_accumulation_steps == 0.
        if (engine.state.iteration + 1) % gradient_accumulation_steps == 0:
            optimizer.zero_grad()
        model.train()
        x, y = prepare_batch(batch, device=device, non_blocking=non_blocking)
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        if gradient_accumulation_steps > 1:
            loss = loss / gradient_accumulation_steps
        loss.backward()
        if engine.state.iteration % gradient_accumulation_steps == 0:
            optimizer.step()
        return output_transform(x, y, y_pred, loss)

This way, upon completion of update and at the moment of Events.ITERATION_COMPLETED firing, there will be some non-zero gradients available to be logged.
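
As a stop-gap until the default update is changed, the same idea can be applied by wrapping a custom step in a plain Engine instead of using the factory. The following sketch uses the corrected (engine.state.iteration - 1) condition discussed in the comments below; the make_trainer helper and the toy batch handling are illustrative, not part of Ignite:

    from ignite.engine import Engine, Events
    from ignite.contrib.handlers.tensorboard_logger import TensorboardLogger, GradsScalarHandler

    def make_trainer(model, optimizer, loss_fn, device="cpu", gradient_accumulation_steps=1):
        def update(engine, batch):
            # Zero at the *start* of an accumulation window instead of right after
            # optimizer.step(), so gradients survive until ITERATION_COMPLETED.
            if (engine.state.iteration - 1) % gradient_accumulation_steps == 0:
                optimizer.zero_grad()
            model.train()
            x, y = batch[0].to(device), batch[1].to(device)
            y_pred = model(x)
            loss = loss_fn(y_pred, y) / gradient_accumulation_steps
            loss.backward()
            if engine.state.iteration % gradient_accumulation_steps == 0:
                optimizer.step()
            return loss.item()

        return Engine(update)

    # Attaching the handler now logs non-zero gradients:
    # trainer = make_trainer(model, optimizer, loss_fn)
    # tb_logger = TensorboardLogger(log_dir="./tb_logs")
    # tb_logger.attach(trainer, log_handler=GradsScalarHandler(model),
    #                  event_name=Events.ITERATION_COMPLETED)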

Environment (latest version of Ignite still has the same bug)

  • PyTorch Version (e.g., 1.4): 1.10.1
  • Ignite Version (e.g., 0.3.0): 0.4.7
  • OS (e.g., Linux):
  • How you installed Ignite (conda, pip, source): conda
  • Python version: 3.9.7
  • Any other relevant information:

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
egaznep commented, Feb 16, 2022

> @egaznep could you please check your implementation if it also covers the issue we had previously here. Looks like if (engine.state.iteration + 1) % gradient_accumulation_steps == 0 should fix the issue we had previously. If you’d like to send a draft PR for that, it would be nice! Thanks

I was wrong actually, it should be (engine.state.iteration - 1). I tested for two cases (gradient_accumulation_steps = 1 and gradient_accumulation_steps = 3). The behavior is as follows.

gradient_accumulation_steps = 1:

  1. zero_grad, accumulate, step, (ITERATION_COMPLETED)
  2. zero_grad, accumulate, step, (ITERATION_COMPLETED) …

gradient_accumulation_steps = 3:

  1. zero_grad, accumulate, (ITERATION_COMPLETED)
  2. accumulate, (ITERATION_COMPLETED)
  3. accumulate, step, (ITERATION_COMPLETED)
  4. zero_grad, accumulate, (ITERATION_COMPLETED)
  5. accumulate, (ITERATION_COMPLETED)
  6. accumulate, step, (ITERATION_COMPLETED) …

This way we both accumulate gradients in the desired manner and avoid flushing them once the iteration finishes, so they can be logged. The PR is on the way!
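
The schedule above is easy to sanity-check by simulating the two conditions directly (a standalone sketch, independent of Ignite):

    # Print which actions the corrected conditions trigger at each iteration.
    def schedule(gradient_accumulation_steps, n_iterations=6):
        for iteration in range(1, n_iterations + 1):  # engine.state.iteration counts from 1
            actions = []
            if (iteration - 1) % gradient_accumulation_steps == 0:
                actions.append("zero_grad")
            actions.append("accumulate")
            if iteration % gradient_accumulation_steps == 0:
                actions.append("step")
            print(iteration, ", ".join(actions) + ", (ITERATION_COMPLETED)")

    schedule(1)  # zero_grad, accumulate, step on every iteration
    schedule(3)  # zero_grad at iterations 1 and 4; step at 3 and 6 -- matches the listing above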

1 reaction
vfdev-5 commented, Feb 12, 2022

@egaznep in the beginning we implemented it in a similar way, but there was a bug in that implementation, so we switched to the current one. We have to check your solution carefully to ensure that it works perfectly. I haven’t checked it in detail yet, just giving a bit of context.
