
Document upstream gradient behavior for functions with multiple outputs

See original GitHub issue

When a function has several outputs and the user calls backward through one of them, it is not obvious that the upstream gradients of all the other outputs are also collected. This is probably most noticeable when the last function in the graph has multiple outputs. The following is such a case.

import numpy as np
import chainer
from chainer import Variable

x = chainer.Variable(np.arange(4, dtype='f'))
ys = chainer.functions.split_axis(x, 2, axis=0)
for y in ys:
    # Give both outputs an upstream gradient of 3.
    y.grad_var = chainer.Variable(np.full_like(y.array, 3, dtype='f'))
ys[0].backward()
x.grad  # [3, 3, 3, 3]. Some users might expect [3, 3, 0, 0] since ys[1] is not involved?
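For contrast, here is a minimal sketch (not part of the original report) of how to get the [3, 3, 0, 0] result that some users might expect: explicitly give the uninvolved output a zero gradient, so it contributes nothing when the upstream gradients are collected.

import numpy as np
import chainer

x = chainer.Variable(np.arange(4, dtype='f'))
ys = chainer.functions.split_axis(x, 2, axis=0)

# Only ys[0] carries a meaningful upstream gradient; ys[1] gets zeros.
ys[0].grad_var = chainer.Variable(np.full_like(ys[0].array, 3, dtype='f'))
ys[1].grad_var = chainer.Variable(np.zeros_like(ys[1].array))

ys[0].backward()
print(x.grad)  # [3. 3. 0. 0.]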

How about documenting this behavior, or is it already documented?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
beam2d commented, Nov 27, 2018

This feature has been used as a trick for starting backprop from multiple root variables with Variable.backward: to start backprop from variables x and y, one can feed them to F.identity and then call backward on one of the outputs.

x, y = F.identity(x, y)
x.grad = ...
y.grad = ...
x.backward()

I know that this snippet is too tricky, and it’s not good to let users rely on such a trick, but we should at least provide an alternative way to accomplish the same goal if we remove the feature discussed in this issue. One idea is to provide a functional version of Variable.backward, say chainer.backward(ys), which accepts multiple Variables to start with.
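
For reference, here is a more complete, runnable version of the trick sketched above (the concrete shapes and gradient values are illustrative, not taken from the comment):

import numpy as np
import chainer
import chainer.functions as F

# Two independent root variables to start backprop from.
x = chainer.Variable(np.ones(3, dtype='f'))
y = chainer.Variable(np.ones(3, dtype='f'))

# F.identity returns one output per input, and the outputs share a single
# function node, so backward through one output also collects the other's gradient.
x_out, y_out = F.identity(x, y)
x_out.grad = np.full(3, 2, dtype='f')
y_out.grad = np.full(3, 5, dtype='f')

x_out.backward()
print(x.grad)  # [2. 2. 2.]
print(y.grad)  # [5. 5. 5.]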

0 reactions
stale[bot] commented, Oct 30, 2019

This issue is closed as announced. Feel free to re-open it if needed.
