
Double backward always returns NaN when dtype is float16 and cuDNN is enabled.


When F.reshape and F.batch_normalization are used together under the condition that the dtype is float16 and use_cudnn='always', double backward through the pair is so unstable that it returns NaN with high probability.

One use-case of the pair is F.group_normalization: https://github.com/chainer/chainer/blob/afe903389d822583a5355e9d46e6766d048ebeb5/chainer/functions/normalization/group_normalization.py#L61-L72
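
For context, here is a minimal sketch (not taken from the issue; the shapes, group count, and gamma/beta values are illustrative) of how the same F.reshape + F.batch_normalization path is reached through F.group_normalization with float16 inputs and cuDNN forced on:

import chainer
import chainer.functions as F
import cupy as cp

# 6 channels split into 3 groups; group_normalization internally reshapes the
# input and calls batch_normalization, i.e. exactly the pair described above.
x = cp.random.uniform(-1, 1, (5, 6, 4, 4)).astype(cp.float16)
gamma = cp.ones(6, dtype=cp.float16)
beta = cp.zeros(6, dtype=cp.float16)

with chainer.using_config('use_cudnn', 'always'):
    y = F.group_normalization(x, 3, gamma, beta)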

  • Conditions
Platform: Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid
Chainer: 6.0.0b2
NumPy: 1.15.4
CuPy:
  CuPy Version          : 6.0.0b2
  CUDA Root             : /usr/local/cuda
  CUDA Build Version    : 9020
  CUDA Driver Version   : 9020
  CUDA Runtime Version  : 9020
  cuDNN Build Version   : 7201
  cuDNN Version         : 7201
  NCCL Build Version    : None
iDeep: 2.0.0.post3
  • Code to reproduce
import cupy as cp
from chainer import gradient_check
import chainer.functions as F
import numpy


def reshape_and_bn(x, gamma, beta):
    # Mimics F.group_normalization: fold the (N, C, H, W) input into a single
    # batch-normalization call over N * C pseudo-channels, then apply the
    # per-channel gamma and beta.
    x_shape = x.shape
    expander = (None, Ellipsis, None, None)
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = cp.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = cp.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    gamma = gamma[expander]
    beta = beta[expander]
    return x_normalized * gamma + beta


def run():
    # float16 inputs and upstream gradients; dtype=numpy.float64 below makes
    # gradient_check compute the numerical gradients in float64.
    x = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gy = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    ggx = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    beta = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    ggamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    gbeta = cp.random.uniform(-1, 1, 3).astype(cp.float16)

    print('Backward')
    gradient_check.check_backward(
        reshape_and_bn, (x, gamma, beta), (gy,), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )

    print('Double Backward')
    gradient_check.check_double_backward(
        reshape_and_bn, (x, gamma, beta), (gy,),
        (ggx, ggamma, gbeta), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3
    )


if __name__ == '__main__':
    run()
  • Error messages, stack traces, or logs

For backward,

gradients (numeric):  0.6409951020032167
gradients (backward): -0.5768083848859537


Not equal to tolerance rtol=0.001, atol=0.01

Mismatch: 100%
Max absolute difference: 1.21780349
Max relative difference: 2.1112791
 x: array(0.640995)
 y: array(-0.576808)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 0.6409951020032167
  y[i]: -0.5768083848859537
  relative error[i]: 2.1112790985691965
  absolute error[i]: 1.2178034868891703
x: 0.6409951
y: -0.57680838

For double backward,

gradients (numeric):  1.773324329406023
gradients (backward): nan


Not equal to tolerance rtol=0.001, atol=0.01

x and y nan location mismatch:
 x: array(1.773324)
 y: array(nan)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 1.773324329406023
  y[i]: nan
  relative error[i]: nan
  absolute error[i]: nan
x: 1.77332433
y: nan
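
One note on the reproduction (my reading, not something stated in the report): the script above never sets use_cudnn explicitly, so it relies on Chainer's default use_cudnn='auto' selecting the cuDNN kernels. To make the "cuDNN is enabled" condition from the title explicit, the check can be run under the config scope:

import chainer

# Force cuDNN for every function that supports it; 'auto' (the default) may
# also select cuDNN here, but 'always' leaves no ambiguity.
with chainer.using_config('use_cudnn', 'always'):
    run()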

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
crcrpar commented, Mar 18, 2019

@grafi-tt I tried the snippet with current master, and it worked.
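
(For anyone re-checking this on their own machine, a small sketch of how to confirm which Chainer/CuPy/cuDNN versions are actually installed, i.e. the same information as the Conditions block above:)

import chainer

print(chainer.__version__)
chainer.print_runtime_info()  # reports Chainer, NumPy, CuPy (CUDA/cuDNN) and iDeep versions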

1 reaction
takagi commented, Feb 26, 2019

I will check if this issue is related to #6323.


