Double backward always returns NaN when dtype is float16 and cuDNN is enabled.
When a pair of F.reshape and F.batch_normalization is used under the condition that dtype is float16 and use_cudnn='always', double backward of the pair becomes so unstable that it returns NaN with high probability.
One use case of the pair is F.group_normalization:
https://github.com/chainer/chainer/blob/afe903389d822583a5355e9d46e6766d048ebeb5/chainer/functions/normalization/group_normalization.py#L61-L72
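
For context, the reshape trick folds the batch and group axes into the channel axis of a pseudo batch of size 1, so that batch normalization computes statistics per (sample, group) pair. Below is a minimal shape-only sketch of that folding (plain NumPy; the sizes and names are illustrative, not from the issue):

```python
import numpy as np

N, C, H, W = 5, 6, 4, 4  # illustrative sizes; C must be divisible by groups
groups = 3

x = np.random.uniform(-1, 1, (N, C, H, W)).astype(np.float16)

# Fold (batch, groups) into the channel axis of a fake batch of size 1.
# Batch normalization over this tensor then normalizes each of the
# N * groups pseudo-channels independently, i.e. per-group statistics.
x_ = x.reshape(1, N * groups, -1, 1)
print(x_.shape)  # (1, 15, 32, 1)
```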
- Conditions
```
Platform: Linux-4.4.0-98-generic-x86_64-with-debian-stretch-sid
Chainer: 6.0.0b2
NumPy: 1.15.4
CuPy:
  CuPy Version         : 6.0.0b2
  CUDA Root            : /usr/local/cuda
  CUDA Build Version   : 9020
  CUDA Driver Version  : 9020
  CUDA Runtime Version : 9020
  cuDNN Build Version  : 7201
  cuDNN Version        : 7201
  NCCL Build Version   : None
iDeep: 2.0.0.post3
```
- Code to reproduce
```python
import cupy as cp
from chainer import gradient_check
import chainer.functions as F
import numpy


def reshape_and_bn(x, gamma, beta):
    x_shape = x.shape
    expander = [None, Ellipsis, None, None]
    x_ = F.reshape(x, (1, x_shape[0] * x_shape[1], -1, 1))
    dummy_g = cp.ones(x_.shape[1], dtype=x_.dtype)
    dummy_b = cp.zeros(x_.shape[1], dtype=x_.dtype)
    x_normalized = F.batch_normalization(x_, dummy_g, dummy_b)
    x_normalized = F.reshape(x_normalized, x_shape)
    gamma = gamma[expander]
    beta = beta[expander]
    return x_normalized * gamma + beta


def run():
    x = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gy = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    ggx = cp.random.uniform(-1, 1, (5, 3, 4, 4)).astype(cp.float16)
    gamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    beta = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    ggamma = cp.random.uniform(-1, 1, 3).astype(cp.float16)
    gbeta = cp.random.uniform(-1, 1, 3).astype(cp.float16)

    print('Backward')
    gradient_check.check_backward(
        reshape_and_bn, (x, gamma, beta), (gy,), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3)

    print('Double Backward')
    gradient_check.check_double_backward(
        reshape_and_bn, (x, gamma, beta), (gy,),
        (ggx, ggamma, gbeta), dtype=numpy.float64,
        atol=1e-2, rtol=1e-3)


if __name__ == '__main__':
    run()
```
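
Note that the snippet above leaves use_cudnn at its default ('auto'); to pin the cuDNN path named in the title, the call can be wrapped in Chainer's standard configuration context (a sketch, not part of the original report):

```python
import chainer

if __name__ == '__main__':
    # Force cuDNN kernels for F.batch_normalization; the default
    # 'auto' may also select cuDNN, but 'always' matches the
    # condition described in the title.
    with chainer.using_config('use_cudnn', 'always'):
        run()
```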
- Error messages, stack traces, or logs
For backward:

```
gradients (numeric):  0.6409951020032167
gradients (backward): -0.5768083848859537

Not equal to tolerance rtol=0.001, atol=0.01
Mismatch: 100%
Max absolute difference: 1.21780349
Max relative difference: 2.1112791
 x: array(0.640995)
 y: array(-0.576808)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 0.6409951020032167
  y[i]: -0.5768083848859537
  relative error[i]: 2.1112790985691965
  absolute error[i]: 1.2178034868891703
x: 0.6409951
y: -0.57680838
```
For double backward:

```
gradients (numeric):  1.773324329406023
gradients (backward): nan

Not equal to tolerance rtol=0.001, atol=0.01
x and y nan location mismatch:
 x: array(1.773324)
 y: array(nan)

assert_allclose failed:
  shape: () ()
  dtype: float64 float64
  i: (0,)
  x[i]: 1.773324329406023
  y[i]: nan
  relative error[i]: nan
  absolute error[i]: nan
x: 1.77332433
y: nan
```
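
To attribute the NaN to a specific function node rather than to the final gradient check, one option (not from the issue) is Chainer's debug mode, which validates values during backprop and raises as soon as a NaN shows up:

```python
import chainer

# Debug mode makes Chainer check gradients for NaN at each function
# node during the backward pass and raise with that node's traceback,
# at the cost of slower execution (sketch; wraps the run() above).
with chainer.using_config('debug', True):
    run()
```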
Top GitHub Comments
@grafi-tt I tried the snippet with current master, and it worked.
I will check if this issue is related to #6323.