Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

LinAlgError (Array must not contain infs or NaNs) thrown in get_mu_tensor

See original GitHub issue

Below is a simple piece of code to try YellowFin on my dataset.

x = tf.placeholder( tf.float32, [ None, train_x.shape[ 1 ] ] )
y = tf.placeholder( tf.float32, [ None, train_y.shape[ 1 ] ] )
m = tf.layers.dense( x, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, hidden_dim )
m = tf.layers.batch_normalization( m )
m = tf.nn.elu( m )
m = tf.layers.dense( m, train_y.shape[ 1 ] )
prediction = tf.nn.softmax( m )
loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits( labels=y, logits=m ) )
optimizer = yellowfin.YFOptimizer().minimize( loss )

s = tf.Session()
s.run( tf.global_variables_initializer() )
for epoch in range( epochs ):
    _, h = s.run( [ optimizer, loss ], feed_dict={ x: train_x, y: train_y } )

Usually, it crashes and throws the following exception.

Caused by op 'update_hyper/cond/PyFuncStateless', defined at:
  File "test2.py", line 47, in <module>
    optimizer = yf.YFOptimizer( learning_rate=1., momentum=0. ).minimize( loss )
  File "/data/python-mp-test/libs/yellowfin.py", line 268, in minimize
    return self.apply_gradients(grads_and_vars)
  File "/data/python-mp-test/libs/yellowfin.py", line 223, in apply_gradients
    update_hyper_op = self.update_hyper_param()
  File "/data/python-mp-test/libs/yellowfin.py", line 191, in update_hyper_param
    lambda: self._mu_var) )
  File "/usr/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch
    original_result = fn()
  File "/data/python-mp-test/libs/yellowfin.py", line 190, in <lambda>
    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),
  File "/data/python-mp-test/libs/yellowfin.py", line 173, in get_mu_tensor
    roots = tf.py_func(np.roots, [coef], Tout=tf.complex64, stateful=False)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 201, in py_func
    input=inp, token=token, Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 56, in _py_func_stateless
    Tout=Tout, name=name)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

UnknownError (see above for traceback): LinAlgError: Array must not contain infs or NaNs
	 [[Node: update_hyper/cond/PyFuncStateless = PyFuncStateless[Tin=[DT_FLOAT], Tout=[DT_COMPLEX64], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/cpu:0"](update_hyper/cond/ScatterUpdate)]]

Issue Analytics

State:
Created 6 years ago
Comments:6 (2 by maintainers)

Top GitHub Comments

1reaction

JianGoForItcommented, Jul 14, 2017

Hi @ywchan2005,

Thanks for trying out the optimizer. This is mostly because of the exploding gradient in the middle of training.

If it happens in the very beginning, you might want to play with the initial value a bit.
If it is in the middle of training, please consider using gradient clipping. There is discussion with solutions in our PyTorch YellowFin repo here. Similar solution can apply to the TF repo.
We are working on a better auto gradient clipping feature. You may also wait for that in a few days. But I suggest you can already start working on 2.

0reactions

staticfloatcommented, Jan 27, 2018

I can confirm that I ran into this using a standard AlexNet architecture being trained on the ImageNet corpus using PyTorch. After 7 full epochs, (that is, having trained on 60928 minibatches, each of size 64) I received the following error:

yellowfin.py:192: RuntimeWarning: invalid value encountered in add
  self._grad_var += global_state['grad_norm_squared_avg'] / debias_factor
/var/storage/shared/msrlabs/sabae/libsmolder_autodeploy/libsmolder/optimizers/yellowfin.py:329: RuntimeWarning: invalid value encountered in double_scalars
  self._mu_t = max(root**2, ( (np.sqrt(dr) - 1) / (np.sqrt(dr) + 1) )**2 )

It would be really nice to have gradient clipping or some kind of workaround for this built-in to YellowFin. 😃