Possible numerical instability of gradient calculation in PPO2 (?)
First of all, I'm not really sure whether this is a problem on my side or a bug on your side, but I've been trying to debug this for a few days now and I really don't know what to do anymore.
Bug description
The bug I'm facing is easily described: I get NaN values while training an MlpPolicy with PPO2 on a custom environment I'm writing for my master's thesis.
The stacktrace is the following:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node loss/VerifyFinite/CheckNumerics}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 90, in <module>
model.learn(config["ppo"]["num_timesteps"])
File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 307, in learn
update=timestep))
File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 261, in _train_step
[self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train], td_map)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]
Caused by op 'loss/VerifyFinite/CheckNumerics', defined at:
File "train.py", line 81, in <module>
ent_coef=config["ppo"]["entropy_coefficient"],
File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 93, in __init__
self.setup_model()
File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 175, in setup_model
grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
return verify_tensor_all_finite_v2(t, msg, name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
verify_input = array_ops.check_numerics(x, message=message)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]
It looks like the NaNs are occurring in this call of tf.gradients. For further debugging, I added some assertions:
diff --git a/stable_baselines/ppo2/ppo2.py b/stable_baselines/ppo2/ppo2.py
index eb009ce..0af1e9e 100644
--- a/stable_baselines/ppo2/ppo2.py
+++ b/stable_baselines/ppo2/ppo2.py
@@ -170,7 +170,14 @@ class PPO2(ActorCriticRLModel):
if self.full_tensorboard_log:
for var in self.params:
tf.summary.histogram(var.name, var)
+
+ loss = tf.debugging.assert_all_finite(loss, msg="rip loss")
+
grads = tf.gradients(loss, self.params)
+
+ grads = [ tf.debugging.assert_all_finite(grad, msg=f"rip grad{i}") if grad is not None else None
+ for i, grad in enumerate(grads) ]
+
if self.max_grad_norm is not None:
grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
grads = list(zip(grads, self.params))
With those assertions added, I'm quite sure that the tf.gradients call is the problem and that the NaNs aren't propagated from the loss variable, since the gradient with index 14 is the one that raises the error.
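For reference, here is a rough sketch of how that gradient index could be mapped back to a variable name. It assumes a stable-baselines version that exposes get_parameter_list() (which follows the same variable ordering as self.params in setup_model); the checkpoint path is a placeholder.

```python
from stable_baselines import PPO2

# Sketch: map the failing gradient index back to the variable it belongs to.
# "checkpoint.zip" is a placeholder path; get_parameter_list() is assumed to
# return the variables in the same order as self.params used in setup_model.
model = PPO2.load("checkpoint.zip")
params = model.get_parameter_list()
print(params[14].name, params[14].shape)   # the tensor whose gradient turned NaN
```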
Googling leads me to the assumption that this has to do with numerical instability of the gradient calculation, so I thought it might be possible to add an epsilon on top of the loss variable:
+ eps = tf.constant(1e-7)
+ loss = tf.add(loss, eps)
Sadly, this doesn't help and the error persists. I'm not really sure what to do next, and it doesn't help that every test run takes multiple hours before the error shows up.
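In hindsight this is expected: adding a constant to the final loss cannot change its gradients, because the derivative of a constant is zero. An epsilon only helps when it is placed inside the numerically sensitive operation itself (e.g. a log, sqrt or division). A minimal TF 1.x sketch illustrating the difference:

```python
import tensorflow as tf

x = tf.Variable(3.0)
loss = tf.log(x)                          # stand-in for a numerically sensitive op
loss_eps = tf.log(x) + tf.constant(1e-7)  # epsilon added to the final loss

g_plain = tf.gradients(loss, x)[0]
g_eps = tf.gradients(loss_eps, x)[0]

# Protecting the op itself does change the computation:
g_safe = tf.gradients(tf.log(x + 1e-7), x)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run([g_plain, g_eps]))   # identical: the added constant is a no-op for gradients
    print(sess.run(g_safe))             # slightly different: the epsilon sits inside the log
```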
Code example
I can't provide a minimal code example, and the problem only occurs after one to three hours of training on my machine, but I'll happily test anything anyone suggests. I'm grateful for every comment; I really have to fix this.
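One thing that might be worth trying in the meantime (a sketch, assuming a stable-baselines version that ships VecCheckNan): wrap the environment so that NaNs or infs coming from observations, rewards or actions raise immediately, which at least rules the environment out as the source. "Pendulum-v0" below is a placeholder for the custom environment.

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecCheckNan

# "Pendulum-v0" stands in for the custom environment from this report.
env = DummyVecEnv([lambda: gym.make("Pendulum-v0")])
env = VecCheckNan(env, raise_exception=True)   # fail fast if the env emits NaN/inf

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=100000)
```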
System Info
I don't think this is a hardware or installation problem, but I'll add the system info anyway:
- I installed via `pip install -e .`
- We have a Titan X and an RTX 2060 for training
- We're using Python 3.6.7
- We're using TensorFlow 1.13.1
- We're using CUDA 10.0
- I don’t think there are other relevant libraries
Top GitHub Comments
For everyone who stumbles upon this issue via Google: in my case it looks like I had an entropy coefficient that was way too high.
The fact that badly chosen hyperparameters can result in NaNs inside the gradient calculation really threw me off. I'm closing this now, thanks for the pointer!
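For reference, a minimal sketch of a more conservative configuration. The default ent_coef in stable-baselines' PPO2 is 0.01, and max_grad_norm defaults to 0.5; the environment id below is a placeholder.

```python
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# "CartPole-v1" stands in for the custom environment from this report.
# ent_coef defaults to 0.01; values that are orders of magnitude larger reward
# maximum-entropy policies so strongly that the optimisation can blow up.
model = PPO2(MlpPolicy, "CartPole-v1", ent_coef=0.01, max_grad_norm=0.5, verbose=1)
model.learn(total_timesteps=50000)
```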
I would rather recommend looking at: rather than having a pre-defined range.
Also, you should first use automatic hyperparameter tuning (available in the rl zoo), which saves a lot of effort compared to tuning by hand 😉.
I agree that this is not the responsibility of SB, and if you change the default hyperparameters, you should know what you are doing.
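To illustrate what the automatic tuning mentioned above looks like, here is a rough hand-rolled Optuna sketch over the two hyperparameters most relevant here. The rl zoo automates this with proper search spaces and pruning, so prefer it for real experiments; the environment id, ranges and budgets below are placeholders, and evaluate_policy assumes a stable-baselines version that provides it.

```python
import optuna
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.evaluation import evaluate_policy

def objective(trial):
    # Placeholder search ranges; the rl zoo ships tuned spaces per algorithm.
    ent_coef = trial.suggest_loguniform("ent_coef", 1e-5, 1e-1)
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-3)

    model = PPO2(MlpPolicy, "CartPole-v1", ent_coef=ent_coef,
                 learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=20000)

    mean_reward, _ = evaluate_policy(model, model.get_env(), n_eval_episodes=10)
    return mean_reward

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```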