Possible numerical instability of gradient calculation in PPO2 (?)


First of all, I’m not really sure whether this is a problem on my side or a bug on your side, but I’ve been trying to debug this for several days now and I really don’t know what to do anymore.

Bug description

The bug I’m facing is easily described: while training an MlpPolicy with PPO2 on a custom environment I’m writing for my master’s thesis, I get NaN values.

The stack trace is the following:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[{{node loss/VerifyFinite/CheckNumerics}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 90, in <module>
    model.learn(config["ppo"]["num_timesteps"])
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 307, in learn
    update=timestep))
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 261, in _train_step
    [self.pg_loss, self.vf_loss, self.entropy, self.approxkl, self.clipfrac, self._train], td_map)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
         [[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]

Caused by op 'loss/VerifyFinite/CheckNumerics', defined at:
  File "train.py", line 81, in <module>
    ent_coef=config["ppo"]["entropy_coefficient"],
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 93, in __init__
    self.setup_model()
  File "/home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py", line 175, in setup_model
    grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/clip_ops.py", line 271, in clip_by_global_norm
    "Found Inf or NaN global norm.")
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 44, in verify_tensor_all_finite
    return verify_tensor_all_finite_v2(t, msg, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 62, in verify_tensor_all_finite_v2
    verify_input = array_ops.check_numerics(x, message=message)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
         [[node loss/VerifyFinite/CheckNumerics (defined at /home/jkuball/Git/stable-baselines/stable_baselines/ppo2/ppo2.py:175) ]]

It looks like the NaNs are occurring in this call of tf.gradients. For further debugging I added some assertions:

diff --git a/stable_baselines/ppo2/ppo2.py b/stable_baselines/ppo2/ppo2.py
index eb009ce..0af1e9e 100644
--- a/stable_baselines/ppo2/ppo2.py
+++ b/stable_baselines/ppo2/ppo2.py
@@ -170,7 +170,14 @@ class PPO2(ActorCriticRLModel):
                         if self.full_tensorboard_log:
                             for var in self.params:
                                 tf.summary.histogram(var.name, var)
+
+                    loss = tf.debugging.assert_all_finite(loss, msg="rip loss")
+
                     grads = tf.gradients(loss, self.params)
+
+                    grads = [ tf.debugging.assert_all_finite(grad, msg=f"rip grad{i}") if grad is not None else None
+                              for i, grad in enumerate(grads) ]
+
                     if self.max_grad_norm is not None:
                         grads, _grad_norm = tf.clip_by_global_norm(grads, self.max_grad_norm)
                     grads = list(zip(grads, self.params))

With those assertions added, I’m fairly confident that the tf.gradients call is the problem and that the NaNs aren’t propagated from the loss variable, since the gradient with index 14 is the one that raises the error.
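To map that index back to a concrete variable, here is a minimal sketch of my own debugging aid (placed directly after the tf.gradients call in setup_model, so that grads and self.params are in scope):

# grads[i] is d(loss)/d(self.params[i]), so the index in the failing
# assertion identifies the exact variable whose gradient goes NaN first.
for i, (grad, var) in enumerate(zip(grads, self.params)):
    print(i, var.name, var.shape, "(no gradient)" if grad is None else "")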

Googling leads me to the assumption that this has to do with numerical instability of the gradient calculation, so I thought it might be possible to add an epsilon on top of the loss variable:

+                    eps = tf.constant(1e-7)
+                    loss = tf.add(loss, eps)

Sadly, this doesn’t help and the error persists. I’m not really sure what to do next, and it doesn’t help that every test run takes multiple hours to verify.

Code example

I can’t provide a minimal code example, and the problem occurs only after one to three hours of training on my machine, but I’ll happily test anything anyone suggests. I’m grateful for every comment; I really have to fix this.
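In the meantime, one thing that might help rule out the environment as the NaN source is a finite-check wrapper around it. A minimal sketch (the wrapper class is my own suggestion, not part of stable-baselines; it assumes a Box observation space, as used with MlpPolicy):

import gym
import numpy as np

class CheckFiniteWrapper(gym.Wrapper):
    # Fail fast if the environment itself ever emits NaN/inf, so the
    # environment can be ruled out as the source of the bad gradients.
    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        assert np.all(np.isfinite(obs)), "non-finite observation from reset()"
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        assert np.all(np.isfinite(obs)), "non-finite observation from step()"
        assert np.isfinite(reward), "non-finite reward from step()"
        return obs, reward, done, info

# usage: env = CheckFiniteWrapper(MyCustomEnv())  # before wrapping in a VecEnv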

System Info

I don’t think this is a hardware or installation problem, but I’ll add the system info anyway:

  • I installed via pip install -e .
  • We have a Titan X and an RTX 2060 for training
  • We’re using Python 3.6.7
  • We’re using TensorFlow 1.13.1
  • We’re using CUDA 10.0
  • I don’t think there are other relevant libraries

Top GitHub Comments

jkuball commented on May 31, 2019 (8 reactions):

For everyone who stumbles upon this issue via Google: in my case, it looks like I had an entropy coefficient that was way too high.

The fact that badly chosen hyperparameters can result in NaNs inside the gradient calculation really threw me off. I’m closing this now, thanks for the pointer!
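In code, the fix was just passing a smaller value for ent_coef when constructing the model. A minimal sketch (the environment name here is a placeholder, and 0.01 is the library default):

from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# ent_coef weights the entropy bonus in the PPO loss; an overly large value
# kept pushing my policy into regions where the gradients blew up to NaN.
model = PPO2(MlpPolicy, "CartPole-v1", ent_coef=0.01)  # default is 0.01
model.learn(total_timesteps=100000)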

araffin commented on May 31, 2019 (0 reactions):

> Maybe it’s good to add something like “usually between x and y” to the documentation for all parameters?

I would rather recommend looking at:

  1. hyperparameters from the paper
  2. tuned hyperparameters present in the rl zoo

rather than having a pre-defined range.

Also, you should first use automatic hyperparameter tuning (available in the rl zoo), which saves a lot of effort compared to tuning by hand 😉.
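For reference, the zoo’s tuning is built on Optuna; a minimal sketch of the idea (the search space, budget, and evaluation here are illustrative placeholders, not the zoo’s actual configuration):

import gym
import numpy as np
import optuna
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

def evaluate(model, env, n_episodes=5):
    # Mean undiscounted return over a few rollouts (the zoo evaluates more carefully).
    returns = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action, _ = model.predict(obs)
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return float(np.mean(returns))

def objective(trial):
    # Illustrative search space; the zoo defines per-algorithm ranges.
    ent_coef = trial.suggest_loguniform("ent_coef", 1e-8, 1e-1)
    learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1e-3)
    model = PPO2(MlpPolicy, "CartPole-v1", ent_coef=ent_coef,
                 learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=20000)
    return evaluate(model, gym.make("CartPole-v1"))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)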

> this is not the duty of stable-baselines

I agree that this is not the duty of SB. And if you change the default hyperparams, you should know what you are doing.
