
In PPO, is clipping the value loss with max OK?


In the file ‘pposgd_simple.py’, line 117:

vf_loss = .5 * U.mean(tf.maximum(vfloss1, vfloss2)) # we do the same clipping-based trust region for the value function

Why is this tf.maximum and not tf.minimum?


Top GitHub Comments


lezhang-thu commented, Aug 31, 2020

Here is the corresponding code from ppo2:

        # Clip the value to reduce variability during Critic training
        # Get the predicted value
        vpred = train_model.vf
        vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED, - CLIPRANGE, CLIPRANGE)
        # Unclipped value
        vf_losses1 = tf.square(vpred - R)
        # Clipped value
        vf_losses2 = tf.square(vpredclipped - R)

        vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))

PPO is somewhat like TRPO: both optimize the objective under the condition that the update stays within a trust region. I'll describe one case so you can picture it. Consider the case OLDVPRED < R. We have to update train_model.vf, i.e. vpred, so that vpred is closer to R after the update.

Case 1: vpred is within the trust region of OLDVPRED, i.e. |vpred - OLDVPRED| <= CLIPRANGE. Nothing special needs to be done: vpredclipped equals vpred, so tf.maximum(vf_losses1, vf_losses2) degenerates to tf.square(vpred - R).

Case 2: vpred is outside the trust region of OLDVPRED. This splits into case 2.1 and case 2.2.

Case 2.1: the update would move vpred closer to the trust region and also closer to R. This is the ideal case, since moving vpred toward R both optimizes the objective and brings vpred back toward the trust region. It happens when train_model.vf < OLDVPRED - CLIPRANGE. Here vf_losses1 is the larger of the two, and this is why tf.maximum(vf_losses1, vf_losses2) is needed: we want to keep tf.square(vpred - R), which still has a gradient.

Case 2.2 is subtle: the update would move vpred closer to R but farther away from the trust region. We should not update here, as that would violate the spirit of the trust region, i.e. every update should either stay within the trust region or make the “within the trust region” condition more likely to hold. This happens, for example, when OLDVPRED + CLIPRANGE < train_model.vf < R. This time tf.maximum chooses vf_losses2. With vpredclipped pinned at OLDVPRED + CLIPRANGE, vf_losses2 is literally a constant with respect to vpred, hence there is no gradient and no update happens.
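The three cases can be checked numerically. The following is a minimal sketch in plain Python rather than TensorFlow; the function names, the `combine` parameter, and the sample values (OLDVPRED = 0, R = 1, CLIPRANGE = 0.2) are illustrative assumptions, not part of baselines. It evaluates the per-sample loss from the ppo2 snippet and estimates its gradient with respect to vpred by finite differences.

```python
def clipped_value_loss(vpred, old_vpred, ret, clip_range, combine=max):
    """Per-sample clipped value loss, mirroring the ppo2 snippet.

    `combine` is a hypothetical knob for this sketch: max reproduces the
    snippet, min shows what tf.minimum would do instead.
    """
    # Clip the change in the prediction to [-clip_range, clip_range].
    delta = max(-clip_range, min(clip_range, vpred - old_vpred))
    vpred_clipped = old_vpred + delta
    loss_unclipped = (vpred - ret) ** 2          # vf_losses1
    loss_clipped = (vpred_clipped - ret) ** 2    # vf_losses2
    return 0.5 * combine(loss_unclipped, loss_clipped)


def grad_wrt_vpred(vpred, old_vpred, ret, clip_range, combine=max, eps=1e-6):
    """Central finite-difference estimate of d(loss)/d(vpred)."""
    up = clipped_value_loss(vpred + eps, old_vpred, ret, clip_range, combine)
    down = clipped_value_loss(vpred - eps, old_vpred, ret, clip_range, combine)
    return (up - down) / (2 * eps)


# Illustrative values with OLDVPRED < R, as in the explanation.
OLD_VPRED, RET, CLIP = 0.0, 1.0, 0.2

# Case 1: vpred inside the trust region -> plain squared-error gradient.
grad_case_1 = grad_wrt_vpred(0.1, OLD_VPRED, RET, CLIP)
# Case 2.1: vpred below the trust region -> unclipped loss wins, gradient kept.
grad_case_2_1 = grad_wrt_vpred(-0.5, OLD_VPRED, RET, CLIP)
# Case 2.2: vpred above the trust region but below R -> clipped loss wins,
# which is constant in vpred, so the gradient vanishes.
grad_case_2_2 = grad_wrt_vpred(0.5, OLD_VPRED, RET, CLIP)

# The same two outside-the-region cases with min instead of max.
grad_min_2_1 = grad_wrt_vpred(-0.5, OLD_VPRED, RET, CLIP, combine=min)
grad_min_2_2 = grad_wrt_vpred(0.5, OLD_VPRED, RET, CLIP, combine=min)

print(grad_case_1, grad_case_2_1, grad_case_2_2)
print(grad_min_2_1, grad_min_2_2)
```

Under these assumed values, the max version keeps a gradient of roughly vpred - R in cases 1 and 2.1 and a zero gradient in case 2.2. Swapping in min reverses the two outside-the-region cases: the gradient vanishes in case 2.1, leaving vpred stuck outside the trust region, and survives in case 2.2, pushing vpred farther from it. That appears to be why the code uses tf.maximum.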

Hope this explanation helps anyone who needs it.


yueyang130 commented, Nov 1, 2021

@lezhang-thu Very clear and logical explanation! Thanks!
