In PPO, is clipping the value loss with max OK?
See the original GitHub issue. In file 'pposgd_simple.py', line 117:
vf_loss = .5 * U.mean(tf.maximum(vfloss1, vfloss2)) # we do the same clipping-based trust region for the value function
Why not tf.minimum?
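The definitions of vfloss1 and vfloss2 do not appear in the excerpt above. For reference, here is a minimal sketch (TensorFlow 2 style, with illustrative names and a hypothetical clip_range default, not the exact baselines source) of how this clipped value loss is typically assembled: the clipped term limits how far the new value prediction may move from the old one, and the max takes the more pessimistic of the two squared errors.

    import tensorflow as tf

    def clipped_value_loss(vpred, old_vpred, returns, clip_range=0.2):
        # Keep the new prediction within +/- clip_range of the old prediction.
        vpred_clipped = old_vpred + tf.clip_by_value(vpred - old_vpred,
                                                     -clip_range, clip_range)
        vf_losses1 = tf.square(vpred - returns)          # unclipped squared error
        vf_losses2 = tf.square(vpred_clipped - returns)  # clipped squared error
        # Pessimistic combination: take the larger of the two errors.
        return 0.5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))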
Issue Analytics
- Created: 6 years ago
- Comments: 8
Top Results From Across the Web

PPO Hyperparameters and Ranges - Medium
Clip parameter illustration from Schulman et al. ... Explanation for the Value Function loss (2nd term) from the PPO paper ...

Decaying Clipping Range in Proximal Policy Optimization - arXiv
This simple yet powerful idea prevents large policy updates during optimization.

Proximal Policy Optimization Tutorial (Part 2/2: GAE and PPO ...)
The value of epsilon is suggested to be kept at 0.2 in the paper. Critic loss is nothing but the usual mean squared...

RL - Policy Proximal Optimization and clipping - Cross Validated
Essentially, we look to increase the likelihood of an action, a_t, if the advantage function A_t > 0, and we clip the value of the...

Clipped Proximal Policy Optimization - GitHub Pages
Very similar to PPO, with several small (but very simplifying) changes: train both the value and policy networks simultaneously, by defining a single...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Code from ppo2:
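The referenced snippet is not reproduced in this page; the comment below refers to lines of roughly this shape from baselines' ppo2 (a paraphrase, not a verbatim copy; OLDVPRED, R, and CLIPRANGE are placeholders/tensors defined elsewhere in that file):

    # Clip the new value prediction so it stays within CLIPRANGE of OLDVPRED.
    vpredclipped = OLDVPRED + tf.clip_by_value(train_model.vf - OLDVPRED,
                                               -CLIPRANGE, CLIPRANGE)
    vf_losses1 = tf.square(vpred - R)         # unclipped value loss
    vf_losses2 = tf.square(vpredclipped - R)  # clipped value loss
    vf_loss = .5 * tf.reduce_mean(tf.maximum(vf_losses1, vf_losses2))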
The PPO algorithm is somewhat like TRPO: both optimize the objective under the condition that the optimization stays within a trust region. Let me describe one case so you can picture it. Consider the case where OLDVPRED < R. We have to update train_model.vf, i.e. vpred, so that vpred is closer to R after the update.

Case 1: vpred is within the trust region of OLDVPRED. Then nothing special needs to be done; tf.maximum(vf_losses1, vf_losses2) degenerates to tf.square(vpred - R).

Case 2: vpred is outside the trust region of OLDVPRED. This splits into case 2.1 and case 2.2.

Case 2.1: after the update, vpred moves closer to the trust region and also closer to R. This is the perfect case: we need to update vpred both to optimize the objective of getting closer to R and to move it back toward the trust region. It happens when train_model.vf < OLDVPRED - CLIPRANGE, and now you see why tf.maximum(vf_losses1, vf_losses2) is needed here: we want to keep tf.square(vpred - R).

Case 2.2 is subtle. This is the case where updating vpred would move it farther away from the trust region while still moving it closer to R. For case 2.2 we cannot update, as that would disobey the spirit of the trust region: all updates should be done within the trust region, or should at least make the "within the trust region" condition more likely. Case 2.2 happens, for example, when OLDVPRED + CLIPRANGE < train_model.vf < R. This time tf.maximum chooses vf_losses2, which is literally a constant with respect to vpred, hence there is no gradient and no update ever happens.

Hope this explanation helps, for anyone who needs it.
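To make the case analysis above concrete, here is a small numeric check (illustrative numbers only, NumPy instead of TensorFlow) of which branch tf.maximum selects, assuming OLDVPRED = 1.0, CLIPRANGE = 0.2, and R = 2.0, so the trust region for vpred is [0.8, 1.2]:

    import numpy as np

    def vf_loss_branch(vpred, old_vpred=1.0, ret=2.0, clip_range=0.2):
        """Return (loss, branch) where branch says which term tf.maximum picks."""
        vpred_clipped = old_vpred + np.clip(vpred - old_vpred, -clip_range, clip_range)
        l1 = (vpred - ret) ** 2          # unclipped: has a gradient w.r.t. vpred
        l2 = (vpred_clipped - ret) ** 2  # clipped: constant in vpred once clipping is active
        return max(l1, l2), ("unclipped (grad flows)" if l1 >= l2
                             else "clipped (no grad)")

    print(vf_loss_branch(1.1))  # case 1:   inside [0.8, 1.2] -> unclipped, grad flows
    print(vf_loss_branch(0.5))  # case 2.1: below the region  -> unclipped, grad pulls vpred up toward R
    print(vf_loss_branch(1.5))  # case 2.2: above the region  -> clipped, constant, no gradient

In case 2.2 the selected clipped term is constant in vpred, so no gradient flows; with tf.minimum the unclipped term would be chosen there instead, and the update would push vpred even further outside the trust region.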
@lezhang-thu Very clear and logical explanation! Thanks!