[Feature Request] Recompute the advantage of a minibatch in ppo
🚀 Feature
According to this paper, recomputing the advantages can be helpful for PPO performance.
This functionality is provided by the tianshou library, but I don't know how to add it in SB3. Some hints about how to do that would be very helpful.
Thanks!
Motivation
I am comparing stable-baselines3, tianshou, and rllib to see which gives the best PPO performance.
Pitch
Recompute the advantages during PPO training (e.g., before each optimization epoch), as sketched below.
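For context, a minimal sketch of the idea (a hypothetical helper, not SB3's or tianshou's actual code): GAE depends on the value estimates, so it can be refreshed from the current value network before every optimization epoch instead of being computed only once per rollout.

```python
import numpy as np


def recompute_gae(rewards, values, episode_starts, last_value, last_done,
                  gamma=0.99, gae_lambda=0.95):
    """Recompute GAE advantages from fresh value predictions.

    rewards, values, episode_starts: arrays of shape (n_steps,)
    last_value: value estimate for the observation after the last step
    last_done: whether the episode terminated after the last step
    """
    n_steps = len(rewards)
    advantages = np.zeros(n_steps, dtype=np.float32)
    last_gae = 0.0
    for t in reversed(range(n_steps)):
        if t == n_steps - 1:
            next_non_terminal = 1.0 - float(last_done)
            next_value = last_value
        else:
            # episode_starts[t + 1] == 1 means step t ended an episode
            next_non_terminal = 1.0 - episode_starts[t + 1]
            next_value = values[t + 1]
        delta = rewards[t] + gamma * next_value * next_non_terminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_non_terminal * last_gae
        advantages[t] = last_gae
    return advantages


# The feature request amounts to refreshing this once per epoch:
# for epoch in range(n_epochs):
#     values = value_net(observations)  # estimates from the *current* value network
#     advantages = recompute_gae(rewards, values, episode_starts, last_value, last_done)
#     returns = advantages + values
#     ...run the usual clipped PPO minibatch updates...
```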
### Checklist
- I have checked that there is no similar issue in the repo (required)
Top GitHub Comments
Hello, I recompute it after each epoch, consistent with the tianshou library:
https://github.com/thu-ml/tianshou/blob/655d5fb14fe85ea9da86b441456286fa1f078384/tianshou/policy/modelfree/ppo.py#L107
I pasted the main modification below. Hopefully you can help check whether there are any potential problems.
To support multiple envs, I did what you suggested before: avoid overwriting the buffer variables (https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/common/buffers.py#L441) and reshape them whenever sampling. We don't have to do this if we only use one env, but reshaping on every sample heavily slows down the learning process… Do you have a good solution for this?
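For illustration, here is a rough sketch of what such a per-epoch recomputation could look like (this is not the exact modification discussed above; `refresh_advantages` is a hypothetical helper, it assumes a non-dict observation space, an SB3 version that provides `policy.predict_values`, and that `last_obs` / `last_dones` are the post-rollout observation and done flags):

```python
import numpy as np
import torch as th

from stable_baselines3 import PPO
from stable_baselines3.common.utils import obs_as_tensor


def refresh_advantages(model: PPO, last_obs: np.ndarray, last_dones: np.ndarray) -> None:
    """Re-predict values with the current network and recompute returns/advantages.

    Sketch only: must be called while the rollout buffer still holds data in
    (buffer_size, n_envs, ...) shape, i.e. before buffer.get() flattens it.
    """
    buffer = model.rollout_buffer
    with th.no_grad():
        for step in range(buffer.buffer_size):
            obs_tensor = obs_as_tensor(buffer.observations[step], model.device)
            # Overwrite the stored values with estimates from the current value net
            buffer.values[step] = model.policy.predict_values(obs_tensor).cpu().numpy().flatten()
        # Value of the observation that follows the last step of the rollout
        last_values = model.policy.predict_values(obs_as_tensor(last_obs, model.device))
    # Recompute GAE advantages and returns from the refreshed value estimates
    buffer.compute_returns_and_advantage(last_values=last_values, dones=last_dones)
```

In SB3, the private attributes `model._last_obs` and `model._last_episode_starts` hold exactly this post-rollout information after `collect_rollouts()`, so one option (version-dependent) is to pass those. The helper would have to run at the start of each optimization epoch, i.e. before `rollout_buffer.get()` flattens the data, which is exactly where the reshaping overhead mentioned above comes from.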
I also tried recomputing it after sampling `bs` observations (i.e., at each gradient step) yesterday. I returned the sample indices along with the sampled data, but unluckily I didn't manage to make it work. :<

Solved… 😃 Thanks