High Variance on RNN-based PPO agents
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): Source
- Ray version: 0.6.2
- Python version: 3.6.7
- Exact command to reproduce:
use_lstm: True
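For reproduction, the setting above maps onto RLlib's model config. A minimal sketch, assuming the agent API of Ray 0.6.x (`PPOAgent` was renamed `PPOTrainer` in later releases); the cell size is an illustrative choice, not taken from the original report:

```python
import ray
from ray.rllib.agents.ppo import PPOAgent  # renamed PPOTrainer in later Ray releases

ray.init()

# PPO on CartPole with the recurrent model wrapper enabled.
agent = PPOAgent(
    env="CartPole-v0",
    config={
        "model": {
            "use_lstm": True,       # wrap the default FC network with an LSTM cell
            "lstm_cell_size": 256,  # illustrative cell size
        },
    },
)

for i in range(10):
    result = agent.train()
    print(i, result["episode_reward_mean"])
```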
Describe the problem
I’ve noticed that during PPO training, the variance of the policy loss is dramatically higher when an RNN is added than when it is left out. See the plots below: the first is from a custom scenario I’m training (red is the non-recurrent baseline, green is a custom GRU, and pink is an LSTM), and the second is from stateless CartPole (orange is with an LSTM, blue is without; NB: I forced the example script not to load an LSTM).
In the case of CartPole, the policy with the LSTM still learns to ‘solve’ the scenario, of course, but I’ve noticed that performance does not seem as high as expected in more complex scenarios.
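For context, “stateless CartPole” here means CartPole with the velocity components hidden, so the task is only solvable with memory. A minimal sketch of such a wrapper, assuming Gym’s classic observation layout [x, x_dot, theta, theta_dot]; the class name and approach are illustrative, not the exact script referenced above:

```python
import gym
import numpy as np
from gym.spaces import Box

class StatelessCartPole(gym.ObservationWrapper):
    """CartPole with the velocity components masked out, making the task
    partially observed so that an RNN is required to solve it.

    Assumes the classic observation layout [x, x_dot, theta, theta_dot];
    only the positions [x, theta] are returned.
    """

    def __init__(self, env=None):
        super().__init__(env or gym.make("CartPole-v0"))
        # Keep only the (finite) position bounds from the original space.
        high = self.env.observation_space.high[[0, 2]]
        self.observation_space = Box(low=-high, high=high, dtype=np.float32)

    def observation(self, obs):
        # Drop x_dot and theta_dot, keeping cart and pole positions.
        return obs[[0, 2]].astype(np.float32)
```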
I was therefore wondering whether this higher variance in the policy loss is expected behaviour when adding an RNN, and if so, why that is the case.
As an aside, I did a bit of a deep dive and noticed that the high variance mainly stems from the advantage calculation. I therefore added an RNN to the value function as well (a similar approach to the one here), but this did not curb the behaviour.
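For reference, PPO's advantages come from generalized advantage estimation (GAE): A_t = sum over l of (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). A minimal NumPy sketch (not RLlib's implementation) showing where noisy value predictions, e.g. from a recurrent value function on truncated sequences, enter the variance:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    `values` must contain one extra entry: the bootstrap value for the
    state following the last step. Noise in these value predictions feeds
    directly into the TD residuals (deltas), and hence into the variance
    of the advantages that weight PPO's policy loss.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```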
Top GitHub Comments
Hi Eric,
So I did a bit more experimentation, and I now believe that the increase in variance I observed was in keeping with the environment.
I was able to achieve similar variance with non-RNN policy networks by putting certain constraints on the agent (i.e., forcing it to explore new states). This suggests the RNNs were having a similar effect, though I would need to view the rollouts to fully confirm. Furthermore, I did NOT notice a stacking effect (i.e., LSTM + constraints giving double the variance); instead, the variance seemed rather dichotomous: either the small or the large plots I showed above, with nothing in between.
So whilst the variance issue now seems relatively moot, RNNs were still decreasing overall performance. That merits further investigation, but the variance does not appear to be a major contributor. If I have any updates on this I will let you know.
Closing this since it can’t be reproduced; feel free to reopen.