High Variance on RNN-based PPO agents
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): Source
- Ray version: 0.6.2
- Python version: 3.6.7
- Exact command to reproduce:
use_lstm: True
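For reproduction, the setting above maps onto RLlib's model config. A minimal sketch, assuming the agent API of Ray 0.6.x (`PPOAgent` was renamed `PPOTrainer` in later releases); the cell size is an illustrative choice, not taken from the original report:

```python
import ray
from ray.rllib.agents.ppo import PPOAgent  # renamed PPOTrainer in later Ray releases

ray.init()

# PPO on CartPole with the recurrent model wrapper enabled.
agent = PPOAgent(
    env="CartPole-v0",
    config={
        "model": {
            "use_lstm": True,       # wrap the default FC network with an LSTM cell
            "lstm_cell_size": 256,  # illustrative cell size
        },
    },
)

for i in range(10):
    result = agent.train()
    print(i, result["episode_reward_mean"])
```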
Describe the problem
I’ve noticed that during PPO training, the variance of the policy loss is dramatically higher when an RNN is added than when it is left out. See the plots below: the first is from a custom scenario I’m training (red is the non-recurrent baseline, green is a custom GRU, and pink is an LSTM), and the second is from stateless CartPole (orange is with an LSTM, blue is without; NB: I forced the example script not to load an LSTM).
In the case of CartPole, the policy with the LSTM still learns to ‘solve’ the scenario, of course, but I’ve noticed that performance does not seem as high as expected in more complex scenarios.
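For context, “stateless CartPole” here means CartPole with the velocity components hidden, so the task is only solvable with memory. A minimal sketch of such a wrapper, assuming Gym’s classic observation layout [x, x_dot, theta, theta_dot]; the class name and approach are illustrative, not the exact script referenced above:

```python
import gym
import numpy as np
from gym.spaces import Box

class StatelessCartPole(gym.ObservationWrapper):
    """CartPole with the velocity components masked out, making the task
    partially observed so that an RNN is required to solve it.

    Assumes the classic observation layout [x, x_dot, theta, theta_dot];
    only the positions [x, theta] are returned.
    """

    def __init__(self, env=None):
        super().__init__(env or gym.make("CartPole-v0"))
        # Keep only the (finite) position bounds from the original space.
        high = self.env.observation_space.high[[0, 2]]
        self.observation_space = Box(low=-high, high=high, dtype=np.float32)

    def observation(self, obs):
        # Drop x_dot and theta_dot, keeping cart and pole positions.
        return obs[[0, 2]].astype(np.float32)
```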
I was therefore wondering whether this higher variance in the policy loss is expected behaviour when adding an RNN, and if so, why that is the case.
As an aside, I did a bit of a deep dive and noticed that the high variance mainly stems from the advantage calculation. I therefore added an RNN to the value function as well (a similar approach to the one here), but this did not curb the behaviour.
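For reference, PPO's advantages come from generalized advantage estimation (GAE): A_t = sum over l of (gamma * lambda)^l * delta_{t+l}, where delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). A minimal NumPy sketch (not RLlib's implementation) showing where noisy value predictions, e.g. from a recurrent value function on truncated sequences, enter the variance:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single rollout.

    `values` must contain one extra entry: the bootstrap value for the
    state following the last step. Noise in these value predictions feeds
    directly into the TD residuals (deltas), and hence into the variance
    of the advantages that weight PPO's policy loss.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]  # TD residuals
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```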
Top GitHub Comments
Hi Eric,
So I did a bit more experimentation, and I now believe that the increase in variance I observed was in keeping with the environment.
I was able to achieve similar variance with non-RNN policy networks by putting certain constraints on the agent (i.e., forcing it to explore new states). This suggests the RNNs were having a similar effect, though I would need to view the rollouts to fully confirm. Furthermore, I did NOT notice a stacking effect (i.e., LSTM + constraints giving double the variance); instead, the variance seemed rather dichotomous: either the small or the large plots I showed above, with nothing in between.
So whilst the variance issue now seems relatively moot, RNNs were still decreasing overall performance. That merits further investigation, but the variance does not appear to be a major contributor. If I have any updates on this I will let you know.
Closing this since it can’t be reproduced; feel free to reopen.