Bad performance of LSTM policies with PPO
System information
- OS Platform and Distribution: Ubuntu 18.04
- Ray installed from (source or binary): source (master)
- Ray version: 0.8.0.dev2
- Python version: 3.7
Problem
LSTM policies can’t match the performance of feed-forward policies, even on tasks where having state (memory) gives an advantage.

Here the orange agent is feed-forward and the blue agent is an RNN. Both policies are trained for the same amount of wall-clock time. In this particular task the agent has to remember the color of the object in the middle of the room and collect objects of the corresponding color to maximize reward. This is not a standard environment, but I’ve observed a similar performance gap across many tasks and hyperparameter combinations. This video demonstrates the task: https://github.com/edbeeching/3d_control_deep_rl/blob/master/videos/two_color_example.gif
Here’s my current config:
env: doom_two_colors_fixed
run: PPO
config:
    lr: 0.0001
    lambda: 0.95
    kl_coeff: 0.5
    clip_rewards: False
    clip_param: 0.1
    vf_clip_param: 100000.0
    entropy_coeff: 0.0005
    train_batch_size: 6144
    sample_batch_size: 64
    sgd_minibatch_size: 512
    num_sgd_iter: 4
    num_workers: 18
    num_envs_per_worker: 8
    batch_mode: truncate_episodes
    observation_filter: NoFilter
    vf_share_layers: true
    num_gpus: 1
    model:
        custom_model: vizdoom_vision_model
        conv_filters: [
            [32, [8, 8], 4],
            [64, [4, 4], 2],
            [64, [3, 3], 2],
            [128, [3, 3], 2],
        ]
        conv_activation: elu
        fcnet_activation: elu  # was tanh
        use_lstm: True
        max_seq_len: 32
        lstm_cell_size: 256
        lstm_use_prev_action_reward: False
        framestack: False
        grayscale: False
        zero_mean: False
I tried many hyperparameter combinations and none of them led to an improvement: a larger minibatch size, kl_coeff set to 0, different entropy coefficients, different num_sgd_iter values, and a larger max_seq_len.
I might be missing something important, so suggestions are very welcome!
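For readers trying to reason about the config above, here is a rough back-of-the-envelope sketch of what those numbers imply for recurrent training. It assumes (this is not confirmed in the thread) that each environment copy contributes one sample_batch_size-step fragment per sampling round, and that fragments are cut into chunks of at most max_seq_len steps before SGD. The helper `chunk_lengths` is hypothetical and only illustrates the chunking idea; it is not an RLlib function.

```python
import math

# Values copied from the config above.
train_batch_size = 6144
sample_batch_size = 64
sgd_minibatch_size = 512
num_sgd_iter = 4
num_workers = 18
num_envs_per_worker = 8
max_seq_len = 32


def chunk_lengths(fragment_len, max_len):
    """Hypothetical helper: lengths of the chunks a fragment is cut into."""
    full, rest = divmod(fragment_len, max_len)
    return [max_len] * full + ([rest] if rest else [])


# Assumption: one fragment per environment copy per sampling round.
steps_per_round = sample_batch_size * num_envs_per_worker * num_workers

# How many full-length sequences fit into one SGD minibatch.
seqs_per_minibatch = sgd_minibatch_size // max_seq_len

# Gradient steps taken on each train batch.
minibatches_per_epoch = math.ceil(train_batch_size / sgd_minibatch_size)
updates_per_batch = minibatches_per_epoch * num_sgd_iter

print("steps collected per sampling round:", steps_per_round)
print("sequences per minibatch:", seqs_per_minibatch)
print("gradient updates per train batch:", updates_per_batch)
print("chunks from one 64-step fragment:", chunk_lengths(sample_batch_size, max_seq_len))
```

Under these assumptions a minibatch holds only 16 full-length sequences, and a 64-step fragment becomes two 32-step chunks, so backpropagation through time would not span more than 32 steps. Whether that is enough for the agent to remember the target color depends on how long the episodes in this task are, which the thread does not state.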
Issue Analytics
- State:
- Created 4 years ago
- Reactions:4
- Comments:34 (20 by maintainers)

Here are some data points for PPO and IMPALA on Breakout:
Overall it seems the LSTM policy can succeed on Breakout, though not quite as fast as the framestacked solution. Interestingly, the choice of optimizer has a huge impact: for IMPALA Adam flatlines, while for PPO it’s RMSprop that doesn’t work. I don’t think it’s surprising that an LSTM policy takes longer to learn, but it does seem to be much more brittle with respect to hyperparameters.
The full hyperparameters:
My issue turned out to be enabling CuDNN at the driver level. I’ve been using the PPO algorithm on both StarCraft and other environments, and it seems to work just fine. Some tips:

Pay attention to what @ericl mentioned regarding the vf_coeff. The vf_loss can be much larger than the policy loss (depending on the game), so use the vf_coeff to scale it down in line with the policy loss.

Pay attention to the LSTM architecture. There are some subtleties in the architecture that may impact performance. I based my architecture on: https://github.com/ray-project/ray/blob/master/rllib/examples/custom_keras_rnn_model.py

From what I can tell, LSTMs train more slowly but will reach a higher ultimate average reward. FF networks train faster in the beginning but max out earlier. I’m still not sure exactly why. But if you follow OpenAI’s work on both their Dota 2 Five network architecture and their robotic arm manipulation paper, you’ll see that they highly recommend using an LSTM.

Lastly, I would advise you to be patient with the training run. Any environment complex enough to need an LSTM and scalable PPO is going to take a while to train and show results. I literally spent weeks iterating through the PPO hyperparameters and network architecture. Some of that time was spent upgrading to the latest Keras API, but much of it was spent trying to find hyperparameters that gave replicable results. Try to change one parameter at a time if possible, and remember that there’s a lot of variance in results even when nothing changes. So, try to pick out signal from noise.
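To make the tip about the value-function coefficient concrete, here is a minimal sketch of how the terms of a PPO-style objective combine and how a coefficient could be picked so that the weighted value loss lands near the magnitude of the policy loss. The function names and the loss magnitudes are made up for illustration, not taken from the runs in this thread. In RLlib’s PPO config the corresponding key is, as far as I know, `vf_loss_coeff`.

```python
def ppo_total_loss(policy_loss, vf_loss, entropy, vf_coeff, entropy_coeff):
    """Weighted sum of the usual PPO terms (all inputs are scalar means)."""
    return policy_loss + vf_coeff * vf_loss - entropy_coeff * entropy


def suggest_vf_coeff(policy_loss, vf_loss, eps=1e-8):
    """Heuristic: scale the value loss to the magnitude of the policy loss."""
    return abs(policy_loss) / (abs(vf_loss) + eps)


# Hypothetical magnitudes, as might be read off training logs.
policy_loss = -0.02
vf_loss = 150.0
entropy = 1.5
entropy_coeff = 0.0005  # same value as in the config earlier in the thread

unscaled = ppo_total_loss(policy_loss, vf_loss, entropy, 1.0, entropy_coeff)
vf_coeff = suggest_vf_coeff(policy_loss, vf_loss)
scaled = ppo_total_loss(policy_loss, vf_loss, entropy, vf_coeff, entropy_coeff)

print("total loss with vf_coeff=1.0:", unscaled)
print("suggested vf_coeff:", vf_coeff)
print("total loss with suggested vf_coeff:", scaled)
```

With a coefficient of 1.0 the value term dominates the total in this example, which matters most when the policy and value function share layers (vf_share_layers: true in the config above), because the shared layers then receive gradients from both terms. The ratio heuristic is only a starting point; the loss magnitudes drift during training, so it is worth rechecking them from the logs.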