Bad performance of LSTM policies with PPO

System information

  • OS Platform and Distribution: Ubuntu 18.04
  • Ray installed from (source or binary): source (master)
  • Ray version: 0.8.0.dev2
  • Python version: 3.7

Problem

LSTM policies can’t match the performance of feed-forward policies, even on tasks where having state (memory) should give an advantage.

[Training curves: orange = feed-forward agent, blue = LSTM agent]

Here the orange agent is feed-forward and the blue one is recurrent (LSTM). Both policies are trained for the same amount of wall-clock time. In this particular task the agent has to remember the color of the object in the middle of the room and collect objects of the corresponding color to maximize reward. This is not a standard environment, but I’ve observed a similar performance gap across many task and hyperparameter combinations. This video demonstrates the task: https://github.com/edbeeching/3d_control_deep_rl/blob/master/videos/two_color_example.gif

Here’s my current config:

env: doom_two_colors_fixed
run: PPO
config:
    lr: 0.0001
    lambda: 0.95
    kl_coeff: 0.5
    clip_rewards: False
    clip_param: 0.1
    vf_clip_param: 100000.0
    entropy_coeff: 0.0005
    train_batch_size: 6144
    sample_batch_size: 64
    sgd_minibatch_size: 512
    num_sgd_iter: 4
    num_workers: 18
    num_envs_per_worker: 8
    batch_mode: truncate_episodes
    observation_filter: NoFilter
    vf_share_layers: true
    num_gpus: 1
    model:
        custom_model: vizdoom_vision_model
        conv_filters: [
            [32, [8, 8], 4],
            [64, [4, 4], 2],
            [64, [3, 3], 2],
            [128, [3, 3], 2],
        ]
        conv_activation: elu
        fcnet_activation: elu  # was tanh

        use_lstm: True
        max_seq_len: 32
        lstm_cell_size: 256
        lstm_use_prev_action_reward: False
        framestack: False
        grayscale: False
        zero_mean: False
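
A minimal, untested sketch of launching roughly the same experiment from Python with ray.tune is shown below. The env "doom_two_colors_fixed" and model "vizdoom_vision_model" are specific to this setup and would need to be registered first (via tune.register_env / ModelCatalog.register_custom_model); a few YAML keys are omitted for brevity, and the stopping criterion is arbitrary.

# Minimal sketch (untested): launching roughly the experiment above from Python.
# "doom_two_colors_fixed" and "vizdoom_vision_model" are the author's custom env
# and model and must be registered beforehand.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"timesteps_total": 100_000_000},  # arbitrary illustrative stopping point
    config={
        "env": "doom_two_colors_fixed",
        "lr": 1e-4,
        "lambda": 0.95,
        "kl_coeff": 0.5,
        "train_batch_size": 6144,
        "sample_batch_size": 64,
        "sgd_minibatch_size": 512,
        "num_sgd_iter": 4,
        "num_workers": 18,
        "num_envs_per_worker": 8,
        "num_gpus": 1,
        "model": {
            "custom_model": "vizdoom_vision_model",
            "use_lstm": True,
            "max_seq_len": 32,
            "lstm_cell_size": 256,
        },
    },
)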

I tried many hyperparameter combinations and none of them led to an improvement: a larger minibatch size, setting kl_coeff to 0, changing the entropy coefficient, varying num_sgd_iter, and a larger max_seq_len (a sketch of how such a sweep could be expressed follows below).
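
A hypothetical way to express such a sweep with ray.tune's grid_search; the values below are illustrative, not the exact combinations that were tried.

# Hypothetical sweep over the hyperparameters mentioned above; the specific
# values are illustrative. These overrides would be merged into the base PPO
# config before passing it to tune.run.
from ray import tune

sweep_overrides = {
    "kl_coeff": tune.grid_search([0.0, 0.5]),
    "entropy_coeff": tune.grid_search([0.0005, 0.005]),
    "num_sgd_iter": tune.grid_search([4, 10]),
    "sgd_minibatch_size": tune.grid_search([512, 2048]),
    "model": {"max_seq_len": tune.grid_search([32, 64])},
}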

I might be missing something important, so suggestions are very welcome!

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 4
  • Comments: 34 (20 by maintainers)

Top GitHub Comments

4 reactions
ericl commented, Aug 8, 2019

Here are some data points for PPO and IMPALA on Breakout:

  • gray: IMPALA with framestacking, feedforward
  • dark blue: IMPALA with LSTM, RMSProp
  • light blue: PPO with LSTM, Adam
  • orange: PPO with LSTM, RMSProp (had to patch the code)
  • red: IMPALA with LSTM, Adam

[Training curves on Breakout for the five runs listed above]

Overall it seems the LSTM policy can succeed on Breakout, though not quite as fast as the framestacked solution. Interestingly, the choice of optimizer has a huge impact: for IMPALA, Adam flatlines, while for PPO it’s RMSProp that doesn’t work. I don’t think it’s surprising that an LSTM policy takes longer to learn, but it does seem to be much more brittle with respect to hyperparameters.
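
A rough, version-dependent sketch of what such an optimizer patch could look like with the policy-builder hooks of that era's RLlib; module paths and the with_updates / optimizer_fn API differ across Ray releases, so this is illustrative rather than the actual patch used here.

# Illustrative only: swapping PPO's default Adam optimizer for RMSProp via the
# policy-builder override hooks. Exact module paths vary by Ray version.
import tensorflow as tf
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_policy import PPOTFPolicy  # ppo_tf_policy in newer Ray

def rmsprop_fn(policy, config):
    # Reuse the learning rate from the trainer config, but with RMSProp.
    return tf.train.RMSPropOptimizer(config["lr"], decay=0.99, epsilon=1e-5)

RMSPropPPOPolicy = PPOTFPolicy.with_updates(
    name="RMSPropPPOPolicy", optimizer_fn=rmsprop_fn)
RMSPropPPOTrainer = PPOTrainer.with_updates(
    name="RMSPropPPO", default_policy=RMSPropPPOPolicy)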

The full hyperparameters:

atari-impala:
    env: BreakoutNoFrameskip-v4
    run: IMPALA
    config:
        sample_batch_size: 50
        train_batch_size: 500
        num_workers: 32
        num_envs_per_worker: 5
        clip_rewards: True
        lr_schedule: [
            [0, 0.0005],
            [20000000, 0.000000000001],
        ]
        opt_type:
            grid_search:
                - adam
                - rmsprop
        model:
            conv_activation: elu
            framestack: false
            use_lstm: true
atari-ppo:
    env: BreakoutNoFrameskip-v4
    run: PPO
    config:
        lambda: 0.95
        kl_coeff: 0.5
        clip_rewards: True
        clip_param: 0.1
        vf_clip_param: 10.0
        entropy_coeff: 0.01
        train_batch_size: 5000
        sample_batch_size: 100
        sgd_minibatch_size: 500
        num_sgd_iter: 10
        num_workers: 10
        num_envs_per_worker: 5
        batch_mode: truncate_episodes
        observation_filter: NoFilter
        vf_share_layers: true
        num_gpus: 1
        model:
            conv_activation: elu
            framestack: false
            use_lstm: true
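
These YAMLs follow RLlib's tuned-examples format, where env sits at the top level of each experiment rather than inside config. A hedged sketch of running them from Python, assuming a hypothetical file atari_lstm.yaml that contains the two experiments above:

# Hedged sketch: running tuned-example style YAMLs via tune from Python.
# "atari_lstm.yaml" is a hypothetical file holding the two experiments above.
import yaml
from ray import tune

with open("atari_lstm.yaml") as f:
    experiments = yaml.safe_load(f)

# Tuned-example YAMLs keep "env" at the top level; fold it into each config
# before handing the experiments to tune (roughly what `rllib train -f` does).
for exp in experiments.values():
    exp["config"]["env"] = exp.pop("env")

tune.run_experiments(experiments)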

2 reactions
josjo80 commented, Oct 31, 2019

My issue turned out to be enabling CuDNN at the driver level. I’ve been using the PPO algorithm on StarCraft and other environments, and it seems to work just fine. Some tips:

Pay attention to what @ericl mentioned regarding the vf_coeff. The vf_loss can be much larger than the policy loss (depending on the game), so use the vf_coeff to scale it down in line with the policy loss.

Pay attention to the LSTM architecture. There are some subtleties in the architecture that may impact performance. I based my architecture on: https://github.com/ray-project/ray/blob/master/rllib/examples/custom_keras_rnn_model.py

From what I can tell, LSTMs train more slowly but reach a higher ultimate average reward, while feed-forward networks train faster in the beginning and plateau earlier. I’m still not sure exactly why. But if you follow OpenAI’s work on both their Dota 2 Five network architecture and their robotic arm manipulation paper, you’ll see that they highly recommend using an LSTM.

Lastly, be patient with the training run. Any environment complex enough to need an LSTM and scalable PPO is going to take a while to train before you see results. I literally spent weeks iterating through the PPO hyperparameters and network architecture. Some of that time was spent upgrading to the latest Keras API, but much of it was spent trying to get hyperparameters that produced replicable results. Try to change one parameter at a time if possible, and remember there’s a lot of variance in results regardless of the changes you make, so try to pick out signal from noise.
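
To make the vf_coeff tip concrete: in RLlib's PPO config the corresponding key is vf_loss_coeff. A hedged sketch follows, where the 0.1 value is purely illustrative and would normally be chosen only after comparing the vf_loss and policy_loss reported in the learner stats.

# Hedged sketch: scaling the value-function loss relative to the policy loss.
# 0.1 is illustrative; inspect learner stats (vf_loss vs. policy_loss) first.
from ray import tune

tune.run(
    "PPO",
    stop={"timesteps_total": 10_000_000},  # illustrative stopping point
    config={
        "env": "BreakoutNoFrameskip-v4",  # stand-in environment for the example
        "vf_share_layers": True,          # shared layers couple the two loss scales
        "vf_loss_coeff": 0.1,             # scale vf_loss down toward the policy loss
        "model": {"use_lstm": True},
    },
)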
