
How can we train LSTM PPO?

See original GitHub issue

Hi, I am trying to train LSTM PPO on Hopper-v3, but it does not learn well. Although an LSTM policy is harder to train than a feed-forward (FF) policy, it seems there are several missing pieces needed to train it properly.

Could you give some advice to train LSTM PPO?

Thank you so much.

[Screenshot: training curves for LSTM PPO on Hopper-v3]

For reference, my settings are as follows:

    algo=dict(
        discount=0.99,
        learning_rate=3e-4,
        clip_grad_norm=1e6,
        entropy_loss_coeff=0.0,
        gae_lambda=0.95,
        minibatches=128,
        epochs=10,
        ratio_clip=0.2,
        normalize_advantage=True,
        linear_lr_schedule=True,
        bootstrap_timelimit=False,
    ),
    sampler=dict(
        batch_B=16,
        max_decorrelation_steps=400,
    ),

I tried batch_T values of 2048, 256, 40, 32, and 16.
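
For scale, here is a quick illustrative calculation of how much data each PPO iteration collects with batch_B=16 for those batch_T values, assuming the usual convention that one iteration gathers batch_T * batch_B transitions and splits them into minibatches for the gradient updates (the helper below is made up for illustration):

    # Illustrative only: data per PPO iteration for the settings above.
    def samples_per_iteration(batch_T, batch_B=16, minibatches=128):
        total = batch_T * batch_B             # transitions gathered per iteration
        per_minibatch = total // minibatches  # rough size of each gradient minibatch
        return total, per_minibatch

    for batch_T in (2048, 256, 40, 32, 16):
        total, per_mb = samples_per_iteration(batch_T)
        print(f"batch_T={batch_T:4d}: {total:5d} transitions/iter, ~{per_mb} per minibatch")

With the smaller batch_T values, 128 minibatches leaves only a handful of transitions per gradient step, and a recurrent agent usually needs whole time sequences kept together inside each minibatch, which is part of why the suggestion below sets batch_B equal to minibatches.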

Also: the LSTM return did not increase after the point shown in the plot.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

2 reactions
yannidd commented, Aug 4, 2020

TLDR: I think I found a solution: set value_loss_coeff=1e-3 and batch_B=minibatches.

This issue is now closed, but it seems that no solution was found. I was recently training a MujocoLstmAgent on a custom robot environment and initially had the same problem of the agent not being able to learn. After a quick inspection, I saw that the actor and the critic are implemented in a single network; print(agent.model) yields:

MujocoLstmModel(
  (mlp): MlpModel(
    (model): Sequential(
      (0): Linear(in_features=77, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): ReLU()
    )
  )
  (lstm): LSTM(275, 256)
  (head): Linear(in_features=256, out_features=37, bias=True)
)

According to a similar issue (ray-project/ray#5278), the problem comes from the value function loss being huge compared to the policy loss. The solution is to set the value_loss_coeff parameter to a low value (I used 1e-3) to balance the two losses. Below is the full configuration I used (note that I normally set batch_B=minibatches):

  sampler = Sampler(
      ...
      EnvCls=gym_make,
      CollectorCls=CpuWaitResetCollector,
      batch_T=256,
      batch_B=8,
      max_decorrelation_steps=0,
  )
  algo = PPO(
      ...
      discount=0.99,
      learning_rate=3e-4,
      clip_grad_norm=1e6,
      entropy_loss_coeff=0.0,
      gae_lambda=0.95,
      minibatches=8,
      epochs=10,
      value_loss_coeff=1e-3,
      ratio_clip=0.2,
      normalize_advantage=True,
      linear_lr_schedule=False,
  )
  agent = MujocoLstmAgent(model_kwargs=dict(normalize_observation=False))
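
For context, here is a minimal sketch of how a shared-network PPO typically combines the policy and value objectives; this is the generic recipe rather than rlpyt's exact code, and the tensor names are placeholders:

    import torch

    def combined_ppo_loss(ratio, advantage, value, return_, entropy,
                          ratio_clip=0.2, value_loss_coeff=1e-3,
                          entropy_loss_coeff=0.0):
        # Clipped surrogate objective (to be maximized, hence the minus sign).
        surr_1 = ratio * advantage
        surr_2 = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip) * advantage
        pi_loss = -torch.min(surr_1, surr_2).mean()

        # Value regression loss. With a shared MLP+LSTM trunk its gradients flow
        # through the same parameters as the policy, so its scale matters.
        value_loss = 0.5 * (value - return_).pow(2).mean()

        # Optional entropy bonus (zero in the configs above).
        entropy_bonus = entropy.mean()

        # A small value_loss_coeff (e.g. 1e-3) keeps large returns from swamping
        # the policy gradient through the shared layers.
        return pi_loss + value_loss_coeff * value_loss - entropy_loss_coeff * entropy_bonus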

Hope this helps someone 😄

1 reaction
jd730 commented, Mar 2, 2020

Hi, @astooke. Unfortunately, due to other work, I have not tried it yet. I'll close this issue, and if I have any updates, I will reopen it and share the results.

Thank you for your help.

@MasterScrat, you may add some lines for wandb in dump_tabular in logger.py.
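
A minimal sketch of that idea, assuming the key/value pairs collected by dump_tabular are available as a dict; the names tabular_dict, itr, and the project name are placeholders:

    import wandb

    wandb.init(project="rlpyt-lstm-ppo")  # placeholder project name

    def dump_tabular_to_wandb(tabular_dict, itr):
        # Mirror one iteration's diagnostics to wandb alongside the normal
        # console/CSV dump; tabular_dict maps metric names to values.
        wandb.log(tabular_dict, step=itr)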

Read more comments on GitHub >

Top Results From Across the Web

  • A PPO+LSTM Guide - Nikos Pitsillos
    We initially need to sample a batch of timesteps. We need to consider these as the starting points of the sequences which we...
  • PPO+LSTM Implementation : r/reinforcementlearning - Reddit
    Hello can someone point to a repository that contains a PPO+LSTM implementation along with an explanatory blog post or something of that ...
  • Proximal Policy Optimisation with PyTorch using Recurrent ...
    When capturing a trajectory for training a model it is easy to initialise the LSTM hidden state and cell state to zero. Then...
  • Recurrent PPO - Stable Baselines3 Contrib docs
    Recurrent policy class for actor-critic algorithms (has both policy and value prediction). To be used with A2C, PPO and the likes. It assumes...
  • Stale hidden states in PPO-LSTM - Kamal
    I've been using Proximal Policy Optimization (PPO, Schulman et al. 2017) to train agents to accomplish gridworld tasks. The neural net ...
