
How can we train LSTM PPO?

See original GitHub issue

Hi, I am trying to train LSTM PPO on Hopper-v3, but it does not learn well. Although an LSTM policy is harder to train than a feed-forward (FF) policy, it seems there are several missing pieces needed to train it properly.

Could you give some advice to train LSTM PPO?

Thank you so much.

[Screenshot: training curves for LSTM PPO on Hopper-v3]

For reference, my settings are as follows:

    algo=dict(
        discount=0.99,
        learning_rate=3e-4,
        clip_grad_norm=1e6,
        entropy_loss_coeff=0.0,
        gae_lambda=0.95,
        minibatches=128,
        epochs=10,
        ratio_clip=0.2,
        normalize_advantage=True,
        linear_lr_schedule=True,
        bootstrap_timelimit=False,
    ),
    sampler=dict(
        batch_B=16,
        max_decorrelation_steps=400,
    ),

I tried batch_T values of 2048, 256, 40, 32, and 16.
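
For scale, here is a quick illustrative calculation of how much data each PPO iteration collects with batch_B=16 for those batch_T values, assuming the usual convention that one iteration gathers batch_T * batch_B transitions and splits them into minibatches for the gradient updates (the helper below is made up for illustration):

    # Illustrative only: data per PPO iteration for the settings above.
    def samples_per_iteration(batch_T, batch_B=16, minibatches=128):
        total = batch_T * batch_B             # transitions gathered per iteration
        per_minibatch = total // minibatches  # rough size of each gradient minibatch
        return total, per_minibatch

    for batch_T in (2048, 256, 40, 32, 16):
        total, per_mb = samples_per_iteration(batch_T)
        print(f"batch_T={batch_T:4d}: {total:5d} transitions/iter, ~{per_mb} per minibatch")

With the smaller batch_T values, 128 minibatches leaves only a handful of transitions per gradient step, and a recurrent agent usually needs whole time sequences kept together inside each minibatch, which is part of why the suggestion below sets batch_B equal to minibatches.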

Also: the LSTM return did not increase after the point shown in the plot.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

2 reactions
yannidd commented, Aug 4, 2020

TLDR: I think I found a solution: set value_loss_coeff=1e-3 and batch_B=minibatches.

This issue is now closed, but it seems that no solution was found. I was recently training a MujocoLstmAgent on a custom robot environment and initially had the same problem of the agent not being able to learn. After a quick inspection, I saw that the actor and the critic are implemented in a single network; print(agent.model) yields:

MujocoLstmModel(
  (mlp): MlpModel(
    (model): Sequential(
      (0): Linear(in_features=77, out_features=256, bias=True)
      (1): ReLU()
      (2): Linear(in_features=256, out_features=256, bias=True)
      (3): ReLU()
    )
  )
  (lstm): LSTM(275, 256)
  (head): Linear(in_features=256, out_features=37, bias=True)
)

According to a similar issue (ray-project/ray#5278), the problem comes from the value function loss being huge compared to the policy loss. The solution is to set the value_loss_coeff parameter to a low value (I used 1e-3) to balance the two losses. Below is the full configuration I used (note that I normally set batch_B=minibatches):

  sampler = Sampler(
      ...
      EnvCls=gym_make,
      CollectorCls=CpuWaitResetCollector,
      batch_T=256,
      batch_B=8,
      max_decorrelation_steps=0,
  )
  algo = PPO(
      ...
      discount=0.99,
      learning_rate=3e-4,
      clip_grad_norm=1e6,
      entropy_loss_coeff=0.0,
      gae_lambda=0.95,
      minibatches=8,
      epochs=10,
      value_loss_coeff=1e-3,
      ratio_clip=0.2,
      normalize_advantage=True,
      linear_lr_schedule=False,
  )
  agent = MujocoLstmAgent(model_kwargs=dict(normalize_observation=False))
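
For context, here is a minimal sketch of how a shared-network PPO typically combines the policy and value objectives; this is the generic recipe rather than rlpyt's exact code, and the tensor names are placeholders:

    import torch

    def combined_ppo_loss(ratio, advantage, value, return_, entropy,
                          ratio_clip=0.2, value_loss_coeff=1e-3,
                          entropy_loss_coeff=0.0):
        # Clipped surrogate objective (to be maximized, hence the minus sign).
        surr_1 = ratio * advantage
        surr_2 = torch.clamp(ratio, 1.0 - ratio_clip, 1.0 + ratio_clip) * advantage
        pi_loss = -torch.min(surr_1, surr_2).mean()

        # Value regression loss. With a shared MLP+LSTM trunk its gradients flow
        # through the same parameters as the policy, so its scale matters.
        value_loss = 0.5 * (value - return_).pow(2).mean()

        # Optional entropy bonus (zero in the configs above).
        entropy_bonus = entropy.mean()

        # A small value_loss_coeff (e.g. 1e-3) keeps large returns from swamping
        # the policy gradient through the shared layers.
        return pi_loss + value_loss_coeff * value_loss - entropy_loss_coeff * entropy_bonus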

Hope this helps someone 😄

1 reaction
jd730 commented, Mar 2, 2020

Hi, @astooke. Unfortunately, due to other work, I have not tried it yet. I'll close this issue, and if I have any updates, I will reopen it and share the results.

Thank you for your help.

@MasterScrat, you may add some lines for wandb in dump_tabular in logger.py.
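
A minimal sketch of that idea, assuming the key/value pairs collected by dump_tabular are available as a dict; the names tabular_dict, itr, and the project name are placeholders:

    import wandb

    wandb.init(project="rlpyt-lstm-ppo")  # placeholder project name

    def dump_tabular_to_wandb(tabular_dict, itr):
        # Mirror one iteration's diagnostics to wandb alongside the normal
        # console/CSV dump; tabular_dict maps metric names to values.
        wandb.log(tabular_dict, step=itr)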

Read more comments on GitHub >

Top Results From Across the Web

  • A PPO+LSTM Guide - Nikos Pitsillos
    We initially need to sample a batch of timesteps. We need to consider these as the starting points of the sequences which we...
  • PPO+LSTM Implementation : r/reinforcementlearning - Reddit
    Hello can someone point to a repository that contains a PPO+LSTM implementation along with an explanatory blog post or something of that ...
  • Proximal Policy Optimisation with PyTorch using Recurrent ...
    When capturing a trajectory for training a model it is easy to initialise the LSTM hidden state and cell state to zero. Then...
  • Recurrent PPO - Stable Baselines3 Contrib docs
    Recurrent policy class for actor-critic algorithms (has both policy and value prediction). To be used with A2C, PPO and the likes. It assumes...
  • Stale hidden states in PPO-LSTM - Kamal
    I've been using Proximal Policy Optimization (PPO, Schulman et al. 2017) to train agents to accomplish gridworld tasks. The neural net ...
