How can we train LSTM PPO?
Hi, I am trying to train LSTM PPO on Hopper-v3, but it does not learn well. Although an LSTM policy is harder to train than an FF policy, it seems there are several missing pieces needed to train it.
Could you give some advice on training LSTM PPO?
Thank you so much.
cf. my settings are as follows:
algo=dict(
    discount=0.99,
    learning_rate=3e-4,
    clip_grad_norm=1e6,
    entropy_loss_coeff=0.0,
    gae_lambda=0.95,
    minibatches=128,
    epochs=10,
    ratio_clip=0.2,
    normalize_advantage=True,
    linear_lr_schedule=True,
    bootstrap_timelimit=False,
),
sampler=dict(
    batch_B=16,
    max_decorrelation_steps=400,
),
I tried batch_T values of 2048, 256, 40, 32, and 16.
cf2. The LSTM agent's return stopped improving after a certain point.
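For context, here is a minimal sketch of how settings like these get wired together in rlpyt, following the layout of rlpyt's example scripts; the import paths, runner choice, and step counts below are assumptions, not details taken from this issue:

# Minimal sketch, assuming rlpyt's example-script layout (paths and step counts are assumptions).
from rlpyt.samplers.serial.sampler import SerialSampler
from rlpyt.envs.gym import make as gym_make
from rlpyt.algos.pg.ppo import PPO
from rlpyt.agents.pg.mujoco import MujocoLstmAgent
from rlpyt.runners.minibatch_rl import MinibatchRl

sampler = SerialSampler(
    EnvCls=gym_make,
    env_kwargs=dict(id="Hopper-v3"),
    batch_T=256,  # one of the horizon lengths tried above
    batch_B=16,
    max_decorrelation_steps=400,
)
algo = PPO(
    discount=0.99,
    learning_rate=3e-4,
    clip_grad_norm=1e6,
    entropy_loss_coeff=0.0,
    gae_lambda=0.95,
    minibatches=128,
    epochs=10,
    ratio_clip=0.2,
    normalize_advantage=True,
    linear_lr_schedule=True,
    bootstrap_timelimit=False,
)
agent = MujocoLstmAgent()
runner = MinibatchRl(
    algo=algo,
    agent=agent,
    sampler=sampler,
    n_steps=int(1e6),            # illustrative values, not from the issue
    log_interval_steps=int(1e4),
    affinity=dict(cuda_idx=None),
)
runner.train()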
Issue Analytics
- State: Closed
- Created: 4 years ago
- Comments: 10 (3 by maintainers)
Top Results From Across the Web
- A PPO+LSTM Guide - Nikos Pitsillos: "We initially need to sample a batch of timesteps. We need to consider these as the starting points of the sequences which we..."
- PPO+LSTM Implementation : r/reinforcementlearning - Reddit: "Hello, can someone point to a repository that contains a PPO+LSTM implementation along with an explanatory blog post or something of that..."
- Proximal Policy Optimisation with PyTorch using Recurrent ...: "When capturing a trajectory for training a model it is easy to initialise the LSTM hidden state and cell state to zero. Then..." (see the hidden-state sketch after this list)
- Recurrent PPO - Stable Baselines3 Contrib docs: "Recurrent policy class for actor-critic algorithms (has both policy and value prediction). To be used with A2C, PPO and the likes. It assumes..."
- Stale hidden states in PPO-LSTM - Kamal: "I've been using Proximal Policy Optimization (PPO, Schulman et al. 2017) to train agents to accomplish gridworld tasks. The neural net..."
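As a concrete illustration of the hidden-state handling mentioned in the third result above, here is a minimal PyTorch sketch (not taken from any of the linked posts; all class and function names are illustrative) of zero-initialising the LSTM state at the start of each episode and resetting it at episode boundaries while collecting a rollout:

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # Minimal actor-critic with an LSTM core (illustrative, not rlpyt's MujocoLstmAgent).
    def __init__(self, obs_dim, act_dim, hidden_size=128):
        super().__init__()
        self.lstm = nn.LSTMCell(obs_dim, hidden_size)
        self.pi = nn.Linear(hidden_size, act_dim)  # action mean for continuous control
        self.v = nn.Linear(hidden_size, 1)         # state-value head

    def initial_state(self, batch_size):
        # Zero hidden and cell state at the start of every episode.
        h = torch.zeros(batch_size, self.lstm.hidden_size)
        c = torch.zeros(batch_size, self.lstm.hidden_size)
        return h, c

    def forward(self, obs, state):
        h, c = self.lstm(obs, state)
        return self.pi(h), self.v(h), (h, c)

def collect_rollout(env, policy, horizon):
    # Collect `horizon` steps, resetting the LSTM state at every episode boundary
    # so hidden states never leak across episodes. (Action sampling and transition
    # storage are omitted for brevity; the mean action is used directly.)
    obs = torch.as_tensor(env.reset(), dtype=torch.float32).unsqueeze(0)
    state = policy.initial_state(batch_size=1)
    for _ in range(horizon):
        with torch.no_grad():
            action, value, state = policy(obs, state)
        next_obs, reward, done, info = env.step(action.squeeze(0).numpy())
        if done:
            next_obs = env.reset()
            state = policy.initial_state(batch_size=1)  # reset at episode boundary
        obs = torch.as_tensor(next_obs, dtype=torch.float32).unsqueeze(0)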
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
TLDR: I think I found a solution: set value_loss_coeff=1e-3 and batch_B=minibatches.

This issue is now closed, but it seems that no solution was found. I was recently training a MujocoLstmAgent on a custom robot environment and initially had the same problem of the agent not being able to learn. After a quick inspection, I saw that the actor and the critic are implemented in a single network; printing the model with print(agent.model) confirms this. According to a similar issue (ray-project/ray#5278), the problem comes from the fact that the value function loss is huge compared to the policy loss. The solution is to set the value_loss_coeff parameter to a low value (I used 1e-3) to balance the two losses. Below is the full configuration I used (I normally set batch_B=minibatches). Hope this helps someone 😄
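The commenter's full configuration was not captured on this page. As a rough sketch only, here is what the suggested changes look like when applied to the settings from the question (value_loss_coeff and the batch_B=minibatches pairing come from the comment above; matching them at 16 is one way to apply that advice, everything else is copied from the original post):

# Sketch only: the original poster's settings with the suggested changes applied.
algo = dict(
    discount=0.99,
    learning_rate=3e-4,
    clip_grad_norm=1e6,
    entropy_loss_coeff=0.0,
    value_loss_coeff=1e-3,  # suggested fix: shrink the value loss so it does not swamp the policy loss
    gae_lambda=0.95,
    minibatches=16,         # batch_B == minibatches, as suggested in the comment above
    epochs=10,
    ratio_clip=0.2,
    normalize_advantage=True,
    linear_lr_schedule=True,
    bootstrap_timelimit=False,
)
sampler = dict(
    batch_B=16,
    max_decorrelation_steps=400,
)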
Hi, @astooke. Unfortunately, due to other work, I have not tried it yet. I'll close this issue, and if I have any updates, I will reopen it and share the results.
Thank you for your help.
@MasterScrat, you may add some lines for wandb in dump_tabular in logger.py.
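A rough sketch of what that could look like; the body of dump_tabular below is illustrative and not rlpyt's actual logger.py, it only assumes an rllab-style logger that accumulates (key, value) pairs and flushes them once per iteration, and that wandb.init() was called at startup:

# Illustrative sketch only -- not rlpyt's real logger internals.
import wandb

_tabular = []  # (key, value) pairs accumulated during the current iteration

def record_tabular(key, value):
    _tabular.append((key, value))

def dump_tabular():
    metrics = dict(_tabular)
    # ... existing behaviour: write metrics to the console / progress csv ...
    wandb.log(metrics)  # added line: mirror the same tabular values to Weights & Biases
    del _tabular[:]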