[question] Questions about MlpLstmPolicy
I successfully implemented PPO2 with MlpPolicy on two different custom environments I built. Now I want to extend to MlpLstmPolicy in one of my games.
I tried to understand MlpLstmPolicy by reading the source code, but it's a bit involved, so I have several questions:
- If successfully implemented, does the LSTM memorize only the steps taken within the current game? Or does it also memorize the steps it took in previous games (before resetting)?
As a follow-up, if the answer to the second question is no, is there any way to achieve this? Concretely, I want my agent to come up with paths that are vastly different from those of previous games (quantitatively measured by correlation). Implementing curiosity might seem to help, but it does not directly learn to find paths distinct from previous games. (A rough sketch of what I mean is in the first snippet below this list.)
- What role does the variable `nminibatches` play in training? Does it only affect the training speed?
- I tried replacing MlpPolicy with MlpLstmPolicy in my game directly, without changing anything else, and the learning is much worse: even after many more learning steps, the reward is far below what was learned with MlpPolicy. Are there general tips for using MlpLstmPolicy, or necessary modifications when switching from MlpPolicy to MlpLstmPolicy? (The second snippet below shows roughly what I changed.)
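To make the diversity idea concrete, here is a rough, hypothetical sketch of the kind of bonus I have in mind (the wrapper and the way the correlation enters the reward are my own invention; it assumes the old gym 4-tuple step API and observations that describe where the agent is on the path):

```python
import gym
import numpy as np


class PathDiversityBonus(gym.Wrapper):
    """Hypothetical wrapper: at the end of an episode, add a small bonus when
    the current path is weakly correlated with the previous episode's path."""

    def __init__(self, env, bonus_scale=0.1):
        super().__init__(env)
        self.bonus_scale = bonus_scale
        self.prev_path = None
        self.path = []

    def reset(self, **kwargs):
        if self.path:
            self.prev_path = np.asarray(self.path)
        self.path = []
        obs = self.env.reset(**kwargs)
        self.path.append(np.ravel(obs))
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.path.append(np.ravel(obs))
        if done and self.prev_path is not None:
            # compare the two paths over their common length
            n = min(len(self.path), len(self.prev_path))
            cur = np.asarray(self.path)[:n].ravel()
            prev = self.prev_path[:n].ravel()
            corr = np.nan_to_num(np.corrcoef(cur, prev)[0, 1])
            # reward low (absolute) correlation with the previous episode's path
            reward += self.bonus_scale * (1.0 - abs(corr))
        return obs, reward, done, info
```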
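And for reference, this is roughly what my switch to MlpLstmPolicy looked like (a minimal sketch with CartPole standing in for my custom env; if I read the PPO2 code correctly, recurrent policies need the number of parallel environments to be a multiple of `nminibatches`, and the LSTM state has to be passed around at prediction time):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

n_envs = 4  # recurrent policies keep one LSTM state per parallel environment
env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(n_envs)])

# n_envs should be divisible by nminibatches for recurrent policies
model = PPO2(MlpLstmPolicy, env, nminibatches=4, verbose=1)
model.learn(total_timesteps=100000)

# at prediction time the LSTM state and the episode-start mask are threaded through
obs = env.reset()
state, dones = None, [False] * n_envs
for _ in range(1000):
    actions, state = model.predict(obs, state=state, mask=dones)
    obs, rewards, dones, infos = env.step(actions)
```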
Thanks a million in advance!
Top GitHub Comments
`nminibatches` specifies the number of minibatches to use when updating the policy on gathered samples. E.g. if you have 1000 samples gathered in total and `nminibatches=4`, it will split the samples into four minibatches of 250 elements and do parameter updates on these batches `noptepochs` times.

If you feel something in the docs was not clear on these questions, please point them out so we can fix these 😃
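A small illustrative example of how those parameters fit together (the numbers are arbitrary, not a recommendation):

```python
import gym
from stable_baselines import PPO2

env = gym.make("CartPole-v1")  # any environment works for the illustration

model = PPO2(
    "MlpPolicy",
    env,
    n_steps=1000,    # 1 env * 1000 steps = 1000 samples gathered per update
    nminibatches=4,  # those 1000 samples are split into 4 minibatches of 250
    noptepochs=4,    # each update runs 4 epochs over these minibatches
)
model.learn(total_timesteps=10000)
```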
@Miffyli We have a test for that 😉 https://github.com/hill-a/stable-baselines/blob/master/tests/test_lstm_policy.py#L43 (see PR https://github.com/hill-a/stable-baselines/pull/244)