
[question] Questions about MlpLstmPolicy

See original GitHub issue

I successfully implemented PPO2 with MlpPolicy with two different custom environments I built. Now I want to extend to MlpLstmPolicy in one of my games.

I tried to understand the MlpLstmPolicy by reading the source code but it’s a bit involved. So several questions:

  1. If successfully implemented, does the LSTM memorize only the steps taken within the current game? Or does it also memorize the steps it took in previous games (before resetting)?

Follow-up question on this: if the answer to the second part is no, is there any way to achieve this? Concretely, I want my agent to come up with paths that are vastly different from those in previous games (quantitatively measured by correlation). Implementing curiosity might seem to help, but it does not directly learn to find paths distinct from previous games.

  2. What role does the variable nminibatches play in training? Does it only affect the training speed?

  3. I tried replacing MlpPolicy with MlpLstmPolicy in my game directly, without changing anything else, and the learning is much worse - even after many more learning steps, the reward is far lower than that learned with MlpPolicy. Are there general tips for using MlpLstmPolicy, or necessary modifications when switching from MlpPolicy to MlpLstmPolicy? (A sketch of such a switch follows below.)

Thanks a million in advance!
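For reference, here is a minimal sketch of the switch described in question 3, assuming stable-baselines 2.x (the TF1-based library this issue is about); MyCustomEnv and all hyperparameters are placeholders rather than values from the issue. One documented caveat for recurrent policies is that the number of parallel environments should be a multiple of nminibatches.

```python
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# MyCustomEnv is a stand-in for the asker's custom gym.Env
n_envs = 4  # for recurrent policies, keep n_envs a multiple of nminibatches
env = DummyVecEnv([lambda: MyCustomEnv() for _ in range(n_envs)])

model = PPO2("MlpLstmPolicy", env,
             n_steps=128,
             nminibatches=4,   # 4 parallel envs -> 4 minibatches is valid
             noptepochs=4,
             verbose=1)
model.learn(total_timesteps=1000000)

# At prediction time a recurrent policy also needs its LSTM state and an
# episode-start mask fed back in on every call.
obs = env.reset()
state, done = None, [False] * n_envs
for _ in range(1000):
    action, state = model.predict(obs, state=state, mask=done)
    obs, reward, done, info = env.step(action)
```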

Issue Analytics

  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 14

Top GitHub Comments

4 reactions
Miffyli commented, Jan 8, 2020
  1. The LSTM only memorizes the past within the single game; it does not remember things outside that episode.
  2. nminibatches specifies the number of minibatches to use when updating the policy on the gathered samples. E.g. if you have 1000 samples gathered in total and nminibatches=4, it will split the samples into four minibatches of 250 elements and do parameter updates on these batches noptepochs times (see the arithmetic sketch below).
  3. LSTMs are generally harder to train than non-recurrent networks (more parameters, gradients depend on multiple timesteps, etc.), and the implementation here is probably not one of the best (see e.g. the R2D2 paper for research on this). I would run it at least 5x longer than the non-recurrent version to see when/if the learning starts to happen later.
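To make the split in point 2 concrete, a small arithmetic sketch (the sample counts are assumed example values, not taken from the issue):

```python
# PPO2 gathers n_steps * n_envs samples per rollout, then splits them into
# nminibatches minibatches and sweeps over them noptepochs times.
n_steps, n_envs = 250, 4         # assumed example values
nminibatches, noptepochs = 4, 4  # stable-baselines PPO2 defaults

batch_size = n_steps * n_envs                  # 1000 samples per update
minibatch_size = batch_size // nminibatches    # 250 samples per minibatch
gradient_steps = nminibatches * noptepochs     # 16 parameter updates per rollout
print(minibatch_size, gradient_steps)          # -> 250 16
```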

If you feel something in the docs was not clear on these questions, please point it out so we can fix it 😃

1 reaction
araffin commented, Jan 10, 2020

The question was about how to test whether the current LSTM implementation works correctly, and so far there was trouble solving a simple recall environment.

@Miffyli

We have a test for that 😉 https://github.com/hill-a/stable-baselines/blob/master/tests/test_lstm_policy.py#L43 (see PR https://github.com/hill-a/stable-baselines/pull/244)
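For anyone wanting to run a similar check themselves, a hypothetical recall-style environment (sketched here for illustration; it is not the test from the linked file) rewards the agent only if it reproduces, at the last step, a value it saw on the first step, which a memoryless policy cannot do better than chance:

```python
import gym
import numpy as np
from gym import spaces


class RecallEnv(gym.Env):
    """Show a binary target on the first step, blank observations afterwards;
    the agent is rewarded at the end only for recalling that target."""

    def __init__(self, episode_length=5):
        super().__init__()
        self.episode_length = episode_length
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)

    def reset(self):
        self.t = 0
        self.target = np.random.randint(2)
        # the target is only visible in this first observation
        return np.array([float(self.target)], dtype=np.float32)

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        # reward only at the final step, and only if the agent recalled the target
        reward = float(action == self.target) if done else 0.0
        # later observations are blank, so solving this requires memory
        return np.zeros(1, dtype=np.float32), reward, done, {}
```

Training MlpLstmPolicy on an environment like this and comparing against MlpPolicy gives a quick signal of whether the recurrent policy is actually making use of its memory.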


Top Results From Across the Web

LSTM based policy in stable baselines3 model - Stack Overflow
I want to use a policy network with an LSTM layer in it. However, I can't find such a possibility on the library's...
Policy Networks — Stable Baselines 2.10.3a0 documentation
MlpLstmPolicy, Policy object that implements actor critic, using LSTMs with a MLP feature extraction. MlpLnLstmPolicy, Policy object that implements actor ...
What is a high performing network architecture to use in a ...
Artificial Intelligence Stack Exchange is a question and answer site for people interested in conceptual questions about life and challenges ...
Shape of observation space for LSTM policies in OpenAI and ...
This is a simple question and yet one, for which I did not really find a straight forward answer. Suppose, I want to...
Reinforcement Learning Study Group Report – February 2021
entering observations into the policy networks raised questions regarding ... We modify the baseline experiments to use an MlpLstmPolicy instead of.
