[question] Questions about MlpLstmPolicy
I successfully implemented PPO2 with MlpPolicy on two different custom environments I built. Now I want to extend to MlpLstmPolicy in one of my games.
I tried to understand MlpLstmPolicy by reading the source code, but it's a bit involved, so I have several questions:
- If successfully implemented, does the LSTM memorize only the steps taken within the current game? Or does it also memorize the steps it took in previous games (before resetting)?
As a follow-up, if the answer to the second question is no, is there any way to achieve this? Concretely, I want my agent to come up with paths that are vastly different from those of previous games (quantitatively measured by correlation). Implementing curiosity might seem to help, but it does not directly learn to find paths distinct from previous games. (A rough sketch of what I mean is in the first snippet below this list.)
- What role does the variable `nminibatches` play in training? Does it only affect the training speed?
- I tried replacing MlpPolicy with MlpLstmPolicy in my game directly, without changing anything else, and the learning is much worse: even after many more learning steps, the reward is far below what was learned with MlpPolicy. Are there general tips for using MlpLstmPolicy, or necessary modifications when switching from MlpPolicy to MlpLstmPolicy? (The second snippet below shows roughly what I changed.)
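To make the diversity idea concrete, here is a rough, hypothetical sketch of the kind of bonus I have in mind (the wrapper and the way the correlation enters the reward are my own invention; it assumes the old gym 4-tuple step API and observations that describe where the agent is on the path):

```python
import gym
import numpy as np


class PathDiversityBonus(gym.Wrapper):
    """Hypothetical wrapper: at the end of an episode, add a small bonus when
    the current path is weakly correlated with the previous episode's path."""

    def __init__(self, env, bonus_scale=0.1):
        super().__init__(env)
        self.bonus_scale = bonus_scale
        self.prev_path = None
        self.path = []

    def reset(self, **kwargs):
        if self.path:
            self.prev_path = np.asarray(self.path)
        self.path = []
        obs = self.env.reset(**kwargs)
        self.path.append(np.ravel(obs))
        return obs

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.path.append(np.ravel(obs))
        if done and self.prev_path is not None:
            # compare the two paths over their common length
            n = min(len(self.path), len(self.prev_path))
            cur = np.asarray(self.path)[:n].ravel()
            prev = self.prev_path[:n].ravel()
            corr = np.nan_to_num(np.corrcoef(cur, prev)[0, 1])
            # reward low (absolute) correlation with the previous episode's path
            reward += self.bonus_scale * (1.0 - abs(corr))
        return obs, reward, done, info
```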
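And for reference, this is roughly what my switch to MlpLstmPolicy looked like (a minimal sketch with CartPole standing in for my custom env; if I read the PPO2 code correctly, recurrent policies need the number of parallel environments to be a multiple of `nminibatches`, and the LSTM state has to be passed around at prediction time):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

n_envs = 4  # recurrent policies keep one LSTM state per parallel environment
env = DummyVecEnv([lambda: gym.make("CartPole-v1") for _ in range(n_envs)])

# n_envs should be divisible by nminibatches for recurrent policies
model = PPO2(MlpLstmPolicy, env, nminibatches=4, verbose=1)
model.learn(total_timesteps=100000)

# at prediction time the LSTM state and the episode-start mask are threaded through
obs = env.reset()
state, dones = None, [False] * n_envs
for _ in range(1000):
    actions, state = model.predict(obs, state=state, mask=dones)
    obs, rewards, dones, infos = env.step(actions)
```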
Thanks a million in advance!
Top GitHub Comments
`nminibatches` specifies the number of minibatches to use when updating the policy on gathered samples. E.g. if you have 1000 samples gathered in total and `nminibatches=4`, it will split the samples into four minibatches of 250 elements and do parameter updates on these batches `noptepochs` times.

If you feel something in the docs was not clear on these questions, please point them out so we can fix these 😃
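A small illustrative example of how those parameters fit together (the numbers are arbitrary, not a recommendation):

```python
import gym
from stable_baselines import PPO2

env = gym.make("CartPole-v1")  # any environment works for the illustration

model = PPO2(
    "MlpPolicy",
    env,
    n_steps=1000,    # 1 env * 1000 steps = 1000 samples gathered per update
    nminibatches=4,  # those 1000 samples are split into 4 minibatches of 250
    noptepochs=4,    # each update runs 4 epochs over these minibatches
)
model.learn(total_timesteps=10000)
```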
@Miffyli We have a test for that 😉 https://github.com/hill-a/stable-baselines/blob/master/tests/test_lstm_policy.py#L43 (see PR https://github.com/hill-a/stable-baselines/pull/244)