[Question] Different results using MultiInputPolicy and MlpPolicy with the same observation data
I am able to train my custom gym environment with good results using MlpPolicy, but when I change the policy to MultiInputPolicy and insert my vector observation array into a single-element dictionary, I get completely different results. From the tests it looks like I should get the same results, as they are both using the FlattenExtractor. This is with the PPO algorithm.
Is there any guidance on what to check to see why the results are different? Thank you.
The changes made were:
Change the space to a dict:
self._observation_space = spaces.Box(low=-np.ones(self.num_obs) * np.inf, high=np.ones(self.num_obs) * np.inf, dtype=np.float32)
to
self._observation_space = spaces.Dict(
    spaces={
        "vec": spaces.Box(low=-np.ones(self.num_obs) * np.inf, high=np.ones(self.num_obs) * np.inf, dtype=np.float32)
    }
)
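(As a side note, not from the issue: one way to confirm the premise that both setups feed the same flat vector to the network is to compare the flattened dimensionality of the two spaces. The num_obs value below is a placeholder, and gym's spaces are used as in the issue; newer SB3 releases use gymnasium.)
```python
# Sketch: compare the flat dimensionality of the Box space and the Dict space.
# num_obs is a placeholder; the issue's env derives it from its own config.
import numpy as np
from gym import spaces  # newer SB3 versions use gymnasium.spaces instead

num_obs = 8  # hypothetical observation size

box_space = spaces.Box(low=-np.ones(num_obs) * np.inf,
                       high=np.ones(num_obs) * np.inf, dtype=np.float32)
dict_space = spaces.Dict(spaces={"vec": box_space})

# Both flatten to the same number of features, so the networks built on top
# of them should see inputs of the same size.
print(spaces.flatdim(box_space), spaces.flatdim(dict_space))  # e.g. 8 8
```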
Change the observation to a dict:
self._observation = np.zeros((self.num_envs, self.num_obs), dtype=np.float32)
to
self._observation = {"vec": np.zeros((self.num_envs, self.num_obs), dtype=np.float32)}
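(Again as a sketch, not from the issue: a quick sanity check that a single environment's dict observation actually lives in the Dict space. num_obs is a placeholder; note that the env above appears vectorized, so self._observation holds a whole batch of shape (num_envs, num_obs), while the space describes one environment's observation.)
```python
# Sketch: verify that a per-environment observation matches the Dict space.
import numpy as np
from gym import spaces

num_obs = 8  # hypothetical observation size
dict_space = spaces.Dict(
    spaces={"vec": spaces.Box(low=-np.inf, high=np.inf, shape=(num_obs,), dtype=np.float32)}
)

single_obs = {"vec": np.zeros(num_obs, dtype=np.float32)}
print(dict_space.contains(single_obs))  # True if the observation is well-formed
```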
Change the policy args from MlpPolicy to MultiInputPolicy:
model = PPO('MlpPolicy', env, verbose=2, tensorboard_log=saver.data_dir)
to
model = PPO('MultiInputPolicy', env, verbose=2, tensorboard_log=saver.data_dir)
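(Since the checklist below mentions the env checker, here is a runnable toy stand-in, not the original environment, showing how a Dict-observation env can be passed through SB3's check_env. This assumes the gym-based SB3 1.x API that was current when the issue was opened; SB3 2.x uses gymnasium's reset/step signatures.)
```python
# Toy stand-in for the custom environment (the real env is not shown in the
# issue); it only demonstrates running SB3's env checker on a Dict space.
import gym
import numpy as np
from gym import spaces
from stable_baselines3.common.env_checker import check_env


class ToyDictEnv(gym.Env):
    def __init__(self, num_obs=8):  # num_obs is a placeholder value
        super().__init__()
        self.num_obs = num_obs
        self.observation_space = spaces.Dict(
            spaces={
                "vec": spaces.Box(low=-np.inf, high=np.inf, shape=(num_obs,), dtype=np.float32)
            }
        )
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

    def reset(self):
        return {"vec": np.zeros(self.num_obs, dtype=np.float32)}

    def step(self, action):
        obs = {"vec": np.zeros(self.num_obs, dtype=np.float32)}
        reward, done, info = 0.0, True, {}
        return obs, reward, done, info


check_env(ToyDictEnv())  # warns or raises if the Dict observation is malformed
```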
### Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
- I have checked my env using the env checker (required)
- I have provided a minimal working example to reproduce the bug (required)
Issue Analytics
- Created 2 years ago
- Comments: 8 (4 by maintainers)
Top Results From Across the Web
- TD3 — Stable Baselines3 1.7.0a8 documentation: Policy class (with both actor and critic) for TD3. MultiInputPolicy. Policy class (with both actor and critic) for TD3 to be used with...
- Stablebaselines MultiInputpolicies - openai gym: But, I get an error: KeyError: "Error: unknown policy type MultiInputPolicy, the only registed policy type are: ['MlpPolicy', 'CnnPolicy']!".
- Add the Bootstrapped Dual Policy Iteration algorithm for ...: The main reason I propose to have BDPI in stable-baselines3-contrib is that it is quite different from other algorithms, as it heavily focuses...
- Training RL agents in stable-baselines3 is easy: Setting the policy to "MlpPolicy" means, that we are giving a state vector as input to our model. There are only 2 other...
- Stable-Baselines3: Reliable Reinforcement Learning ...: To help with this problem, we present Stable-Baselines3 (SB3), ... Stable-Baselines3 keeps the same easy-to-use API while improving a lot on ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hey @jhurlbut, good question.
I wrote out the below to verify that the difference between the initial policies would be the features_extractor module. Hopefully this code represents your use case; I ignored tensorboard, as it shouldn't change the policy modules (but I haven't double-checked that).
Did you run both policies with multiple seeds? My first guess is that it's the variability in the model's training itself. Could you try multiple seeds and see whether multiple runs have similar performance?
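(The snippet referred to above is not preserved in this page. The following is a rough reconstruction of such a check, not the maintainer's original code; it uses CartPole wrapped in a one-key Dict as a stand-in for the custom environment and assumes the gym-based SB3 1.x API.)
```python
# Rough reconstruction (not the original snippet): build both policies with the
# same seed and inspect which features extractor each one uses.
# CartPole stands in for the custom environment; the Dict wrapping is hypothetical.
import gym
from gym import spaces
from stable_baselines3 import PPO


class DictObsWrapper(gym.ObservationWrapper):
    """Wrap a Box-observation env so observations become {"vec": ...}."""

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = spaces.Dict(spaces={"vec": env.observation_space})

    def observation(self, obs):
        return {"vec": obs}


box_env = gym.make("CartPole-v1")
dict_env = DictObsWrapper(gym.make("CartPole-v1"))

mlp_model = PPO("MlpPolicy", box_env, seed=0)
multi_model = PPO("MultiInputPolicy", dict_env, seed=0)

# MlpPolicy builds a FlattenExtractor, MultiInputPolicy a CombinedExtractor
# (which flattens each key and concatenates), so the inputs to the MLP match.
print(type(mlp_model.policy.features_extractor).__name__)
print(type(multi_model.policy.features_extractor).__name__)

# Neither extractor has trainable parameters here, so parameter counts should match.
print(sum(p.numel() for p in mlp_model.policy.parameters()))
print(sum(p.numel() for p in multi_model.policy.parameters()))
```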
Otherwise, my next guess is that the issue could be in the environment wrappers, if you're using them.
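(For the multiple-seeds suggestion above, a possible sketch, again with CartPole standing in for the custom environment; seeds and timestep counts are arbitrary.)
```python
# Sketch: train the same setup with several seeds and compare final performance,
# to see how much run-to-run variability there is before blaming the policy type.
import gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

for seed in (0, 1, 2):
    env = gym.make("CartPole-v1")
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=20_000)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"seed={seed}: {mean_reward:.1f} +/- {std_reward:.1f}")
```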
The issue is now solved then 😉