
evaluate_policy() crashes with PPO2 policies trained on vectorized environments [bug]


Describe the bug

After training PPO2 in a vectorized environment with an MlpLstmPolicy, evaluate_policy() disallows evaluation with vectorized environments via an assert, but then crashes when evaluated with a non-vectorized environment. As far as I can tell, this means evaluate_policy is incompatible with PPO2 policies trained in vectorized environments. I think it is reasonable to consider this a bug, since Stable Baselines fails with a TensorFlow crash rather than with an assert of its own.

If possible, I can try fixing the crash, but it would probably be faster for someone with a better understanding of the recurrent policy implementation to determine whether this is something that should be fixed, or whether it should be patched with an assert statement that disallows using any PPO2 policy trained in a vectorized environment with evaluate_policy (a sketch of such a guard follows).
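
For reference, a minimal sketch of the kind of guard I mean is below. This is not the actual evaluate_policy source; the use of initial_state to detect a recurrent policy and of n_envs for the training batch size are assumptions on my part.

# Hypothetical guard that evaluate_policy() could run before its rollout loop (sketch only).
def check_recurrent_eval_env(model):
    # PPO2 recurrent policies expose a non-None initial_state (assumption).
    is_recurrent = getattr(model, 'initial_state', None) is not None
    n_train_envs = getattr(model, 'n_envs', 1)
    if is_recurrent and n_train_envs > 1:
        raise AssertionError(
            'Recurrent policies trained on {} environments cannot be evaluated '
            'on a single environment: the policy expects a batch of {} '
            'observations.'.format(n_train_envs, n_train_envs))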

Code example

from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.common import make_vec_env

import gym
from stable_baselines import PPO2

# Train PPO2 with a recurrent policy on 12 parallel environments.
env = make_vec_env('CartPole-v1', n_envs=12)
# Evaluate on a plain, non-vectorized environment.
eval_env = gym.make('CartPole-v1')

model = PPO2('MlpLstmPolicy', env, nminibatches=1, verbose=1)
model.learn(10000)

# Crashes: the recurrent policy expects a batch of 12 observations.
(mean, std) = evaluate_policy(model, eval_env, n_eval_episodes=10)
# This variant is instead rejected by evaluate_policy's assert on vectorized envs.
# (mean, std) = evaluate_policy(model, env, n_eval_episodes=10)


Output

Traceback (most recent call last):
  File "./minimal_example.py", line 15, in <module>
    (mean, std) = evaluate_policy(model,eval_env, n_eval_episodes = 10)
  File "/home/john/.local/lib/python3.6/site-packages/stable_baselines/common/evaluation.py", line 38, in evaluate_policy
    action, state = model.predict(obs, state=state, deterministic=deterministic)
  File "/home/john/.local/lib/python3.6/site-packages/stable_baselines/common/base_class.py", line 819, in predict
    actions, _, states, _ = self.step(observation, state, mask, deterministic=deterministic)
  File "/home/john/.local/lib/python3.6/site-packages/stable_baselines/common/policies.py", line 505, in step
    {self.obs_ph: obs, self.states_ph: state, self.dones_ph: mask})
  File "/home/john/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/john/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1149, in _run
    str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (1, 4) for Tensor 'input/Ob:0', which has shape '(12, 4)'
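
The ValueError points at the root cause: the recurrent policy's observation placeholder was built for the 12 training environments, so a single observation of shape (1, 4) cannot be fed. As an illustration only (this helper does not exist in Stable Baselines, and the tiling trick is an assumed workaround, not the project's fix), evaluation can be done manually by padding each observation up to the training batch size and using only the first action:

import numpy as np

def evaluate_recurrent(model, eval_env, n_eval_episodes=10, n_train_envs=12):
    # Hypothetical workaround sketch: tile observations to the training batch size.
    episode_rewards = []
    for _ in range(n_eval_episodes):
        obs = eval_env.reset()
        state = None
        # The dones mask fed to the policy must also have the training batch size.
        dones = np.zeros((n_train_envs,), dtype=bool)
        done, total_reward = False, 0.0
        while not done:
            # Duplicate the single observation into the (12, 4) batch the policy expects.
            batch_obs = np.tile(obs, (n_train_envs, 1))
            actions, state = model.predict(batch_obs, state=state, mask=dones)
            obs, reward, done, _ = eval_env.step(actions[0])
            total_reward += reward
        episode_rewards.append(total_reward)
    return np.mean(episode_rewards), np.std(episode_rewards)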


System Info

  • OS: Ubuntu 18.04
  • Python: 3.6.9

Additional context

Top GitHub Comments

balisujohn commented on Oct 3, 2021 (1 reaction)

Yep, looked like that fixed it! Yeah I’ll add a PR with an assertion for that. Thanks for your quick responses; you saved me a lot of time.

balisujohn commented on Oct 3, 2021 (0 reactions)

(leaving this open until the pull request is ready)
