[question] evaluate_policy overestimates trained agent?
In an example, I was using PPO2 to solve CartPole.
I am using evaluate_policy
to evaluate my trained policy before visualizing it and noticed that the mean reward calculated by evaluate_policy
is consistently and considerably higher than what the same trained agent achieves when visualizing it.
See the example here:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.bench import Monitor

# repeat 3 times to validate
for rep in range(3):
    print(f"\nRepetition {rep}")
    env = gym.make('CartPole-v1')
    model = PPO2(MlpPolicy, Monitor(env, filename=f'logs/CartPole-v1/PPO2/'), verbose=0).learn(10000)

    # evaluate
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"Eval reward: {mean_reward} (+/-{std_reward})")

    # test and visualize
    obs = env.reset()
    for i in range(500):
        action, _states = model.predict(obs)
        obs, rewards, done, info = env.step(action)
        if done:
            print(f"Failed after {i} steps.")
            break
        # env.render()
Running this produced:
Repetition 0
Eval reward: 500.0 (+/-0.0)
Failed after 232 steps.
Repetition 1
Eval reward: 378.6 (+/-122.41829928568686)
Failed after 152 steps.
Repetition 2
Eval reward: 457.0 (+/-60.59372904847498)
Failed after 27 steps.
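The direction of the gap (evaluation always higher, never lower) is what a deterministic-vs-stochastic mismatch would produce. A minimal, library-free sketch with made-up toy numbers (not CartPole or stable-baselines itself): under the very same action probabilities, greedy action selection can keep an episode alive far longer than sampling.

```python
import random

random.seed(0)

# Toy stand-in for a trained policy: in every state it assigns
# probability 0.99 to the "correct" balancing action (index 0).
ACTION_PROBS = [0.99, 0.01]

def run_episode(deterministic, max_steps=500):
    """Episode ends the first time a 'wrong' action (index 1) is taken."""
    for step in range(max_steps):
        if deterministic:
            # always pick the most probable action (argmax / mode)
            action = ACTION_PROBS.index(max(ACTION_PROBS))
        else:
            # sample an action according to the policy's probabilities
            action = random.choices([0, 1], weights=ACTION_PROBS)[0]
        if action != 0:
            return step
    return max_steps

greedy = [run_episode(deterministic=True) for _ in range(10)]
sampled = [run_episode(deterministic=False) for _ in range(10)]
print("greedy mean steps: ", sum(greedy) / len(greedy))
print("sampled mean steps:", sum(sampled) / len(sampled))
```

The greedy rollouts always survive the full 500 steps, while the sampled ones tend to fail well before that, even though both use the identical policy.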
As you can see, there is a huge gap between the evaluation reward and the achieved reward when manually testing it afterwards. Of course, I don’t expect it to be equal to the evaluation mean reward. But I repeated this multiple times, and the evaluation reward is always much higher (never lower).
Did I miss something in the documentation? Or what’s the reason for this big gap?
I’m using Python 3.6, stable-baselines 2.10.0, Windows 10.
Issue Analytics
- Created 3 years ago
- Comments: 7
Top GitHub Comments
@Miffyli it seems he is using CartPole, so discrete actions (0 or 1).
EDIT: in that case the probability distribution is a Categorical one.
For continuous actions, it returns the mean of the Gaussian from which actions are sampled in the stochastic case. I.e. rather than sampling actions, and possibly getting different actions for the same observation, we always take the same action for the same observation.
In your case, it is probably sampling actions outside the [0, 1] interval, which are then clipped to [0, 1] to work with the environment, hence you see actions like that even with deterministic=False.
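To make the Categorical point concrete, here is a hypothetical sketch (not stable-baselines internals): a deterministic policy always returns the distribution's mode, while a stochastic one samples from it, so repeated queries at the same observation can return different actions.

```python
import random

random.seed(42)

# hypothetical Categorical distribution over 2 discrete actions
probs = [0.7, 0.3]

def predict(deterministic):
    if deterministic:
        # mode of the Categorical: the index with the highest probability
        return max(range(len(probs)), key=lambda a: probs[a])
    # stochastic case: sample an action index according to probs
    return random.choices(range(len(probs)), weights=probs)[0]

det = {predict(True) for _ in range(20)}   # only ever the mode
sto = {predict(False) for _ in range(20)}  # typically both actions appear
print(det)
print(sto)
```

Evaluating with the mode every step and then testing with sampled actions (or vice versa) means the two runs are effectively executing different policies, which is consistent with the gap reported above.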