[question] evaluate_policy overestimates trained agent?
In an example, I was using PPO2 to solve CartPole.
I am using evaluate_policy
to evaluate my trained policy before visualizing it and noticed that the mean reward calculated by evaluate_policy
is consistently and considerably higher than what the same trained agent achieves when visualizing it.
See the example here:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.bench import Monitor

# repeat 3 times to validate
for rep in range(3):
    print(f"\nRepetition {rep}")
    env = gym.make('CartPole-v1')
    model = PPO2(MlpPolicy, Monitor(env, filename=f'logs/CartPole-v1/PPO2/'), verbose=0).learn(10000)

    # evaluate
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"Eval reward: {mean_reward} (+/-{std_reward})")

    # test and visualize
    obs = env.reset()
    for i in range(500):
        action, _states = model.predict(obs)
        obs, rewards, done, info = env.step(action)
        if done:
            print(f"Failed after {i} steps.")
            break
        # env.render()
Running this produced:
Repetition 0
Eval reward: 500.0 (+/-0.0)
Failed after 232 steps.
Repetition 1
Eval reward: 378.6 (+/-122.41829928568686)
Failed after 152 steps.
Repetition 2
Eval reward: 457.0 (+/-60.59372904847498)
Failed after 27 steps.
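The direction of the gap (evaluation always higher, never lower) is what a deterministic-vs-stochastic mismatch would produce. A minimal, library-free sketch with made-up toy numbers (not CartPole or stable-baselines itself): under the very same action probabilities, greedy action selection can keep an episode alive far longer than sampling.

```python
import random

random.seed(0)

# Toy stand-in for a trained policy: in every state it assigns
# probability 0.99 to the "correct" balancing action (index 0).
ACTION_PROBS = [0.99, 0.01]

def run_episode(deterministic, max_steps=500):
    """Episode ends the first time a 'wrong' action (index 1) is taken."""
    for step in range(max_steps):
        if deterministic:
            # always pick the most probable action (argmax / mode)
            action = ACTION_PROBS.index(max(ACTION_PROBS))
        else:
            # sample an action according to the policy's probabilities
            action = random.choices([0, 1], weights=ACTION_PROBS)[0]
        if action != 0:
            return step
    return max_steps

greedy = [run_episode(deterministic=True) for _ in range(10)]
sampled = [run_episode(deterministic=False) for _ in range(10)]
print("greedy mean steps: ", sum(greedy) / len(greedy))
print("sampled mean steps:", sum(sampled) / len(sampled))
```

The greedy rollouts always survive the full 500 steps, while the sampled ones tend to fail well before that, even though both use the identical policy.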
As you can see, there is a huge gap between the evaluation reward and the achieved reward when manually testing it afterwards. Of course, I don’t expect it to be equal to the evaluation mean reward. But I repeated this multiple times, and the evaluation reward is always much higher (never lower).
Did I miss something in the documentation? Or what’s the reason for this big gap?
I’m using Python 3.6, stable-baselines 2.10.0, Windows 10.
Issue Analytics
- Created 3 years ago
- Comments: 7
Top GitHub Comments
@Miffyli it seems he is using CartPole, so discrete actions (0 or 1).
EDIT: in that case the probability distribution is a Categorical one.
For continuous actions, it returns the mean of the Gaussian from which actions are sampled in the stochastic case. I.e. rather than sampling actions, and possibly getting different actions for the same observation, we always take the same action for the same observation.
In your case, it is probably sampling actions outside the [0, 1] interval, which are then clipped to [0, 1] to work with the environment, hence you see actions like that even with deterministic=False.
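To make the Categorical point concrete, here is a hypothetical sketch (not stable-baselines internals): a deterministic policy always returns the distribution's mode, while a stochastic one samples from it, so repeated queries at the same observation can return different actions.

```python
import random

random.seed(42)

# hypothetical Categorical distribution over 2 discrete actions
probs = [0.7, 0.3]

def predict(deterministic):
    if deterministic:
        # mode of the Categorical: the index with the highest probability
        return max(range(len(probs)), key=lambda a: probs[a])
    # stochastic case: sample an action index according to probs
    return random.choices(range(len(probs)), weights=probs)[0]

det = {predict(True) for _ in range(20)}   # only ever the mode
sto = {predict(False) for _ in range(20)}  # typically both actions appear
print(det)
print(sto)
```

Evaluating with the mode every step and then testing with sampled actions (or vice versa) means the two runs are effectively executing different policies, which is consistent with the gap reported above.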