[question] evaluate_policy overestimates trained agent?

In an example, I used PPO2 to solve CartPole. I use evaluate_policy to evaluate the trained policy before visualizing it, and I noticed that the mean reward reported by evaluate_policy is consistently and considerably higher than what the same trained agent achieves during visualization.

See the example here:

import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO2
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines.bench import Monitor

# repeat 3 times to validate
for rep in range(3):
    print(f"\nRepetition {rep}")
    env = gym.make('CartPole-v1')
    model = PPO2(MlpPolicy, Monitor(env, filename=f'logs/CartPole-v1/PPO2/'), verbose=0).learn(10000)

    # evaluate
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
    print(f"Eval reward: {mean_reward} (+/-{std_reward})")

    # test and visualize
    obs = env.reset()
    for i in range(500):
        action, _states = model.predict(obs)
        obs, rewards, done, info = env.step(action)
        if done:
            print(f"Failed after {i} steps.")
            break
        # env.render()

Running this produced:

Repetition 0
Eval reward: 500.0 (+/-0.0)
Failed after 232 steps.

Repetition 1
Eval reward: 378.6 (+/-122.41829928568686)
Failed after 152 steps.

Repetition 2
Eval reward: 457.0 (+/-60.59372904847498)
Failed after 27 steps.

As you can see, there is a huge gap between the evaluation reward and the reward the agent actually achieves when I test it manually afterwards. Of course, I don’t expect the two to be exactly equal. But I repeated this multiple times, and the evaluation reward is always much higher (never lower).

Did I miss something in the documentation? Or what is the reason for this big gap? I expected the evaluation reward to be a realistic estimate of the trained agent’s actual performance.

I’m using Python 3.6, stable-baselines 2.10.0, Windows 10.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7

Top GitHub Comments

1 reaction
araffin commented, Apr 28, 2020

@Miffyli it seems he is using CartPole, so discrete actions (0 or 1).

EDIT: in that case the probability distribution is a Categorical one.
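
For illustration, here is a minimal numpy sketch (not stable-baselines’ actual implementation) of what deterministic versus stochastic action selection means for a Categorical distribution over the two CartPole actions; the probabilities are made up:

import numpy as np

# Hypothetical action probabilities produced by the policy for one observation
probs = np.array([0.55, 0.45])  # CartPole-v1 has two discrete actions: 0 and 1

# Stochastic selection: sample an action, so the same observation
# can lead to different actions on different calls
stochastic_action = np.random.choice(len(probs), p=probs)

# Deterministic selection: always take the most probable action (the mode)
deterministic_action = int(np.argmax(probs))

print(stochastic_action, deterministic_action)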

1 reaction
Miffyli commented, Apr 28, 2020

For continuous actions, it returns the mean of the Gaussian from which actions are sampled in the stochastic case. I.e., rather than sampling actions (and possibly getting different actions for the same observation), the same action is always taken for the same observation.

In your case, it is probably sampling actions outside the [0, 1] interval, which are then clipped to [0, 1] so they work with the environment, hence you see actions like that even with deterministic=False.
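
If the gap does come from stochastic versus deterministic action selection, a quick way to check is to evaluate with both settings and to make the manual rollout deterministic as well. A sketch, reusing model and env from the question above; as far as I can tell from the 2.10 API, evaluate_policy defaults to deterministic=True while model.predict defaults to deterministic=False:

from stable_baselines.common.evaluation import evaluate_policy

# Compare deterministic (evaluate_policy's default) and stochastic evaluation
det_mean, det_std = evaluate_policy(model, env, n_eval_episodes=10, deterministic=True)
sto_mean, sto_std = evaluate_policy(model, env, n_eval_episodes=10, deterministic=False)
print(f"Deterministic eval: {det_mean} (+/-{det_std})")
print(f"Stochastic eval:    {sto_mean} (+/-{sto_std})")

# Make the manual rollout match the evaluation by also acting deterministically
obs = env.reset()
for i in range(500):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        print(f"Episode ended after {i} steps.")
        break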

