
[Question] Is the rollout parameter "ep_rew_mean" more relevant or the output of the Evaluation Helper

See original GitHub issue

Hi all,

I would like to know whether the rollout parameter “ep_rew_mean” or the output of the Evaluation Helper from Stable-Baselines3 is more relevant. In my application they sometimes differ strongly (while sometimes being similar).

I have a custom OpenAI-Gym environment in which I calculate a reward and return it from the step function. Now I have two methods of evaluating a trained model. One is a manual approach where I just sum over all rewards after each action is taken. The code looks like this:

# Load a trained model
import numpy as np  # needed for the results array below

from stable_baselines3 import A2C

# Create the environment
env = DSM_BT1_Env()

# load the saved model
models_dir = "models/A2C"
model_path = f"{models_dir}/Started_22-04-2022--11-01-35_Episode_3_Timesteps_100000"
model = A2C.load(model_path, env=env)

logdir = "logs"

episodes = 200
averageCumulativeReward_HelpSum = 0
results_Overall = np.zeros(episodes)


print("\n \n \n \n Testing with trained model manually \n \n \n")

for episode in range(episodes):
    obs = env.reset()
    done = False
    print("episode: ", episode)
    timeslot = 1
    cumulativeReward = 0
    while not done:
        env.render()
        # deterministic is an argument of model.predict(), not of env.step()
        action, _ = model.predict(obs, deterministic=False)
        timeslot = timeslot + 1

        obs, reward, done, info = env.step(action)
        cumulativeReward += reward
    print(f"cumulative Reward: {round(cumulativeReward,2)} \n")
    results_Overall [episode] = round(cumulativeReward,2)
    averageCumulativeReward_HelpSum += cumulativeReward

env.close()
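
The averaging of averageCumulativeReward_HelpSum over the 200 runs is presumably done right after this loop; a one-line sketch of that final step (not part of the original snippet), added only for completeness:

# Hypothetical final step, not in the original snippet: report the average over all runs
print(f"average cumulative reward over {episodes} runs: {round(averageCumulativeReward_HelpSum / episodes, 2)}")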

Furthermore, I use the Evaluation Helper from Stable-Baselines3 as a second means of evaluation. The code looks like this:

from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy
episodes = 200
# Create the environment
env = DSM_BT1_Env()

# load the saved model
models_dir = "models/A2C"
model_path = f"{models_dir}/Started_22-04-2022--11-01-35_Episode_3_Timesteps_100000"
model = A2C.load(model_path, env=env)

mean_reward, std_reward = evaluate_policy(model, model.get_env(), n_eval_episodes=episodes, deterministic=False)
print("\n \n \n \n Testing with trained model Evaluation Helper \n")
print(f"mean_reward = {mean_reward}")
print(f"std_reward = {std_reward}")

Here you can see the results of 4 successively trained models in TensorBoard for the rollout parameter “ep_rew_mean”: [TensorBoard screenshot: rollout/ep_rew_mean curves for the four models]
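
For reference, rollout/ep_rew_mean appears in TensorBoard when training is launched with a tensorboard_log directory. A minimal sketch of that setup (the policy type and log paths here are assumptions, not the exact training script behind the screenshots):

# Sketch only: enable TensorBoard logging of rollout/ep_rew_mean during training
from stable_baselines3 import A2C

env = DSM_BT1_Env()                                    # custom env from the question
model = A2C("MlpPolicy", env, verbose=1, tensorboard_log="logs")   # "MlpPolicy" is assumed
model.learn(total_timesteps=100_000, tb_log_name="A2C")
# Inspect the curves afterwards with:  tensorboard --logdir logs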

So I have 4 trained models (which I confusingly name Episode_1 to Episode_4; this should not be confused with the episode runs of an OpenAI-Gym custom environment). The average results for the manual evaluation and the Evaluation Helper over 200 runs are: [screenshot: table of average results per model]

What is striking is that while for Episode_2 and Episode_4 the rollout parameter “ep_rew_mean” is roughly in line with the manually calculated reward and the reward of the Evaluation Helper, for Episode_1 and Episode_3 this is definitely not the case: there is a huge difference between the rollout parameter “ep_rew_mean” and the real reward.

It might be worth mentioning that I also print out the reward within the OpenAI-Gym environment for every timeslot and sum it up. The resulting reward after each run in the OpenAI-Gym environment is always identical to the values from the manual evaluation or the Evaluation Helper. This is why I call it the real reward. As far as I can see, there can’t be a mistake within the OpenAI-Gym environment regarding the calculation of the reward, as the sum over all returned rewards is always equal to the output of the Evaluation Helper.

Now my question is whether the rollout parameter “ep_rew_mean” or the output of the Evaluation Helper (or the manual evaluation) is more relevant. In my example, according to the rollout parameter “ep_rew_mean”, Episode_4 is better than Episode_3, but when calculating the real reward both the Evaluation Helper and the manual evaluation favour Episode_3. Normally I would say that the output of the Evaluation Helper is more relevant, as it is always identical to the returned sum of rewards of the OpenAI-Gym environment and most of the time has values quite similar to the manual evaluation (which also just sums over the returned rewards of the OpenAI-Gym environment). However, during the training process in Stable-Baselines3 the rollout parameter “ep_rew_mean” is always reported, which is why I am wondering whether the algorithms in Stable-Baselines3 optimize for the real reward or for the rollout parameter “ep_rew_mean”.
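
For context on why the two numbers can differ: rollout/ep_rew_mean is a rolling mean over the most recent training episodes (the last 100 by default, recorded by the Monitor wrapper while the agent is still learning and exploring), whereas the Evaluation Helper runs fresh episodes with the current policy after training. A minimal sketch of both quantities side by side (assuming the custom DSM_BT1_Env from above and an "MlpPolicy"; this is not code from the issue):

import numpy as np
from stable_baselines3 import A2C
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.evaluation import evaluate_policy

# Monitor records the per-episode return of every training episode;
# the logger's rollout/ep_rew_mean is the mean over the most recent ones.
env = Monitor(DSM_BT1_Env())
model = A2C("MlpPolicy", env, verbose=1)   # "MlpPolicy" is an assumption
model.learn(total_timesteps=100_000)

# Roughly what is reported as rollout/ep_rew_mean at the end of training
# (episodes collected *during* learning, with a policy that was still changing):
train_ep_rew_mean = np.mean([ep_info["r"] for ep_info in model.ep_info_buffer])

# What the Evaluation Helper reports (fresh episodes with the final policy):
eval_mean, eval_std = evaluate_policy(model, model.get_env(), n_eval_episodes=200, deterministic=False)

print(train_ep_rew_mean, eval_mean, eval_std)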

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 13 (1 by maintainers)

Top GitHub Comments

1 reaction
PBerit commented, May 4, 2022

@Miffyli: Thanks a lot for your answer and effort. I really appreciate it. Okay, I understand. Maybe one last question on this topic, which is also the core of my question: is it possible to let the algorithms from Stable-Baselines 3 optimize for the output of the Evaluation Helper instead of the rollout parameter “ep_rew_mean”? During the training process, the rollout parameter “ep_rew_mean” is always mentioned, which is why I assume that the algorithms in Stable-Baselines 3 naturally optimize for that parameter, which in my example is not the “correct” parameter, as I would like to optimize for the sum of cumulative rewards, which is essentially the output of the Evaluation Helper.
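
Not a way to change what the algorithm optimizes, but if the goal is to track the Evaluation Helper’s number during training and keep the model that scores best on it, Stable-Baselines3’s EvalCallback does that (it calls evaluate_policy periodically and saves the best model so far). A rough sketch, with paths, frequencies and episode counts made up and DSM_BT1_Env assumed from above:

from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

train_env = Monitor(DSM_BT1_Env())
eval_env = Monitor(DSM_BT1_Env())           # separate env used only for evaluation

# Every eval_freq steps, run evaluate_policy on eval_env and save the best model so far
eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=20,                     # made-up value
    eval_freq=10_000,                       # made-up value
    deterministic=False,
    log_path="logs/eval",                   # made-up path
    best_model_save_path="models/A2C_best", # made-up path
)

model = A2C("MlpPolicy", train_env, verbose=1)   # "MlpPolicy" is an assumption
model.learn(total_timesteps=100_000, callback=eval_callback)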

0 reactions
ReHoss commented, Nov 12, 2022

Thanks Miffyli for your answer and effort. I really appreciate it. For me it is kind of strange to hear that the algorithms of Stable-Baselines 3 do not necessarily optimize for the reward that I define in the step function of the OpenAI-Gym environment. This makes it quite hard (and almost impossible) to optimize for anything, as I don’t know what effects my defined reward will have on the optimization process of the Stable-Baselines 3 algorithms. I am strongly questioning the rationale behind this approach. If Stable-Baselines 3 somehow transforms the reward that I define in the step function of the OpenAI-Gym environment, it is quite unclear what I need to do in order to get good reward values in the end, as the reward I want to optimize, and that I implemented, is not the optimization goal of the Stable-Baselines 3 algorithms.

It should of course optimize over the rewards defined by your environment; the answer is elsewhere. I also have a discrepancy between evaluation rewards and rollout rewards…

Read more comments on GitHub
