[Question] Is the rollout parameter "ep_rew_mean" more relevant or the output of the Evaluation Helper
Hi all,
I would like to know whether the rollout parameter “ep_rew_mean” or the output of the Evaluation Helper from Stable-Baselines3 is more relevant. In my application the two sometimes differ strongly (while at other times they are similar).
I have a custom OpenAI-Gym environment in which I calculate a reward and return it from the step function. I have two methods of evaluating a trained model. One is a manual approach where I simply sum the rewards returned after each action is taken. The code looks like this:
```python
import numpy as np

from stable_baselines3 import A2C

# Create the custom environment (DSM_BT1_Env is the user-defined Gym environment;
# its import is not shown in the original snippet)
env = DSM_BT1_Env()

# Load the saved model
models_dir = "models/A2C"
model_path = f"{models_dir}/Started_22-04-2022--11-01-35_Episode_3_Timesteps_100000"
model = A2C.load(model_path, env=env)

logdir = "logs"
episodes = 200
averageCumulativeReward_HelpSum = 0
results_Overall = np.zeros(episodes)

print("\n \n \n \n Testing with trained model manually \n \n \n")
for episode in range(episodes):
    obs = env.reset()
    done = False
    print("episode: ", episode)
    timeslot = 1
    cumulativeReward = 0
    while not done:
        env.render()
        # deterministic=False belongs to model.predict, not env.step
        action, _ = model.predict(obs, deterministic=False)
        timeslot = timeslot + 1
        obs, reward, done, info = env.step(action)
        cumulativeReward += reward
    print(f"cumulative Reward: {round(cumulativeReward, 2)} \n")
    results_Overall[episode] = round(cumulativeReward, 2)
    averageCumulativeReward_HelpSum += cumulativeReward

env.close()
```
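To make the manual loop directly comparable to the Evaluation Helper's output, the per-episode returns collected above can be reduced to a mean and standard deviation. A minimal sketch, reusing results_Overall from the snippet above:

```python
import numpy as np

# Aggregate the manually collected per-episode returns so they can be compared
# one-to-one with the (mean_reward, std_reward) pair reported by evaluate_policy
manual_mean = np.mean(results_Overall)
manual_std = np.std(results_Overall)
print(f"manual mean_reward = {manual_mean:.2f}")
print(f"manual std_reward = {manual_std:.2f}")
```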
Furthermore, I use the Evaluation Helper from Stable-Baselines3 as a second method of evaluation. The code looks like this:
```python
from stable_baselines3 import A2C
from stable_baselines3.common.evaluation import evaluate_policy

episodes = 200

# Create the custom environment
env = DSM_BT1_Env()

# Load the saved model
models_dir = "models/A2C"
model_path = f"{models_dir}/Started_22-04-2022--11-01-35_Episode_3_Timesteps_100000"
model = A2C.load(model_path, env=env)

# Evaluate the loaded policy with stochastic (non-deterministic) actions
mean_reward, std_reward = evaluate_policy(
    model, model.get_env(), n_eval_episodes=episodes, deterministic=False
)

print("\n \n \n \n Testing with trained model Evaluation Helper \n")
print(f"mean_reward = {mean_reward}")
print(f"std_reward = {std_reward}")
```
Here you can see the results of four successively trained models in TensorBoard for the rollout parameter “ep_rew_mean”.
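(For context, a minimal sketch of how TensorBoard logging of the rollout statistics is typically enabled in Stable-Baselines3; the MlpPolicy choice here is an assumption and not taken from the original training script:)

```python
from stable_baselines3 import A2C

# tensorboard_log makes A2C write rollout/ep_rew_mean (and other statistics)
# to the given directory during training
model = A2C("MlpPolicy", DSM_BT1_Env(), tensorboard_log=logdir, verbose=1)
model.learn(total_timesteps=100_000)
# Inspect the curves with: tensorboard --logdir logs
```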
So I have four trained models (confusingly, I name them Episode_1 to Episode_4; this should not be confused with the episode runs of an OpenAI-Gym custom environment). The average results for the manual evaluation and the Evaluation Helper over 200 runs are:
What is striking is that while for Episode_2 and Episode_4 the rollout parameter “ep_rew_mean” is roughly in line with the manually calculated reward and the reward of the Evaluation Helper, for Episode_1 and Episode_3 this is definitely not the case: there is a huge difference between the rollout parameter “ep_rew_mean” and the real reward.
It might be worth mentioning that I also print out the reward within the OpenAI-Gym environment for every timeslot and sum it up. The resulting reward after each run in the OpenAI-Gym environment is always identical to the values from the manual evaluation or the Evaluation Helper. This is why I call it the real reward. As far as I can see, there can’t be a mistake within the OpenAI-Gym environment regarding the calculation of the reward, as the sum over all returned rewards is always equal to the output of the Evaluation Helper.
Now my question is whether the rollout parameter “ep_rew_mean” or the output of the Evaluation Helper (or the manual evaluation) is more relevant. In my example, according to the rollout parameter “ep_rew_mean”, Episode_4 is better than Episode_3, but when calculating the real reward both the Evaluation Helper and the manual evaluation favour Episode_3. Normally I would say that the output of the Evaluation Helper is more relevant, as it is always identical to the returned sum of rewards of the OpenAI-Gym environment and most of the time has values quite similar to the manual evaluation (which also just sums over the returned rewards of the OpenAI-Gym environment). However, during the training process in Stable-Baselines3 the rollout parameter “ep_rew_mean” is always mentioned, which is why I am wondering whether the algorithms in Stable-Baselines3 optimize for the real reward or for the rollout parameter “ep_rew_mean”.
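For reference, “ep_rew_mean” is a logging statistic, not a separate optimization target: it is essentially the mean return of the most recent training episodes (roughly the last 100), as recorded by the Monitor wrapper while rollouts are collected with the stochastic, still-changing policy. A minimal sketch of that bookkeeping, using CartPole-v1 and random actions purely as a stand-in for the actual environment and policy, and the old Gym step API that the snippets above already use:

```python
from collections import deque

import gym
import numpy as np
from stable_baselines3.common.monitor import Monitor

# Monitor records each finished episode's return in info["episode"]["r"];
# the logger averages these over a buffer of recent episodes for ep_rew_mean.
env = Monitor(gym.make("CartPole-v1"))  # stand-in for DSM_BT1_Env
ep_info_buffer = deque(maxlen=100)      # buffer of recent episode infos

obs = env.reset()
for _ in range(10_000):
    obs, reward, done, info = env.step(env.action_space.sample())  # random actions, for illustration only
    if done:
        ep_info_buffer.append(info["episode"])
        obs = env.reset()

print("ep_rew_mean ≈", np.mean([ep_info["r"] for ep_info in ep_info_buffer]))
```

Because these episodes are gathered while the policy is still being updated (and acting stochastically), the statistic lags behind the final model's performance, which is one common reason why it can differ from an evaluation of the finished model as done by the Evaluation Helper.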
Top GitHub Comments
@Miffyli: Thanks a lot for your answer and effort, I really appreciate it. Okay, I understand. Maybe one last question on this topic, which is also the core of my question: is it possible to let the algorithms from Stable-Baselines3 optimize for the output of the Evaluation Helper instead of the rollout parameter “ep_rew_mean”? During the training process, the rollout parameter “ep_rew_mean” is always mentioned, which is why I assume that the algorithms in Stable-Baselines3 naturally optimize for that parameter. In my example this is not the “correct” parameter, as I would like to optimize for the sum of cumulative rewards, which is essentially the output of the Evaluation Helper.
It should of course optimize over the rewards defined by your environment; the answer is elsewhere. I also have a discrepancy between evaluation rewards and rollout rewards…
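A side note on the follow-up question above: the algorithms always optimize the reward defined by the environment, and “ep_rew_mean” is only a logging statistic. If model selection should nevertheless be driven by evaluation-style rollouts rather than the training statistic, Stable-Baselines3's EvalCallback can run periodic evaluations during training and keep the best model according to the evaluation reward. A minimal sketch; the paths, frequencies and MlpPolicy are illustrative assumptions:

```python
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.monitor import Monitor

# Separate, Monitor-wrapped environment used only for evaluation
eval_env = Monitor(DSM_BT1_Env())

# Periodically evaluates the current policy and saves the best model
# according to the evaluation reward (not rollout/ep_rew_mean)
eval_callback = EvalCallback(
    eval_env,
    n_eval_episodes=20,
    eval_freq=10_000,
    deterministic=False,
    best_model_save_path="models/A2C_best",  # illustrative path
    log_path="logs/eval",                    # illustrative path
)

model = A2C("MlpPolicy", DSM_BT1_Env(), verbose=1)
model.learn(total_timesteps=100_000, callback=eval_callback)
```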