[Feature Request] Log actual return, and not only mean return

🚀 Log actual return, and not only mean return

Motivation

When you use Monitor, the value logged (to TensorBoard and the terminal) is the mean return over the last 100 episodes. This happens because self.ep_info_buffer is instantiated as a deque(maxlen=100) in the BaseAlgorithm class, and the “_dump_logs” method logs the mean of the values in this buffer. I think it is more “natural” to log the actual episode return (the total sum of rewards in that specific episode), which is what is commonly reported in papers.
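To make the difference concrete, here is a minimal, library-free sketch of the two quantities. The buffer only mimics the deque(maxlen=100) described above; the reward values and the helper name rolling_mean are made up for illustration and are not part of the library.

```python
from collections import deque


def rolling_mean(buffer):
    """Mean of the episode returns currently held in the buffer."""
    returns = [ep_info["r"] for ep_info in buffer]
    return sum(returns) / len(returns) if returns else float("nan")


# Stand-in for self.ep_info_buffer: keeps only the 100 most recent episodes.
ep_info_buffer = deque(maxlen=100)

# Hypothetical data: 150 episodes with return 0, then one episode with return 100.
for episode_return in [0.0] * 150 + [100.0]:
    ep_info_buffer.append({"r": episode_return, "l": 1})

mean_return = rolling_mean(ep_info_buffer)  # analogous to rollout/ep_rew_mean
last_return = ep_info_buffer[-1]["r"]       # the per-episode value this issue asks for
print(mean_return, last_return)
```

With this made-up data the windowed mean barely moves while the last episode's return jumps, which is the gap the issue is about.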

Pitch

Log the return (total episode reward) every time an episode ends.

Alternatives

This can be done simply by adding the line below to the method “_dump_logs()” of the class OffPolicyAlgorithm:

if len(self.ep_info_buffer) > 0 and len(self.ep_info_buffer[0]) > 0:
    logger.record("rollout/return", self.ep_info_buffer[-1]["r"])  # ADD THIS LINE
    logger.record("rollout/ep_rew_mean", safe_mean([ep_info["r"] for ep_info in self.ep_info_buffer]))
    logger.record("rollout/ep_len_mean", safe_mean([ep_info["l"] for ep_info in self.ep_info_buffer]))
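A possible variation that avoids editing the library is to pick up each finished episode's return where it is produced. The sketch below assumes the Monitor wrapper attaches an "episode" entry (with "r" and "l" keys, the same keys read from the buffer above) to the step's info dict when an episode ends. The function name and the sample infos are hypothetical, and wiring this into a callback that records the value is left out.

```python
def finished_episode_returns(infos):
    """Return the "r" value of every info dict that carries an "episode" entry."""
    return [info["episode"]["r"] for info in infos if "episode" in info]


# Hypothetical infos for three parallel environments at a single step:
# only the second environment finished an episode on this step.
infos = [
    {},
    {"episode": {"r": 12.5, "l": 40}},
    {"some_other_key": False},
]

returns = finished_episode_returns(infos)
print(returns)
```

Unlike reading self.ep_info_buffer[-1] at logging time, this looks at every step, so no episode is skipped when several episodes finish between two log dumps.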

Additional context

rollout/ep_rew_mean vs actual return (created by me using the modification above).

Checklist

I have checked that there is no similar issue in the repo (required)

Top GitHub Comments

araffin commented, Feb 8, 2021

> because the goal of RL is to maximize return, and it is what someone would most probably report on a paper.

I would disagree with that statement. First, as mentioned in the documentation, most of the time you should not report the training reward but use a separate environment for periodic evaluation (with deterministic actions, except for Atari games). This is what the EvalCallback is meant for (it is included in the RL Zoo, together with a plotting script; see reproducing results). Second, when papers do report the training reward, they usually apply a smoothing window, because otherwise the curve is too noisy to be readable. Also, if you want to report the episode return with different windows (or with no smoothing window at all), we save the monitor files and have a plotting script for that in the zoo too.
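As a rough illustration of what "different windows (or no smoothing window at all)" means, here is a minimal sketch of a trailing moving average over per-episode returns. It is not the zoo's plotting script; the function name smooth and the sample returns are assumptions made for the example.

```python
def smooth(returns, window):
    """Trailing moving average; window=1 leaves the series unchanged."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(returns):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= returns[i - window]
        # Early entries average over fewer than `window` episodes.
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Hypothetical per-episode returns, e.g. the "r" values of finished episodes.
episode_returns = [0.0, 10.0, 20.0, 10.0, 40.0]

raw = smooth(episode_returns, window=1)       # no smoothing: the actual returns
windowed = smooth(episode_returns, window=2)  # mean over the last 2 episodes
print(raw)
print(windowed)
```

Setting window=100 on the same data would approximate the rollout/ep_rew_mean curve, while window=1 gives the per-episode return requested in the issue.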