
[bug] PPO2 episode reward summaries are written incorrectly for VecEnvs


Episode reward summaries are all concentrated together on a few steps, with jumps in between.

Zoomed out: [screenshot]

Zoomed in: [screenshot]

Every other summary looks fine: [screenshot]

To reproduce, run PPO2 on DummyVecEnv(["Pendulum-v0" for _ in range(8)]).
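A minimal reproduction sketch, assuming the stable-baselines 2.x API (DummyVecEnv takes a list of environment constructors rather than environment ids; the tensorboard_log path and total_timesteps are illustrative):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# 8 copies of Pendulum-v0 running in a single process
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO2("MlpPolicy", env, verbose=1, tensorboard_log="./ppo2_pendulum_tb/")
model.learn(total_timesteps=100000)
```

The episode reward summaries can then be inspected in TensorBoard under the chosen log directory.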

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 16

Top GitHub Comments

3 reactions
paolo-viceconte commented, Mar 11, 2020

Hi, I also encountered some of the issues described in the comments above. A recap follows.

PPO2 tensorboard visualization issues

If you run PPO2 with a single process, training for 256 timesteps (N=1, T=256), and try to visualize the episode reward and the optimization statistics:

  1. the episode_reward is shifted by T (instead of being in [0,256], it is plotted in [256,512]), for the reason explained in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-552952355
  2. the loss statistics are associated with weird timesteps (i.e. [527,782]), obtained as a result of the timestep calculations highlighted in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-584530173

[screenshot: issue_10_marzo_1]

Moreover, if you try to plot data using multiple processes (for instance N=4 workers with T=256 timesteps per worker):

  3. the collected rewards are superposed in the first T timesteps, followed by a jump of (N-1)*T timesteps in the plot (illustrated in the toy sketch below)
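To make the two symptoms concrete, the following toy snippet reproduces the arithmetic as I read it from the linked comments; it mimics the reported behaviour rather than the actual stable-baselines logging code.

```python
# Toy illustration of the two symptoms; not the actual stable-baselines logging code.
T = 256                 # timesteps collected per worker per rollout

# Single-process case (N=1): the summary is written after the step counter has already
# advanced by a full batch, so a reward from true step t appears at t + T in the plot.
for t in (0, 128, 255):
    print(f"N=1: reward from step {t} is plotted at step {t + T}")

# Multi-process case (N=4): all workers are indexed by the same local counter, so their
# N reward curves are superposed over the first T steps, and the plot then jumps by
# (N - 1) * T, because the global counter advances by N*T while only T steps were drawn.
N = 4
print(f"N={N}: jump between consecutive rollouts = {(N - 1) * T} steps")
```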

PPO2 tensorboard visualization: proposed solution

I implemented the following solutions for the visualization issues:

  1. decreasing the timestep index by the batch size before plotting
  2. simplifying the logic for plotting the optimization statistics:
    • each optimization consists of K epochs over N*T//M minibatches (where M is the number of training timesteps per minibatch), so a fixed number of data points is collected during each optimization, namely K * N*T//M
    • in order to keep the episode reward and the optimization statistics visually comparable, these K * N*T//M data points are distributed evenly over the batch of N*T timesteps
  3. adding an offset for each process (see the sketch after this list)
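A minimal sketch of the index bookkeeping described above; the function and variable names are hypothetical and do not mirror the actual patch or the stable-baselines source.

```python
# Illustrative only; names are hypothetical.

def episode_reward_index(global_timestep, n_envs, n_steps):
    """Fix 1: shift the summary index back by the batch size N*T before plotting."""
    return global_timestep - n_envs * n_steps

def loss_summary_indices(global_timestep, n_envs, n_steps, n_epochs, minibatch_size):
    """Fix 2: spread the K * (N*T // M) optimization statistics evenly over the batch."""
    batch = n_envs * n_steps
    n_points = n_epochs * (batch // minibatch_size)
    start = global_timestep - batch
    return [int(start + (i + 1) * batch / n_points) for i in range(n_points)]

def worker_reward_index(local_timestep, worker_index, n_steps):
    """Fix 3: offset each worker so the N reward curves are plotted side by side."""
    return worker_index * n_steps + local_timestep
```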

As a result, in the cases shown above:

  1. the episode_reward is correctly plotted in [0,256]
  2. the loss statistics are plotted in [0,256] as well, equally distributed

[screenshot: issue_10_Marzo_3]

  3. the rewards collected by the N workers are plotted side by side

The modifications are few and straightforward. Regarding the side-by-side visualization of the rewards in the multi-process case, do you believe that plotting the mean and variance of the collected data would instead be more appropriate?

If that would be welcome, I can open a PR with the implemented modifications and update it if the mean-and-variance solution is preferred.
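For reference, the mean-and-variance alternative mentioned above could look roughly like this; the array shape and names are assumptions for illustration, not part of the proposed patch.

```python
import numpy as np

# Hypothetical layout: one reward trace per worker, shape (N, T).
N, T = 4, 256
episode_rewards = np.random.randn(N, T)

mean_per_step = episode_rewards.mean(axis=0)  # single curve instead of N side-by-side curves
std_per_step = episode_rewards.std(axis=0)    # could be logged as a separate scalar summary
```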

2 reactions
araffin commented, Mar 15, 2020

@paolo-viceconte thanks, I’ll try to take a look at what you did this week (unless @Miffyli can do it before); we have too many issues related to that function (cf. all the linked issues).

