
[bug] PPO2 episode reward summaries are written incorrectly for VecEnvs


Episode reward summaries are all concentrated together on a few steps, with jumps in between.

Zoomed out: [screenshot]

Zoomed in: [screenshot]

Every other summary looks fine: [screenshot]

To reproduce, run PPO2 on DummyVecEnv(["Pendulum-v0" for _ in range(8)]).
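A minimal reproduction sketch, assuming the stable-baselines 2.x API (DummyVecEnv takes a list of environment constructors rather than environment ids; the tensorboard_log path and total_timesteps are illustrative):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# 8 copies of Pendulum-v0 running in a single process
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO2("MlpPolicy", env, verbose=1, tensorboard_log="./ppo2_pendulum_tb/")
model.learn(total_timesteps=100000)
```

The episode reward summaries can then be inspected in TensorBoard under the chosen log directory.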

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 4
  • Comments: 16

Top GitHub Comments

3 reactions
paolo-viceconte commented, Mar 11, 2020

Hi, I also encountered some of the issues described in the comments above. A recap follows.

PPO2 tensorboard visualization issues

If you run PPO2 with a single process, training for 256 timesteps (N=1, T=256), and try to visualize the episode reward and the optimization statistics:

  1. the episode_reward is shifted by T (instead of being in [0,256], it is plotted in [256,512]), for the reason explained in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-552952355
  2. the loss statistics are associated with weird timesteps (i.e. [527,782]), obtained as a result of the timestep calculations highlighted in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-584530173

[screenshot: issue_10_marzo_1]

Moreover, if you try to plot data using multiple processes (for instance N=4 workers with T=256 timesteps per worker):

  3. the collected rewards are superposed in the first T timesteps, followed by a jump of (N-1)*T timesteps in the plot (illustrated in the toy sketch below)
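To make the two symptoms concrete, the following toy snippet reproduces the arithmetic as I read it from the linked comments; it mimics the reported behaviour rather than the actual stable-baselines logging code.

```python
# Toy illustration of the two symptoms; not the actual stable-baselines logging code.
T = 256                 # timesteps collected per worker per rollout

# Single-process case (N=1): the summary is written after the step counter has already
# advanced by a full batch, so a reward from true step t appears at t + T in the plot.
for t in (0, 128, 255):
    print(f"N=1: reward from step {t} is plotted at step {t + T}")

# Multi-process case (N=4): all workers are indexed by the same local counter, so their
# N reward curves are superposed over the first T steps, and the plot then jumps by
# (N - 1) * T, because the global counter advances by N*T while only T steps were drawn.
N = 4
print(f"N={N}: jump between consecutive rollouts = {(N - 1) * T} steps")
```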

PPO2 tensorboard visualization: proposed solution

I implemented the following solutions for the visualization issues:

  1. decreasing the timestep index by the batch size before plotting
  2. simplifying the logic for plotting the optimization statistics:
    • each optimization consists of K epochs over N*T//M minibatches (where M is the number of training timesteps per minibatch), so a fixed number of data points is collected during each optimization, namely K * N*T//M
    • in order to keep the episode reward and the optimization statistics visually comparable, these K * N*T//M data points are distributed evenly over the batch of N*T timesteps
  3. adding an offset for each process (see the sketch after this list)
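A minimal sketch of the index bookkeeping described above; the function and variable names are hypothetical and do not mirror the actual patch or the stable-baselines source.

```python
# Illustrative only; names are hypothetical.

def episode_reward_index(global_timestep, n_envs, n_steps):
    """Fix 1: shift the summary index back by the batch size N*T before plotting."""
    return global_timestep - n_envs * n_steps

def loss_summary_indices(global_timestep, n_envs, n_steps, n_epochs, minibatch_size):
    """Fix 2: spread the K * (N*T // M) optimization statistics evenly over the batch."""
    batch = n_envs * n_steps
    n_points = n_epochs * (batch // minibatch_size)
    start = global_timestep - batch
    return [int(start + (i + 1) * batch / n_points) for i in range(n_points)]

def worker_reward_index(local_timestep, worker_index, n_steps):
    """Fix 3: offset each worker so the N reward curves are plotted side by side."""
    return worker_index * n_steps + local_timestep
```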

As a result, in the cases shown above:

  1. the episode_reward is correctly plotted in [0,256]
  2. the loss statistics are plotted in [0,256] as well, equally distributed

[screenshot: issue_10_Marzo_3]

  3. the rewards collected by the N workers are plotted side by side

The modifications are few and straightforward. Regarding the side-by-side visualization of the rewards in the multi-process case, do you believe that plotting the mean and variance of the collected data would instead be more appropriate?

If that would be welcome, I can open a PR with the implemented modifications and update it if the mean-and-variance solution is preferred.
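For reference, the mean-and-variance alternative mentioned above could look roughly like this; the array shape and names are assumptions for illustration, not part of the proposed patch.

```python
import numpy as np

# Hypothetical layout: one reward trace per worker, shape (N, T).
N, T = 4, 256
episode_rewards = np.random.randn(N, T)

mean_per_step = episode_rewards.mean(axis=0)  # single curve instead of N side-by-side curves
std_per_step = episode_rewards.std(axis=0)    # could be logged as a separate scalar summary
```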

2 reactions
araffin commented, Mar 15, 2020

@paolo-viceconte thanks, I’ll try to take a look at what you did this week (unless @Miffyli can do it before); we have too many issues related to that function (cf. all the linked issues).

