[bug] PPO2 episode reward summaries are written incorrectly for VecEnvs
Episode reward summaries are all concentrated together on a few steps, with jumps in between.
[Screenshots: the episode reward summary zoomed out and zoomed in, showing the clustered points and jumps; every other summary looks fine.]
To reproduce, run PPO2 on `DummyVecEnv(["Pendulum-v0" for _ in range(8)])`.
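A minimal repro sketch, assuming stable-baselines 2.x and classic Gym (the log directory and timestep budget are placeholders; note that `DummyVecEnv` takes callables that build the environments):

```python
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

# Eight copies of Pendulum-v0 stepped sequentially in a single process.
env = DummyVecEnv([lambda: gym.make("Pendulum-v0") for _ in range(8)])

model = PPO2("MlpPolicy", env, tensorboard_log="./ppo2_pendulum_tb/")
model.learn(total_timesteps=100000)
# Then inspect the episode reward summary with:
#   tensorboard --logdir ./ppo2_pendulum_tb/
```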
Issue Analytics
- Created: 5 years ago
- Reactions: 4
- Comments: 16
Top GitHub Comments
Hi, I also encountered some issues described in the comments above. A recap follows.
PPO2 tensorboard visualization issues
If you run PPO2 with a single process training for 256 timesteps (`N=1`, `T=256`) and try to visualize the episode reward and the optimization statistics:
- the episode reward is shifted by `T` timesteps (instead of being in [0,256], it is plotted in [256,512]) for the reason explained in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-552952355
- the optimization statistics are misplaced because of the `timestep` calculations highlighted in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-584530173

Moreover, if you try to plot data using multiple processes (for instance `N=4` workers with `T=256` timesteps per worker):
- the episode reward is concentrated in `T` timesteps followed by a jump of `(N-1)*T` timesteps in the plot (a toy illustration of this indexing follows below)
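A toy calculation may help picture both effects (a hypothetical sketch, assuming a single global step counter that is advanced by the whole batch before the summaries are written; this is not the actual stable-baselines logging code):

```python
# Hypothetical illustration of the reported indexing, not stable-baselines code.
N, T = 4, 256  # parallel workers, timesteps per worker per rollout

windows = []
num_timesteps = 0
for update in range(3):
    num_timesteps += N * T  # the counter jumps by the full batch first ...
    # ... and only then is a T-wide window of summaries written
    windows.append((num_timesteps, num_timesteps + T - 1))

print(windows)  # [(1024, 1279), (2048, 2303), (3072, 3327)]
# With N=1 this yields the [256, 512] shift; with N=4, consecutive windows
# are separated by a gap of (N-1)*T = 768 timesteps.
```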
PPO tensorboard visualization proposed solution
I implemented the following solutions for the visualization issues:
- the optimization consists of `K` epochs on `N*T//M` minibatches (`M` being the training timesteps related to a minibatch), therefore a fixed number of data points is collected during the optimization, namely `K * N*T//M`
- the `K * N*T//M` optimization data points are equally distributed over the batch size `N*T` (see the sketch after this list)
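A small numeric sketch of that even distribution, reusing the `N`, `T`, `M`, `K` symbols above (the concrete values are made up for illustration):

```python
# Hypothetical sketch of the proposed fix: spread the K * (N*T // M) optimization
# summaries evenly over the N*T timesteps of the batch they were computed on.
N, T = 4, 256  # parallel workers, timesteps per worker
M, K = 64, 4   # timesteps per minibatch, optimization epochs

points_per_update = K * (N * T // M)   # 64 summaries per update
spacing = (N * T) / points_per_update  # == M / K timesteps between summaries

batch_start = 0  # global timestep at which this rollout started
steps = [round(batch_start + i * spacing) for i in range(points_per_update)]
print(spacing, steps[0], steps[-1])  # 16.0 0 1008 -- all inside [0, N*T)
```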
As a result, in the showcases above:
- the episode rewards of the `N` workers are plotted side by side

The modifications are just a few and straightforward. Regarding the side-by-side visualization of the rewards in the multiprocess case, do you believe that plotting the mean and variance of the collected data would instead be more appropriate? (A sketch of that alternative follows below.)
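For reference, the mean-and-variance alternative could be as simple as aggregating the per-worker rewards before logging (a hypothetical sketch assuming numpy; the reward values are invented):

```python
import numpy as np

# Per-worker episode rewards collected at one update (invented values, N = 4).
worker_rewards = np.array([-1500.0, -1350.0, -1420.0, -1380.0])

mean_reward = worker_rewards.mean()  # one curve instead of N side-by-side curves
std_reward = worker_rewards.std()    # drawn as a shaded band around the mean
print(mean_reward, std_reward)
```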
If it would be appreciated, I can open a PR with the implemented modifications and update it if the mean-and-variance solution is recommended.
@paolo-viceconte thanks, I’ll try to take a look at what you did this week (unless @Miffyli can do it before); we have too many issues related to that function (cf. all linked issues).