MPIAdam synchronization error in PPO1
Describe the bug
A simple multi-worker run of PPO1 crashes. The assertion thetaroot == thetalocal in MpiAdam fails, and it is not a NaN problem: the two parameter vectors simply contain slightly different floats. The same run does not crash in baselines.
Code example
Minimal reproducible example (crashes when launched with mpirun -n 2, as in the stdout below):
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO1
env = gym.make("CartPole-v1")
model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)
System Info
- Installed from source in virtual environment
- No GPU
- Python 3.6.5
- mpi4py==3.0.0
- tensorflow==1.8.0
- Open MPI 3.1.1
- commit 4983566292a5d3ae0ed1a6bce84a8ac8278e3de5
Stdout + Traceback
(venv) petersen33md:runs petersen33md$ mpirun -n 2 python ppo1_test.py
********** Iteration 0 ************
…
********** Iteration 6 ************
Optimizing...
pol_surr | pol_entpen | vf_loss | kl | ent
Optimizing...
pol_surr | pol_entpen | vf_loss | kl | ent
-0.00082 | -0.00627 | 117.56442 | 8.56e-05 | 0.62709
-0.00030 | -0.00630 | 128.11664 | 7.79e-05 | 0.63015
Traceback (most recent call last):
File "ppo1_test.py", line 160, in <module>
model.learn(total_timesteps=10000, callback=callback)
File "/Users/petersen33/repositories/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 272, in learn
self.adam.update(grad, self.optim_stepsize * cur_lrmult)
File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 48, in update
self.check_synced()
File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 83, in check_synced
assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
AssertionError: (array([ 0.04382617, -0.0679653 , -0.11690815, ..., 0.00065254,
0. , 0. ], dtype=float32), array([ 0.04383327, -0.06797152, -0.11691316, ..., 0.00065254,
0. , 0. ], dtype=float32))
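The two arrays in the failed assertion differ only in the low decimal places, which looks like numerical drift between workers rather than NaNs. As a hedged toy illustration (names and numbers are ours, not from stable-baselines), two workers that apply the same gradient with slightly different annealed learning rates produce parameter vectors that fail exactly this kind of bitwise comparison:

```python
import numpy as np

# Hypothetical illustration: each worker anneals its learning rate from its
# own local timestep counter. Once those counters differ, identical gradient
# updates yield slightly different parameters on each worker.
def annealed_lr(base_lr, timesteps_so_far, total_timesteps):
    """Linear annealing of the learning rate, as with schedule='linear'."""
    return base_lr * max(1.0 - timesteps_so_far / total_timesteps, 0.0)

theta_a = np.zeros(3, dtype=np.float32)  # worker A's parameters
theta_b = np.zeros(3, dtype=np.float32)  # worker B's parameters
grad = np.array([1.0, -2.0, 0.5], dtype=np.float32)  # same gradient on both

# Worker A believes 4000 timesteps have elapsed; worker B believes 3800.
theta_a = theta_a - annealed_lr(3e-4, 4000, 10000) * grad
theta_b = theta_b - annealed_lr(3e-4, 3800, 10000) * grad

# A bitwise sync check like check_synced() must now fail:
assert not (theta_a == theta_b).all()
```

This is only a sketch of the failure mode; the actual cause in this issue is diagnosed in the comments below.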
Issue Analytics
- Created 5 years ago
- Comments: 7
Top GitHub Comments
Update: I spotted the bug. I noticed that the error did not persist with the PPO1 argument schedule="constant" (the default value is "linear"). Annealing occurs based on the value of timesteps_so_far.

In baselines, timesteps_so_far is calculated by MPI-gathering episodes across all workers. Relevant baselines code here:

However, in stable-baselines, timesteps_so_far is based on the current worker only (which apparently can differ):

The "total_timesteps" key (which isn't in baselines) was added at some point to avoid the "mean of an empty slice" warning when no episodes had completed. But the local values were never MPI-gathered.

To fix the bug, I changed the previous line to:

and everything is working fine now. Let me know if you'd like me to submit a PR.
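The reporter's one-line change is elided above. As a hedged sketch of the idea (the helper name and structure are ours, not the actual stable-baselines code), the local per-iteration timestep count can be summed across workers with an MPI allreduce, so every worker advances timesteps_so_far by the same global amount and anneals its learning rate identically:

```python
# Sketch under stated assumptions: sum per-worker timestep counts across MPI
# so that learning-rate annealing uses a counter that agrees on all workers.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
except ImportError:  # fall back to single-process behaviour
    comm = None

def global_timesteps(local_timesteps, comm=comm):
    """Return this iteration's timestep count summed over all MPI workers."""
    if comm is None or comm.Get_size() == 1:
        return local_timesteps
    return comm.allreduce(local_timesteps, op=MPI.SUM)

# Every worker would then do, e.g.:
#   timesteps_so_far += global_timesteps(local_count)
# instead of advancing the counter from its local rollout alone.
```

With identical counters, the annealed learning rate is identical on every worker, so identical updates keep the parameters bit-identical and check_synced() passes.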
Code and tests updated, closing.