MPIAdam synchronization error in PPO1

Describe the bug

A simple run of PPO1 under MPI crashes: the assertion thetaroot == thetalocal in MpiAdam.check_synced fails. NaNs are not the cause; the parameter values themselves differ between workers. This doesn’t happen in baselines.

Code example

Minimal reproducible example:

import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import PPO1

env = gym.make("CartPole-v1")
model = PPO1(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

System Info

  • Installed from source in virtual environment
  • No GPU
  • Python 3.6.5
  • mpi4py==3.0.0
  • tensorflow==1.8.0
  • Open MPI 3.1.1
  • commit 4983566292a5d3ae0ed1a6bce84a8ac8278e3de5

Stdout + Traceback

(venv) petersen33md:runs petersen33md$ mpirun -n 2 python ppo1_test.py 
********** Iteration 0 ************

********** Iteration 6 ************
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
Optimizing...
     pol_surr |    pol_entpen |       vf_loss |            kl |           ent
     -0.00082 |      -0.00627 |     117.56442 |      8.56e-05 |       0.62709
     -0.00030 |      -0.00630 |     128.11664 |      7.79e-05 |       0.63015
Traceback (most recent call last):
  File "ppo1_test.py", line 160, in <module>
    model.learn(total_timesteps=10000, callback=callback)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/ppo1/pposgd_simple.py", line 272, in learn
    self.adam.update(grad, self.optim_stepsize * cur_lrmult)
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 48, in update
    self.check_synced()
  File "/Users/petersen33/repositories/stable-baselines/stable_baselines/common/mpi_adam.py", line 83, in check_synced
    assert (thetaroot == thetalocal).all(), (thetaroot, thetalocal)
AssertionError: (array([ 0.04382617, -0.0679653 , -0.11690815, ...,  0.00065254,
        0.        ,  0.        ], dtype=float32), array([ 0.04383327, -0.06797152, -0.11691316, ...,  0.00065254,
        0.        ,  0.        ], dtype=float32))

Top GitHub Comments

brendenpetersen commented, Oct 5, 2018

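Update: I spotted the bug. The error disappears when PPO1 is run with schedule="constant" (the default value is "linear"). With the linear schedule, the learning rate is annealed based on the value of timesteps_so_far, so workers that disagree on timesteps_so_far apply different step sizes and their parameters drift apart.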
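To illustrate why a per-worker timesteps_so_far matters under the linear schedule, here is a minimal sketch. It assumes the multiplier has the form max(1 - timesteps_so_far / total_timesteps, 0) and uses a plain gradient step in place of Adam. The function names and the per-worker counts are hypothetical and are not taken from stable-baselines.

```python
def linear_lr_multiplier(timesteps_so_far, total_timesteps):
    """Assumed form of the linear annealing multiplier."""
    return max(1.0 - float(timesteps_so_far) / total_timesteps, 0.0)


def gradient_step(theta, grad, stepsize):
    """Plain gradient step, standing in for the Adam update."""
    return [t - stepsize * g for t, g in zip(theta, grad)]


total_timesteps = 10000
base_stepsize = 1e-3

# Both workers start from identical parameters and see the same
# (already averaged) gradient.
theta = [0.5, -0.25]
grad = [1.0, 2.0]

# Hypothetical local timestep counters that have drifted apart.
worker_timesteps = [1536, 1500]

# schedule="linear": each worker derives its own multiplier.
multipliers = [linear_lr_multiplier(t, total_timesteps) for t in worker_timesteps]
linear_thetas = [gradient_step(theta, grad, base_stepsize * m) for m in multipliers]

# schedule="constant": the multiplier is always 1.0 on every worker.
constant_thetas = [gradient_step(theta, grad, base_stepsize * 1.0) for _ in worker_timesteps]

print("linear schedule in sync:  ", linear_thetas[0] == linear_thetas[1])
print("constant schedule in sync:", constant_thetas[0] == constant_thetas[1])
```

With the linear schedule the two parameter vectors no longer match exactly, which is the condition an exact-equality check such as thetaroot == thetalocal rejects.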

In baselines, timesteps_so_far is calculated by MPI-gathering episodes across all workers. Relevant baselines code here:

lrlocal = (seg["ep_lens"], seg["ep_rets"]) # local values
listoflrpairs = MPI.COMM_WORLD.allgather(lrlocal) # list of tuples
lens, rews = map(flatten_lists, zip(*listoflrpairs))
...
timesteps_so_far += sum(lens)

However, in stable-baselines, timesteps_so_far is based on the current worker only (which apparently can differ):

timesteps_so_far += seg["total_timestep"]

The "total_timestep" key (which isn’t in baselines) was added at some point to avoid the “mean of an empty slice” warning when no episodes had completed, but the local values were never MPI-gathered.

To fix the bug, I changed the previous line to:

timesteps_so_far += sum(MPI.COMM_WORLD.allgather(seg["total_timestep"]))

and everything is working fine now. Let me know if you’d like me to submit a PR.
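The effect of gathering before summing can be sketched without MPI. The snippet below is an illustration only: simulated_allgather is a hypothetical stand-in for MPI.COMM_WORLD.allgather (every worker receives the list of all workers' local values), and the local counts are made up.

```python
def simulated_allgather(local_values):
    """Hypothetical stand-in for an allgather: every worker receives
    the full list of local values, in rank order."""
    return [list(local_values) for _ in local_values]


# Made-up per-worker values of seg["total_timestep"] for one iteration.
local_counts = [256, 250]

# Before the fix: each worker adds only its own count.
local_increments = list(local_counts)

# After the fix: each worker adds the sum over all workers.
gathered = simulated_allgather(local_counts)
global_increments = [sum(values) for values in gathered]

print("local increments: ", local_increments)
print("global increments:", global_increments)
```

Because every worker sums the same gathered list, timesteps_so_far advances by the same amount everywhere, so the annealed step size stays identical across workers.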

araffin commented, Oct 12, 2018

Code and tests updated, closing.
