
HER MPI broadcasting issue with non-Reach environments

See original GitHub issue

The following command runs fine:

time mpirun -np 8 python -m baselines.run --num_env=2 --alg=her --env=FetchReach-v1 --num_timesteps=100000 

However, if I change the environment to FetchPush-v1 or FetchPickAndPlace-v1 and run multiple MPI processes, I get the following error:

Training...
Traceback (most recent call last):
  File "/home/vitchyr/anaconda2/envs/baselines2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/vitchyr/anaconda2/envs/baselines2/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/vitchyr/git/baselines/baselines/run.py", line 246, in <module>
    main(sys.argv)
  File "/home/vitchyr/git/baselines/baselines/run.py", line 210, in main
    model, env = train(args, extra_args)
  File "/home/vitchyr/git/baselines/baselines/run.py", line 79, in train
    **alg_kwargs
  File "/home/vitchyr/git/baselines/baselines/her/her.py", line 181, in learn
    policy_save_interval=policy_save_interval, demo_file=demo_file)
  File "/home/vitchyr/git/baselines/baselines/her/her.py", line 59, in train
    logger.record_tabular(key, mpi_average(val))
  File "/home/vitchyr/git/baselines/baselines/her/her.py", line 20, in mpi_average
    return mpi_moments(np.array(value))[0]
  File "/home/vitchyr/git/baselines/baselines/common/mpi_moments.py", line 22, in mpi_moments
    mean, count = mpi_mean(x, axis=axis, comm=comm, keepdims=True)
  File "/home/vitchyr/git/baselines/baselines/common/mpi_moments.py", line 16, in mpi_mean
    comm.Allreduce(localsum, globalsum, op=MPI.SUM)
  File "mpi4py/MPI/Comm.pyx", line 714, in mpi4py.MPI.Comm.Allreduce

These different environments work for me if I run them without MPI.

I am using Anaconda; my Python version is 3.6.2, and this is the output of pip freeze:

absl-py==0.7.0
astor==0.7.1
baselines==0.1.5
certifi==2018.11.29
cffi==1.12.2
chardet==3.0.4
Click==7.0
cloudpickle==0.8.0
Cython==0.29.6
dill==0.2.9
future==0.17.1
gast==0.2.2
glfw==1.7.1
grpcio==1.19.0
gym==0.12.0
h5py==2.9.0
idna==2.8
imageio==2.5.0
joblib==0.13.2
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
lockfile==0.12.2
Markdown==3.0.1
mock==2.0.0
mpi4py==3.0.1
mujoco-py==1.50.1.59
numpy==1.16.2
opencv-python==4.0.0.21
pbr==5.1.3
Pillow==5.4.1
progressbar2==3.39.2
protobuf==3.7.0
pycparser==2.19
pyglet==1.3.2
python-utils==2.3.0
requests==2.21.0
scipy==1.2.1
six==1.12.0
tensorboard==1.13.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
tqdm==4.31.1
urllib3==1.24.1
Werkzeug==0.14.1

Issue Analytics

  • State: open
  • Created 5 years ago
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

2 reactions
wecacuee commented, Apr 20, 2019

It seems like an issue related to x.dtype. The workers are probably, for some reason, producing arrays with different x.dtypes, and MPI is unable to reconcile them. A workaround is to force the dtype of localsum (line 12 of mpi_moments.py) to np.float64. The right fix would be to figure out why the dtypes differ in the first place.
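The suggested workaround can be sketched without MPI at all. Below, local_sum_buffer is a hypothetical helper that mirrors the flat [sums..., count] buffer that baselines' mpi_mean hands to Allreduce; the point is that forcing np.float64, instead of inheriting x.dtype, makes every rank contribute an identically typed buffer:

```python
import numpy as np

def local_sum_buffer(x, force_float64=True):
    """Build the flat [per-column sums..., row count] buffer that an
    mpi_mean-style reduction would pass to MPI Allreduce.

    With force_float64=True the buffer dtype no longer depends on x.dtype,
    which is the workaround described above (hypothetical helper, not the
    actual baselines function).
    """
    x = np.asarray(x)
    xsum = x.sum(axis=0)
    n = xsum.size
    dtype = np.float64 if force_float64 else x.dtype
    buf = np.zeros(n + 1, dtype)
    buf[:n] = xsum.ravel()
    buf[n] = x.shape[0]  # row count, used later to divide the global sum
    return buf

# Two "ranks" that happened to build arrays with different dtypes:
a = local_sum_buffer(np.array([[1, 2]], dtype=np.float32))
b = local_sum_buffer(np.array([[3, 4]], dtype=np.float64))
assert a.dtype == b.dtype == np.float64  # buffers now match, so a sum reduction is well defined
```

Without the forced dtype, one rank would contribute a float32 buffer and another a float64 buffer, which is exactly the kind of mismatch that makes comm.Allreduce fail.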

0 reactions
keshaviyengar commented, Jul 8, 2019

Does time mpirun -np 8 mean num_cpu=8? And if I want to use MPI, must the mpirun command be added?

Yes, that’s how it works. Check the MPI documentation on mpirun for details, but you do need to prepend it to use MPI.
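To make the division of labor concrete (a sketch based on the command from the original report; the per-process reading of --num_env is my assumption about the baselines CLI, not something stated in this thread):

```
# -np 8        -> 8 MPI worker processes (this is what replaces num_cpu)
# --num_env=2  -> vectorized environments inside each worker process
time mpirun -np 8 python -m baselines.run --alg=her --env=FetchReach-v1 \
    --num_env=2 --num_timesteps=100000
```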


Top Results From Across the Web

MPI_Bcast not receiving broadcast data · Issue #9659 - GitHub
Our filesystem is NFS. Details of the problem. My MPI job consists of a single master process and a number manager processes, each...

MPI Broadcast and Collective Communication - MPI Tutorial
Broadcasting with MPI_Bcast. One of the main uses of broadcasting is to send out user input to a parallel program, or send out...

Error in MPI broadcast - Stack Overflow
There are several errors in your program. First, row_Ranks is declared with one element less and when writing to it, you possibly overwrite...

Tutorial - 1.60.0 - Boost C++ Libraries
A Boost.MPI program consists of many cooperating processes (possibly running on different computers) that communicate among themselves by passing messages.

I_MPI_ADJUST Family Environment Variables - Intel
This Developer Reference provides you with the complete reference for the Intel(R) MPI Library.
