HER MPI broadcasting issue with non-Reach environments
The following command runs fine:
time mpirun -np 8 python -m baselines.run --num_env=2 --alg=her --env=FetchReach-v1 --num_timesteps=100000
However, if I change the environment to FetchPush-v1 or FetchPickAndPlace-v1, I get the following error when trying to run multiple MPI processes:
Training...
Traceback (most recent call last):
File "/home/vitchyr/anaconda2/envs/baselines2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/vitchyr/anaconda2/envs/baselines2/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/vitchyr/git/baselines/baselines/run.py", line 246, in <module>
main(sys.argv)
File "/home/vitchyr/git/baselines/baselines/run.py", line 210, in main
model, env = train(args, extra_args)
File "/home/vitchyr/git/baselines/baselines/run.py", line 79, in train
**alg_kwargs
File "/home/vitchyr/git/baselines/baselines/her/her.py", line 181, in learn
policy_save_interval=policy_save_interval, demo_file=demo_file)
File "/home/vitchyr/git/baselines/baselines/her/her.py", line 59, in train
logger.record_tabular(key, mpi_average(val))
File "/home/vitchyr/git/baselines/baselines/her/her.py", line 20, in mpi_average
return mpi_moments(np.array(value))[0]
File "/home/vitchyr/git/baselines/baselines/common/mpi_moments.py", line 22, in mpi_moments
mean, count = mpi_mean(x, axis=axis, comm=comm, keepdims=True)
File "/home/vitchyr/git/baselines/baselines/common/mpi_moments.py", line 16, in mpi_mean
comm.Allreduce(localsum, globalsum, op=MPI.SUM)
File "mpi4py/MPI/Comm.pyx", line 714, in mpi4py.MPI.Comm.Allreduce
These different environments work for me if I run them without MPI.
I am using Anaconda. My Python version is 3.6.2, and this is the output of pip freeze:
absl-py==0.7.0
astor==0.7.1
baselines==0.1.5
certifi==2018.11.29
cffi==1.12.2
chardet==3.0.4
Click==7.0
cloudpickle==0.8.0
Cython==0.29.6
dill==0.2.9
future==0.17.1
gast==0.2.2
glfw==1.7.1
grpcio==1.19.0
gym==0.12.0
h5py==2.9.0
idna==2.8
imageio==2.5.0
joblib==0.13.2
Keras-Applications==1.0.7
Keras-Preprocessing==1.0.9
lockfile==0.12.2
Markdown==3.0.1
mock==2.0.0
mpi4py==3.0.1
mujoco-py==1.50.1.59
numpy==1.16.2
opencv-python==4.0.0.21
pbr==5.1.3
Pillow==5.4.1
progressbar2==3.39.2
protobuf==3.7.0
pycparser==2.19
pyglet==1.3.2
python-utils==2.3.0
requests==2.21.0
scipy==1.2.1
six==1.12.0
tensorboard==1.13.0
tensorflow==1.13.1
tensorflow-estimator==1.13.0
termcolor==1.1.0
tqdm==4.31.1
urllib3==1.24.1
Werkzeug==0.14.1
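The failing call in the traceback above is the comm.Allreduce(localsum, globalsum, op=MPI.SUM) inside mpi_mean. A quick way to check whether the ranks disagree about the buffer they pass to that call is to reproduce the reduction in isolation with a few lines of mpi4py and print each rank's dtype. This is a hypothetical diagnostic script (dtype_check.py is not part of baselines), assuming only the same buffer-based Allreduce pattern:

# dtype_check.py -- hypothetical diagnostic, not part of baselines.
# Run with: mpirun -np 8 python dtype_check.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Mimic mpi_mean's buffer: partial sums plus a count, reduced with MPI.SUM.
localsum = np.array([float(rank), 1.0])
globalsum = np.zeros_like(localsum)

print("rank %d: localsum.dtype = %s" % (rank, localsum.dtype))
comm.Allreduce(localsum, globalsum, op=MPI.SUM)

if rank == 0:
    print("global mean:", globalsum[0] / globalsum[1])

If every rank prints the same dtype here, the mismatch in the HER run is more likely introduced by the values being logged in the training loop than by mpi4py itself.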
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
It seems like an issue related to x.dtype. Probably the workers are for some reason generating a different x.dtype, and MPI is not able to reconcile them. A workaround is to force the dtype of localsum in line 12 to be np.float64. The right fix would be to figure out why the dtypes differ.

Yes, that's how it works. Check out the MPI documentation about mpirun; you need to add it to use MPI.
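A minimal sketch of that workaround, assuming mpi_mean builds a buffer of partial sums plus an element count before the Allreduce (the function name mpi_mean_f64 and the exact structure are illustrative, not a verbatim copy of baselines/common/mpi_moments.py):

import numpy as np
from mpi4py import MPI

def mpi_mean_f64(x, axis=0, comm=None, keepdims=False):
    # Sketch of the suggested workaround: pin the reduction buffer to float64
    # so every rank agrees on the dtype, whatever dtype x arrived with.
    x = np.asarray(x)
    assert x.ndim > 0
    if comm is None:
        comm = MPI.COMM_WORLD
    xsum = x.sum(axis=axis, keepdims=keepdims)
    n = xsum.size
    localsum = np.zeros(n + 1, dtype=np.float64)   # forced dtype, the workaround
    localsum[:n] = xsum.ravel()
    localsum[n] = x.shape[axis]                    # count along the reduced axis
    globalsum = np.zeros_like(localsum)
    comm.Allreduce(localsum, globalsum, op=MPI.SUM)
    return globalsum[:n].reshape(xsum.shape) / globalsum[n], globalsum[n]

Pinning the dtype only sidesteps the mismatch; as the comment notes, the real fix is to find out why the workers end up with different dtypes in the first place.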