`SubprocVecEnv` speedup does not scale linearly compared with `DummyVecEnv`
I made a toy benchmark by creating 16 environments with both `SubprocVecEnv` and `DummyVecEnv`, and collected 1000 time steps by first resetting the environments and then feeding random actions sampled from the action space inside a for loop.
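Roughly, the benchmark loop looks like this (a minimal sketch, assuming baselines' `DummyVecEnv`/`SubprocVecEnv` import paths; my actual script may differ in details):

```python
import time

import gym
from baselines.common.vec_env.dummy_vec_env import DummyVecEnv
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv


def benchmark(vec_env_cls, env_id="HalfCheetah-v2", num_envs=16, num_steps=1000):
    # Build num_envs copies of the environment, reset once, then step
    # with random actions for num_steps iterations.
    venv = vec_env_cls([lambda: gym.make(env_id) for _ in range(num_envs)])
    venv.reset()
    start = time.perf_counter()
    for _ in range(num_steps):
        actions = [venv.action_space.sample() for _ in range(num_envs)]
        venv.step(actions)
    elapsed = time.perf_counter() - start
    venv.close()
    return elapsed


if __name__ == "__main__":
    t_dummy = benchmark(DummyVecEnv)
    t_subproc = benchmark(SubprocVecEnv)
    print("DummyVecEnv: %.2fs, SubprocVecEnv: %.2fs, speedup: %.2fx"
          % (t_dummy, t_subproc, t_dummy / t_subproc))
```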
It turns out the per-step cost of the simulator is crucial for the total speedup. For example, `HalfCheetah-v2` is roughly 1.5-2x faster and `FetchPush-v1` can be 7-9x faster. I guess this depends on the dynamics, since the cheetah simulation is simpler.
For classic control environments like `CartPole-v1`, it seems much better to use `DummyVecEnv`, since the speedup is only ~0.2x, i.e. `SubprocVecEnv` is 5x slower than `DummyVecEnv`.
Is it feasible to push the speedup further so that it scales approximately linearly with the number of environments? Or is the main bottleneck the communication overhead of running each environment in its own `Process`?
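For reference, one way to estimate the pure `Process`/pipe overhead, independent of any environment, is to time an empty round-trip through a `multiprocessing.Pipe` (a minimal sketch, not part of the original benchmark). If this latency is comparable to a single `env.step()` call, as it seems to be for `CartPole-v1`, most of the parallel speedup is eaten by communication:

```python
import multiprocessing as mp
import time


def echo_worker(remote):
    # Echo every message back so the parent can time a full send/recv round-trip.
    while True:
        msg = remote.recv()
        if msg is None:
            break
        remote.send(msg)


if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=echo_worker, args=(child,))
    proc.start()

    n = 10000
    start = time.perf_counter()
    for _ in range(n):
        parent.send(0)
        parent.recv()
    per_roundtrip = (time.perf_counter() - start) / n
    print("pipe round-trip: %.1f us" % (per_roundtrip * 1e6))

    parent.send(None)
    proc.join()
```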
Top GitHub Comments
There is an additional benchmark on some MuJoCo environments (tested on a DGX-1).
Chunks of sub-environments per process instead of one per process is a great idea! I'd be very interested to see the results of that (how much faster the `venv.step()` method becomes with different types of sub-environments). asyncio is also a good one, although I'd rather keep things compatible with python 3.6 and below. Anyways, if you feel like implementing any of these, do not let me stop you from submitting a PR 😃

To the MPI vs multiprocessing question: the VecEnv configuration (master process updating the neural net, subprocesses running `env.step`) is especially beneficial for conv nets and atari-like envs, because then the updates from relatively large batches can be computed on a GPU fast (much faster than if every process were to run gradient computation on its own; several processes actively interacting with a GPU is usually not a great setup). In principle, the same communication pattern can be done with MPI, but it is a little more involved, and requires MPI to be installed.

On the other hand, in mujoco-like environments (when using non-pixel observations, i.e. positions and velocities of joints), neural nets are relatively small, so batching data to compute the update does not give much of a speed-up; on the other hand, with MPI you can actually run the experiment on a distributed machine. That's why, for instance, HER uses MPI over SubprocVecEnv. For TRPO and PPO1 the choice could have gone either way; in fact, PPO2 can use both. I don't know the relative latency of MPI communication versus pipes; I suspect they should be similar, but I have never measured it or seen measurements.
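As a rough illustration of the "chunks of sub-environments per process" idea, here is a hypothetical sketch (not part of baselines; the `ChunkedSubprocVecEnv` name and details are invented) in which each worker process steps several environments sequentially, so the pipe traffic is amortised over a whole chunk:

```python
import multiprocessing as mp

import gym
import numpy as np


def chunk_worker(remote, env_fns):
    # Each worker owns a chunk of environments and steps them sequentially,
    # so one send/recv round-trip covers several env.step() calls.
    envs = [fn() for fn in env_fns]
    while True:
        cmd, data = remote.recv()
        if cmd == "step":
            results = []
            for env, action in zip(envs, data):
                obs, rew, done, info = env.step(action)
                if done:
                    obs = env.reset()
                results.append((obs, rew, done, info))
            remote.send(results)
        elif cmd == "reset":
            remote.send([env.reset() for env in envs])
        elif cmd == "close":
            for env in envs:
                env.close()
            remote.close()
            break


class ChunkedSubprocVecEnv:
    """Hypothetical sketch: n_workers processes, each stepping a chunk of envs.

    Note: with spawn-based start methods the env_fns would need cloudpickle,
    as baselines' SubprocVecEnv does; this sketch assumes fork.
    """

    def __init__(self, env_fns, n_workers=4):
        chunks = [list(c) for c in np.array_split(env_fns, n_workers)]
        self.chunk_sizes = [len(c) for c in chunks]
        self.remotes, self.procs = [], []
        for chunk in chunks:
            parent, child = mp.Pipe()
            proc = mp.Process(target=chunk_worker, args=(child, chunk), daemon=True)
            proc.start()
            self.remotes.append(parent)
            self.procs.append(proc)

    def reset(self):
        for remote in self.remotes:
            remote.send(("reset", None))
        return np.concatenate([np.asarray(remote.recv()) for remote in self.remotes])

    def step(self, actions):
        # Split the action batch by chunk, send one message per worker,
        # then gather and re-stack the per-env results.
        start = 0
        for remote, size in zip(self.remotes, self.chunk_sizes):
            remote.send(("step", actions[start:start + size]))
            start += size
        results = [r for remote in self.remotes for r in remote.recv()]
        obs, rews, dones, infos = zip(*results)
        return np.stack(obs), np.array(rews), np.array(dones), list(infos)

    def close(self):
        for remote in self.remotes:
            remote.send(("close", None))
        for proc in self.procs:
            proc.join()


if __name__ == "__main__":
    venv = ChunkedSubprocVecEnv([lambda: gym.make("CartPole-v1") for _ in range(16)],
                                n_workers=4)
    venv.reset()
    obs, rews, dones, infos = venv.step([0] * 16)
    venv.close()
```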