[Question] VecEnv GPU optimizations
See original GitHub issue
Question
Are the vector envs in stable-baselines3 GPU-optimizable? I note that a model's parameters can be loaded into GPU memory via the device
attribute. However, during training, the tensors passed between the policy and the env undergo GPU <-> CPU as well as PyTorch <-> NumPy conversions.
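For reference, the device argument is the standard way of placing the policy on the GPU in stable-baselines3; the snippet below is just a minimal illustration of that usage (the env itself still exchanges NumPy arrays on the CPU):
```python
from stable_baselines3 import PPO

# The policy parameters live on the GPU thanks to the `device` argument...
model = PPO("MlpPolicy", "CartPole-v1", device="cuda")
# ...but every env.step() during learn() still crosses the
# GPU <-> CPU and torch <-> NumPy boundary.
model.learn(total_timesteps=1_000)
```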
Additional context
For example in OnPolicyAlgorithm.collect_rollouts():
```python
with th.no_grad():
    # Convert to pytorch tensor
    obs_tensor = th.as_tensor(self._last_obs).to(self.device)
    actions, values, log_probs = self.policy.forward(obs_tensor)
actions = actions.cpu().numpy()  # <--

# Rescale and perform action
clipped_actions = actions
# Clip the actions to avoid out of bound error
if isinstance(self.action_space, gym.spaces.Box):
    clipped_actions = np.clip(actions, self.action_space.low, self.action_space.high)

new_obs, rewards, dones, infos = env.step(clipped_actions)
```
the actions tensor is moved off the GPU and converted to a NumPy array. It seems that if there were a VecEnv
that supported tensors, this step could be forgone and the data could stay on the CUDA device, unless I am misinterpreting something.
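To make the idea concrete, here is a minimal, hypothetical sketch (not part of stable-baselines3) of a VecEnv-like class whose step() and reset() exchange torch tensors, so the rollout never calls .cpu().numpy(); the class name, placeholder dynamics, and the toy policy are illustrative assumptions:
```python
import torch as th


class TensorVecEnv:
    """Hypothetical batched env whose state lives entirely on one device."""

    def __init__(self, num_envs: int, obs_dim: int, device: str = "cuda"):
        self.num_envs = num_envs
        self.device = th.device(device)
        self.obs = th.zeros(num_envs, obs_dim, device=self.device)

    def reset(self) -> th.Tensor:
        self.obs.zero_()
        return self.obs

    def step(self, actions: th.Tensor):
        # Placeholder dynamics: a real env would apply batched tensor ops here.
        self.obs = self.obs + 0.01 * actions
        rewards = th.ones(self.num_envs, device=self.device)
        dones = th.zeros(self.num_envs, dtype=th.bool, device=self.device)
        return self.obs, rewards, dones, [{} for _ in range(self.num_envs)]


if th.cuda.is_available():
    env = TensorVecEnv(num_envs=64, obs_dim=4)
    policy = th.nn.Linear(4, 4).to(env.device)  # stand-in for an SB3 policy
    obs = env.reset()
    with th.no_grad():
        actions = policy(obs)
    # th.clamp replaces np.clip, so the actions never leave the GPU
    clipped_actions = th.clamp(actions, -1.0, 1.0)
    new_obs, rewards, dones, infos = env.step(clipped_actions)
```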
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Issue Analytics
- State: closed
- Created 3 years ago
- Comments: 9 (1 by maintainers)
Top Results From Across the Web
- The 37 Implementation Details of Proximal Policy Optimization: "According to a GitHub issue, one maintainer suggests ppo2 should offer better GPU utilization by batching observations from multiple simulation ..."
- FAQ — ElegantRL 0.3.1 documentation: "This document contains the most frequently asked questions related ... GPU_ids to None (you cannot use GPU-accelerated VecEnv in this case)."
- Question about Vectorized Environments and GPU/Cuda training: "I have a bunch of questions that I can't seem to find any good answers to, ..."
- Stable Baselines Documentation - Read the Docs: "A best practice when you apply RL to a new problem is to do automatic hyperparameter optimization. Again, this is included in the ..."
- VectorEnv API — Ray 2.2.0: "rllib.env.vector_env.VectorEnv# · make_env – Factory that produces a new gym. · existing_envs – Optional list of already instantiated sub environments. · num_envs ..."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Wow! Thanks all for the discussion. Seems like I’ve just uncovered the tip of the iceberg 🤓
Heeding the advice, I decided to publish my code to a fork here for any curious future readers. No doubt I will be continuing the discussion in other places, but will close the issue from here for now 👍
Thanks for the pointers! I managed to get some benchmarks. I trained the PPO model on a custom VecEnv version of CartPole that vectorizes the step() and reset() methods, which eliminated the loop in VecEnv.step_wait(). Then, as you mentioned, there were some modifications to the rollout buffers to store tensors and compute the advantage, as well as numerous conversions from numpy to torch operations throughout the codebase to support it. I timed the execution of PPO.learn() with a high number of environments across a few different batch sizes (all runs were done on an NVIDIA TITAN RTX). The speed-up is a nice result, but these particular hyperparameters may be uncommon (particularly for CartPole). With a smaller number of parallel environments the optimization is not as profound, but still quicker. The average reward of the PPO was roughly the same.
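As a rough illustration only (not the commenter's actual fork), a batched CartPole step() written with torch tensors might look like the sketch below; the constants mirror the classic Gym CartPole dynamics, while the function name and the reset-on-done behaviour are assumptions:
```python
import math

import torch as th

# Classic CartPole constants (as in Gym's cartpole.py)
GRAVITY, MASSCART, MASSPOLE, HALF_LENGTH = 9.8, 1.0, 0.1, 0.5
TOTAL_MASS = MASSCART + MASSPOLE
POLEMASS_LENGTH = MASSPOLE * HALF_LENGTH
FORCE_MAG, TAU = 10.0, 0.02
X_THRESHOLD, THETA_THRESHOLD = 2.4, 12 * math.pi / 180


def batched_cartpole_step(state: th.Tensor, action: th.Tensor):
    """Steps every environment at once; `state` is (num_envs, 4) on the GPU."""
    x, x_dot, theta, theta_dot = state.unbind(dim=1)
    force = FORCE_MAG * (2.0 * action.float() - 1.0)  # discrete action in {0, 1}
    costheta, sintheta = th.cos(theta), th.sin(theta)
    temp = (force + POLEMASS_LENGTH * theta_dot ** 2 * sintheta) / TOTAL_MASS
    thetaacc = (GRAVITY * sintheta - costheta * temp) / (
        HALF_LENGTH * (4.0 / 3.0 - MASSPOLE * costheta ** 2 / TOTAL_MASS)
    )
    xacc = temp - POLEMASS_LENGTH * thetaacc * costheta / TOTAL_MASS
    # Euler integration, entirely on-device
    x = x + TAU * x_dot
    x_dot = x_dot + TAU * xacc
    theta = theta + TAU * theta_dot
    theta_dot = theta_dot + TAU * thetaacc
    new_state = th.stack([x, x_dot, theta, theta_dot], dim=1)
    done = (x.abs() > X_THRESHOLD) | (theta.abs() > THETA_THRESHOLD)
    reward = th.ones_like(x)
    # Reset finished environments without a Python loop
    new_state = th.where(done.unsqueeze(1), th.zeros_like(new_state), new_state)
    return new_state, reward, done


if th.cuda.is_available():
    states = th.zeros(4096, 4, device="cuda")
    actions = th.randint(0, 2, (4096,), device="cuda")
    states, rewards, dones = batched_cartpole_step(states, actions)
```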
The majority of the work is in building the tensor version of the environment; like you mentioned, stable-baselines may not be the place for it. But it seems like it would speed up policy development.