[Question] PPO exhausts memory
Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
Question
I'm using PPO with a robotics library called iGibson. Here's the sample code I'm having trouble with:
```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecMonitor

num_environments = 8
env = SubprocVecEnv([make_env(i) for i in range(num_environments)])
env = VecMonitor(env)
...
model = PPO("MultiInputPolicy", env, verbose=1,
            tensorboard_log=tensorboard_log_dir, policy_kwargs=policy_kwargs)
...
model.learn(total_timesteps=1_000_000)
```
After the first iteration, once the rollout information is printed, the process tries to allocate so much memory that my 64 GB of RAM plus 100 GB of swap are exhausted, and it gets killed by the OOM daemon.
I noticed that decreasing `n_steps` mitigates this issue (see the sketch below), but then training does not converge and the resulting model is of poor quality. Reducing the number of parallel environments also helps, but that is not a good idea for PPO since it is on-policy training.
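For reference, a minimal sketch of that mitigation, assuming the standard PPO constructor arguments; the value 512 is just an illustrative choice and the other arguments are as in the snippet above:

```python
from stable_baselines3 import PPO

# Smaller n_steps -> smaller rollout buffer (n_steps * n_envs transitions),
# at the cost of shorter rollouts per policy update.
model = PPO(
    "MultiInputPolicy",
    env,
    n_steps=512,  # illustrative value; the default is 2048
    verbose=1,
    tensorboard_log=tensorboard_log_dir,
    policy_kwargs=policy_kwargs,
)
```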
What exactly is the code doing that exhausts so much memory? What other metrics should I look at to avoid the overwhelming memory usage?
Thank you
Update: Here is the custom environment that I use. The code is too long to paste here, so I will just leave a URL. I'm still new to the baselines library. When memory is exhausted the system hangs, so it's a little difficult to debug. My main questions are the ones in bold above. Thanks.
Update 2: In my case, the observation space consists of 2 parts: a 640x480 image from an RGB camera, and a 4-dimensional task observation including goal location, current location, etc. (this is a navigation task).
The action space is a continuous Box [-1, 1] that controls the differential drive controller of the agent (robot) to move around.
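To get a feel for why this blows up, here is a rough back-of-the-envelope sketch. It assumes SB3's default `n_steps=2048` and that the rollout buffer stores observations as float32; adjust the numbers if your configuration differs.

```python
import numpy as np

# Rough size of the image part of PPO's rollout buffer
# (assumptions: default n_steps=2048, 8 envs, float32 storage).
n_steps = 2048
n_envs = 8
obs_shape = (640, 480, 3)   # RGB camera observation
bytes_per_value = 4         # float32

buffer_bytes = n_steps * n_envs * int(np.prod(obs_shape)) * bytes_per_value
print(f"image observations alone: {buffer_bytes / 1e9:.1f} GB")  # ~60 GB
```

Under those assumptions the image observations alone account for roughly 60 GB, before counting actions, values, log-probs, or anything PyTorch itself allocates.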
Additional context
- CPU: i7-10700
- GPU: RTX A2000 12 GB
- RAM: 64 GB
- Swap: 100 GB
- Torch 1.10.2
- Stable-Baselines3 1.4.0
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Top GitHub Comments
I found a solution to my issue, although I can't really say whether it is the same issue brought up by @genkv. It could still definitely help with memory allocated by PyTorch, though.
Thank you @Miffyli. When I noticed it also committed the same absurd amount of memory on simply calling `import torch`, I did some digging, and it's actually an issue caused by Nvidia fatbins (.nv_fatb) being loaded into memory, not by PyTorch specifically. The background is explained in this Stack Overflow answer.
The answer also provides a Python script that is intended to be run on your `Lib\site-packages\torch\lib\` directory. It scans through all DLLs matched by the input glob, and if it finds an .nv_fatb section it backs up the DLL, disables ASLR, and marks the .nv_fatb section read-only. The last important thing to note is that, according to the answer, Nvidia plans to set the .nv_fatb section to read-only in the next major CUDA release (11.7).
After I ran the Python script, `import torch` went from committing 2.8-2.9 GB of RAM to 1.1-1.2 GB, and my vectorized environments, which would each commit 2.8-2.9 GB, now only commit 1.1-1.2 GB each. Hopefully this helps somebody!
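If you want to reproduce this kind of measurement, here is a minimal sketch. It assumes the third-party `psutil` package is installed; on Windows, `memory_info().vms` is only a rough stand-in for the commit charge reported by the system.

```python
import subprocess
import sys

# Start a fresh interpreter, import torch, and report how much virtual
# memory the process ends up with (requires the psutil package).
snippet = (
    "import os, psutil, torch; "
    "print(psutil.Process(os.getpid()).memory_info().vms / 1e9, 'GB committed')"
)
subprocess.run([sys.executable, "-c", snippet], check=True)
```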
I do not know the details of "commit size", but if that includes everything Python has loaded, then a big part comes from PyTorch, which is 1-2 GB. I would guess you get the same result if you just run `import torch` (or create something small with CUDA after the import, e.g. `x = torch.rand(5).cuda()`). Yes, this is how multiprocessing works in Python in general 😃. But indeed the way processes are spawned differs between systems, and Windows has been especially tricky at times.
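One related knob (my own suggestion, not something discussed above): on Linux you can ask `SubprocVecEnv` for the `fork` start method, so worker processes reuse the parent's already-loaded PyTorch pages via copy-on-write instead of re-importing everything the way `spawn` does. `fork` is not available on Windows and can misbehave if CUDA has already been initialised in the parent process.

```python
from stable_baselines3.common.vec_env import SubprocVecEnv, VecMonitor

# Sketch: request the "fork" start method (Linux only) so each worker
# shares the parent's memory pages copy-on-write.
env = SubprocVecEnv([make_env(i) for i in range(8)], start_method="fork")
env = VecMonitor(env)
```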