[Question] PPO exhausts memory
Important Note: We do not do technical support, nor consulting and don't answer personal questions per email. Please post your question on the RL Discord, Reddit or Stack Overflow in that case.
Question
I'm using PPO with a robotics library called iGibson. Here's the sample code I'm having trouble with:
```python
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv, VecMonitor

num_environments = 8
env = SubprocVecEnv([make_env(i) for i in range(num_environments)])
env = VecMonitor(env)
...
model = PPO("MultiInputPolicy", env, verbose=1,
            tensorboard_log=tensorboard_log_dir, policy_kwargs=policy_kwargs)
...
model.learn(total_timesteps=1_000_000)
```
After the first iteration, once the rollout information is printed, the process tries to allocate so much memory that my 64 GB of RAM plus 100 GB of swap are exhausted, and it gets killed by the OOM daemon.
I noticed that decreasing `n_steps` mitigates this issue (see the sketch below), but then training does not converge and the resulting model is of poor quality. Reducing the number of parallel environments also helps, but that is not a good idea for PPO since it is on-policy training.
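For reference, a minimal sketch of that mitigation, assuming the standard PPO constructor arguments; the value 512 is just an illustrative choice and the other arguments are as in the snippet above:

```python
from stable_baselines3 import PPO

# Smaller n_steps -> smaller rollout buffer (n_steps * n_envs transitions),
# at the cost of shorter rollouts per policy update.
model = PPO(
    "MultiInputPolicy",
    env,
    n_steps=512,  # illustrative value; the default is 2048
    verbose=1,
    tensorboard_log=tensorboard_log_dir,
    policy_kwargs=policy_kwargs,
)
```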
What exactly is the code doing that exhausts so much memory? What other metrics should I look at to avoid the overwhelming memory usage?
Thank you
Update: Here is the custom environment that I use. The code is too long to paste here, so I will just leave a URL. I'm still new to the baselines library. When memory is exhausted the system hangs, so it's a little difficult to debug. My main questions are the ones in bold above. Thanks.
Update 2: In my case, the observation space consists of 2 parts: a 640x480 image from an RGB camera, and a 4-dimensional task observation including goal location, current location, etc. (this is a navigation task).
The action space is a continuous Box [-1, 1] that controls the differential drive controller of the agent (robot) to move around.
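To get a feel for why this blows up, here is a rough back-of-the-envelope sketch. It assumes SB3's default `n_steps=2048` and that the rollout buffer stores observations as float32; adjust the numbers if your configuration differs.

```python
import numpy as np

# Rough size of the image part of PPO's rollout buffer
# (assumptions: default n_steps=2048, 8 envs, float32 storage).
n_steps = 2048
n_envs = 8
obs_shape = (640, 480, 3)   # RGB camera observation
bytes_per_value = 4         # float32

buffer_bytes = n_steps * n_envs * int(np.prod(obs_shape)) * bytes_per_value
print(f"image observations alone: {buffer_bytes / 1e9:.1f} GB")  # ~60 GB
```

Under those assumptions the image observations alone account for roughly 60 GB, before counting actions, values, log-probs, or anything PyTorch itself allocates.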
Additional context
- CPU: i7-10700
- GPU: RTX A2000 12 GB
- RAM: 64 GB
- Swap: 100 GB
- Torch 1.10.2
- Stable-Baselines3 1.4.0
Checklist
- I have read the documentation (required)
- I have checked that there is no similar issue in the repo (required)
Top GitHub Comments
I found a solution to my issue, although I can't really say whether it is the same issue brought up by @genkv. It could still definitely help with memory allocated by PyTorch, though.
Thank you @Miffyli. When I noticed it also committed the same absurd amount of memory on simply calling `import torch`, I did some digging, and it's actually an issue caused by Nvidia fatbins (.nv_fatb) being loaded into memory, not by PyTorch specifically. The background is explained in this Stack Overflow answer.
The answer also provides a Python script that is intended to be run on your `Lib\site-packages\torch\lib\` directory. It scans through all DLLs matched by the input glob, and if it finds an .nv_fatb section it backs up the DLL, disables ASLR, and marks the .nv_fatb section read-only. The last important thing to note is that, according to the answer, Nvidia plans to set the .nv_fatb section to read-only in the next major CUDA release (11.7).
After I ran the Python script, `import torch` went from committing 2.8-2.9 GB of RAM to 1.1-1.2 GB, and my vectorized environments, which would each commit 2.8-2.9 GB, now only commit 1.1-1.2 GB each. Hopefully this helps somebody!
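If you want to reproduce this kind of measurement, here is a minimal sketch. It assumes the third-party `psutil` package is installed; on Windows, `memory_info().vms` is only a rough stand-in for the commit charge reported by the system.

```python
import subprocess
import sys

# Start a fresh interpreter, import torch, and report how much virtual
# memory the process ends up with (requires the psutil package).
snippet = (
    "import os, psutil, torch; "
    "print(psutil.Process(os.getpid()).memory_info().vms / 1e9, 'GB committed')"
)
subprocess.run([sys.executable, "-c", snippet], check=True)
```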
I do not know the details of "commit size", but if that includes everything Python has loaded, then a big part comes from PyTorch, which is 1-2 GB. I would guess you get the same result if you just run `import torch` (or create something small with CUDA after the import, e.g. `x = torch.rand(5).cuda()`). Yes, this is how multiprocessing works in Python in general 😃. But indeed the way processes are spawned differs between systems, and Windows has been especially tricky at times.
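One related knob (my own suggestion, not something discussed above): on Linux you can ask `SubprocVecEnv` for the `fork` start method, so worker processes reuse the parent's already-loaded PyTorch pages via copy-on-write instead of re-importing everything the way `spawn` does. `fork` is not available on Windows and can misbehave if CUDA has already been initialised in the parent process.

```python
from stable_baselines3.common.vec_env import SubprocVecEnv, VecMonitor

# Sketch: request the "fork" start method (Linux only) so each worker
# shares the parent's memory pages copy-on-write.
env = SubprocVecEnv([make_env(i) for i in range(8)], start_method="fork")
env = VecMonitor(env)
```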