
[rllib] Distributed learning constantly crashing due to memory

See original GitHub issue

Hi, I keep getting memory-error crashes. This happens both with my own implementations and with the pre-defined rl-experiments. I have 16 GB of RAM available:

ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node Local is used (15.98 / 16.82 GB). The top 5 memory consumers are:

PID	MEM	COMMAND
4758	7.3GB	ray_ImpalaTrainer:train()
5113	4.5GB	ray_PolicyEvaluator
4753	4.5GB	ray_PolicyEvaluator:apply()
4761	4.49GB	ray_PolicyEvaluator:apply()
4756	4.49GB	ray_PolicyEvaluator:apply()

In addition, ~4.24 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

How can I fix this?

I’m also surprised by how much memory each of the workers uses. In my own implementation I made sure it never holds much memory, so why does the RAM usage of the individual workers keep growing over time?

Note 1: I did, of course, try reducing `object_store_memory` to something smaller, but the workers’ RAM usage eventually grows too large and it crashes anyway.

Note 2: For the output above I was using the `rllib train -f pong-speedrun/pong-impala-fast.yaml` example, adapted to run with 8 workers.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, May 29, 2019

That’s actually expected unless you restrict the object store size. It seems you started off with 7 GB of memory already used, which explains why you run out (by default, Ray assumes it can allocate 100% of the machine’s memory to itself). Can you reproduce this with the object store memory limited to, say, 500 MB?
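(As an illustrative sketch that is not from the original thread: the object store size is set when starting Ray, via the parameters the error message above mentions. The byte values below are placeholders, not recommendations.)

import ray

# Cap the shared-memory object store at roughly 500 MB, as suggested above.
# redis_max_memory caps the Redis shards Ray uses for metadata; both values
# are given in bytes, and the exact numbers here are only illustrative.
ray.init(
    object_store_memory=500 * 1024 * 1024,
    redis_max_memory=1024 * 1024 * 1024,
)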

Also note that RLlib uses `rllib.utils.ray_get_and_free()` to optimize its internal memory usage by explicitly freeing memory, which isn’t done in this example.
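(For illustration only: a minimal sketch of calling that helper in place of a plain ray.get. The `ray.rllib.utils.memory` import path and the toy `produce` task are assumptions based on Ray 0.7-era sources and may not match other versions.)

import numpy as np
import ray
# Assumed import path for the helper mentioned above; it may live elsewhere
# in other Ray releases.
from ray.rllib.utils.memory import ray_get_and_free

ray.init()

@ray.remote
def produce():
    # A reasonably large result, so the returned objects actually land in
    # the shared-memory object store.
    return np.zeros(10**6)

# Behaves like ray.get, but also asks Ray to free the underlying objects
# explicitly instead of waiting for LRU eviction to reclaim them.
results = ray_get_and_free([produce.remote() for _ in range(8)])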

Eric

On Tue, May 28, 2019 at 4:45 PM NikEyX notifications@github.com wrote:

I managed to create a small script for you that successfully replicates the issue and doesn’t make any use of TensorFlow or PyTorch:

import psutil
import numpy as np
import ray
import time

PARAMETER_SIZE = 100000

ray.init()

@ray.remote
class ParameterServer(object):
    def __init__(self):
        self.params = np.zeros(PARAMETER_SIZE)

    def get_params(self):
        return self.params

    def update_params(self, grad):
        self.params += grad

@ray.remote
class Worker():
    def __init__(self, ps):
        self.ps = ps

    def RunEpisode(self):
        # we don't need the parameters, this is just to illustrate
        parameters = ray.get(self.ps.get_params.remote())

        grad = np.random.randn(PARAMETER_SIZE)

        self.ps.update_params.remote(grad)

parameter_server = ParameterServer.remote()
workers = [Worker.remote(parameter_server) for _ in range(8)]

lastMemory = 0
lastPrint = 0
while True:
    runEpisodes = [worker.RunEpisode.remote() for worker in workers]

    ray.get(runEpisodes)
    ray.get(parameter_server.get_params.remote())

    if (time.time() - 1 > lastPrint):
        total_gb = psutil.virtual_memory().total / 1e9
        used_gb = total_gb - psutil.virtual_memory().available / 1e9

        print(f'Memory usage is {used_gb:.4f} / {total_gb:.4f} = {used_gb - lastMemory:+.4f}')

        lastMemory = used_gb
        lastPrint = time.time()
The output is as follows:

Memory usage is 7.1262 / 16.8164 = +7.1262
Memory usage is 7.8350 / 16.8164 = +0.7088
Memory usage is 8.5215 / 16.8164 = +0.6864
Memory usage is 9.1583 / 16.8164 = +0.6369
Memory usage is 9.7822 / 16.8164 = +0.6239
Memory usage is 10.3920 / 16.8164 = +0.6098
Memory usage is 11.0299 / 16.8164 = +0.6379
Memory usage is 11.7228 / 16.8164 = +0.6929
Memory usage is 12.2452 / 16.8164 = +0.5224
Memory usage is 12.2513 / 16.8164 = +0.0061
Memory usage is 12.2615 / 16.8164 = +0.0103
Memory usage is 12.2662 / 16.8164 = +0.0046
Memory usage is 12.2719 / 16.8164 = +0.0058
Memory usage is 12.2768 / 16.8164 = +0.0049
Memory usage is 12.2807 / 16.8164 = +0.0039
…
Memory usage is 15.9316 / 16.8164 = +0.0005
Memory usage is 15.9335 / 16.8164 = +0.0019
Memory usage is 15.9373 / 16.8164 = +0.0038
Memory usage is 15.9401 / 16.8164 = +0.0028
Memory usage is 15.9436 / 16.8164 = +0.0035
Memory usage is 15.9476 / 16.8164 = +0.0040
Memory usage is 15.9499 / 16.8164 = +0.0024
Memory usage is 15.9558 / 16.8164 = +0.0059
Memory usage is 15.9603 / 16.8164 = +0.0045
Memory usage is 15.9670 / 16.8164 = +0.0068
Memory usage is 15.9724 / 16.8164 = +0.0054
Memory usage is 15.9708 / 16.8164 = -0.0016
Memory usage is 15.9756 / 16.8164 = +0.0048
Traceback (most recent call last):
  File "/home/XXXXX/Projects/memoryTest.py", line 45, in <module>
    ray.get(runEpisodes)
  File "/home/XXXXX/.local/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayTaskError: ray_Worker:RunEpisode() (pid=23262, host=Local)
  File "/home/XXXXX/.local/lib/python3.7/site-packages/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node Local is used (15.98 / 16.82 GB). The top 5 memory consumers are:

PID	MEM	COMMAND
23256	5.22GB	ray_ParameterServer
23249	5.2GB	ray_Worker
23259	5.2GB	ray_Worker:RunEpisode()
23254	5.2GB	ray_Worker
23250	5.2GB	ray_Worker:RunEpisode()

In addition, ~5.3 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray, and the max Redis size with redis_max_memory.

Can you confirm this happens to you as well? All memory should be released after RunEpisode has executed, so why does the memory keep on increasing?


0 reactions
NikEyX commented, May 29, 2019

Thank you

