
[rllib] Distributed learning constantly crashing due to memory

See original GitHub issue

Hi, I keep getting memory-error crashes. This happens both with my own implementations and with the pre-defined rl-experiments. I have 16 GB of RAM available:

ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node Local is used (15.98 / 16.82 GB). The top 5 memory consumers are:

PID	MEM	COMMAND
4758	7.3GB	ray_ImpalaTrainer:train()
5113	4.5GB	ray_PolicyEvaluator
4753	4.5GB	ray_PolicyEvaluator:apply()
4761	4.49GB	ray_PolicyEvaluator:apply()
4756	4.49GB	ray_PolicyEvaluator:apply()

In addition, ~4.24 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

How can I fix this?

I’m also surprised by how much memory each of the workers uses. In my own implementation I made sure it never holds much memory, so why does the RAM usage of the individual workers keep growing over time?

Note 1: I did, of course, try reducing `object_store_memory` to something smaller, but the workers’ RAM usage eventually grows too large and it crashes anyway.

Note 2: For the output above I was using the `rllib train -f pong-speedrun/pong-impala-fast.yaml` example, adapted to run with 8 workers.

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, May 29, 2019

That’s actually expected unless you restrict the object store size. It seems you started off with 7 GB of memory already used, which explains why you run out (by default, Ray assumes it can allocate 100% of the machine’s memory to itself). Can you reproduce this with the object store memory limited to, say, 500 MB?
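(As an illustrative sketch that is not from the original thread: the object store size is set when starting Ray, via the parameters the error message above mentions. The byte values below are placeholders, not recommendations.)

import ray

# Cap the shared-memory object store at roughly 500 MB, as suggested above.
# redis_max_memory caps the Redis shards Ray uses for metadata; both values
# are given in bytes, and the exact numbers here are only illustrative.
ray.init(
    object_store_memory=500 * 1024 * 1024,
    redis_max_memory=1024 * 1024 * 1024,
)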

Also note that RLlib uses `rllib.utils.ray_get_and_free()` to optimize its internal memory usage by explicitly freeing memory, which isn’t done in this example.
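(For illustration only: a minimal sketch of calling that helper in place of a plain ray.get. The `ray.rllib.utils.memory` import path and the toy `produce` task are assumptions based on Ray 0.7-era sources and may not match other versions.)

import numpy as np
import ray
# Assumed import path for the helper mentioned above; it may live elsewhere
# in other Ray releases.
from ray.rllib.utils.memory import ray_get_and_free

ray.init()

@ray.remote
def produce():
    # A reasonably large result, so the returned objects actually land in
    # the shared-memory object store.
    return np.zeros(10**6)

# Behaves like ray.get, but also asks Ray to free the underlying objects
# explicitly instead of waiting for LRU eviction to reclaim them.
results = ray_get_and_free([produce.remote() for _ in range(8)])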

Eric

On Tue, May 28, 2019 at 4:45 PM NikEyX notifications@github.com wrote:

I managed to create a small script for you that successfully replicates the issue and doesn’t make any use of TensorFlow or PyTorch:

import psutil
import numpy as np
import ray
import time

PARAMETER_SIZE = 100000

ray.init()

@ray.remote
class ParameterServer(object):
    def __init__(self):
        self.params = np.zeros(PARAMETER_SIZE)

    def get_params(self):
        return self.params

    def update_params(self, grad):
        self.params += grad

@ray.remote
class Worker():
    def __init__(self, ps):
        self.ps = ps

    def RunEpisode(self):
        # we don't need the parameters, this is just to illustrate
        parameters = ray.get(self.ps.get_params.remote())

        grad = np.random.randn(PARAMETER_SIZE)

        self.ps.update_params.remote(grad)

parameter_server = ParameterServer.remote()
workers = [Worker.remote(parameter_server) for _ in range(8)]

lastMemory = 0
lastPrint = 0
while True:
    runEpisodes = [worker.RunEpisode.remote() for worker in workers]

    ray.get(runEpisodes)
    ray.get(parameter_server.get_params.remote())

    if (time.time() - 1 > lastPrint):
        total_gb = psutil.virtual_memory().total / 1e9
        used_gb = total_gb - psutil.virtual_memory().available / 1e9

        print(f'Memory usage is {used_gb:.4f} / {total_gb:.4f} = {used_gb - lastMemory:+.4f}')

        lastMemory = used_gb
        lastPrint = time.time()
The output is as follows:

Memory usage is 7.1262 / 16.8164 = +7.1262
Memory usage is 7.8350 / 16.8164 = +0.7088
Memory usage is 8.5215 / 16.8164 = +0.6864
Memory usage is 9.1583 / 16.8164 = +0.6369
Memory usage is 9.7822 / 16.8164 = +0.6239
Memory usage is 10.3920 / 16.8164 = +0.6098
Memory usage is 11.0299 / 16.8164 = +0.6379
Memory usage is 11.7228 / 16.8164 = +0.6929
Memory usage is 12.2452 / 16.8164 = +0.5224
Memory usage is 12.2513 / 16.8164 = +0.0061
Memory usage is 12.2615 / 16.8164 = +0.0103
Memory usage is 12.2662 / 16.8164 = +0.0046
Memory usage is 12.2719 / 16.8164 = +0.0058
Memory usage is 12.2768 / 16.8164 = +0.0049
Memory usage is 12.2807 / 16.8164 = +0.0039
…
Memory usage is 15.9316 / 16.8164 = +0.0005
Memory usage is 15.9335 / 16.8164 = +0.0019
Memory usage is 15.9373 / 16.8164 = +0.0038
Memory usage is 15.9401 / 16.8164 = +0.0028
Memory usage is 15.9436 / 16.8164 = +0.0035
Memory usage is 15.9476 / 16.8164 = +0.0040
Memory usage is 15.9499 / 16.8164 = +0.0024
Memory usage is 15.9558 / 16.8164 = +0.0059
Memory usage is 15.9603 / 16.8164 = +0.0045
Memory usage is 15.9670 / 16.8164 = +0.0068
Memory usage is 15.9724 / 16.8164 = +0.0054
Memory usage is 15.9708 / 16.8164 = -0.0016
Memory usage is 15.9756 / 16.8164 = +0.0048
Traceback (most recent call last):
  File "/home/XXXXX/Projects/memoryTest.py", line 45, in <module>
    ray.get(runEpisodes)
  File "/home/XXXXX/.local/lib/python3.7/site-packages/ray/worker.py", line 2197, in get
    raise value
ray.exceptions.RayTaskError: ray_Worker:RunEpisode() (pid=23262, host=Local)
  File "/home/XXXXX/.local/lib/python3.7/site-packages/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node Local is used (15.98 / 16.82 GB). The top 5 memory consumers are:

PID	MEM	COMMAND
23256	5.22GB	ray_ParameterServer
23249	5.2GB	ray_Worker
23259	5.2GB	ray_Worker:RunEpisode()
23254	5.2GB	ray_Worker
23250	5.2GB	ray_Worker:RunEpisode()

In addition, ~5.3 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the object_store_memory parameter when starting Ray, and the max Redis size with redis_max_memory.

Can you confirm this happens to you as well? All memory should be released after RunEpisode has executed, so why does the memory keep on increasing?


0 reactions
NikEyX commented, May 29, 2019

Thank you

