[rllib] Distributed learning constantly crashing due to memory
Hi, I keep getting memory-error crashes. This happens both with my own implementations and with the pre-defined rl-experiments. I have 16 GB of RAM available:
```
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node Local is used (15.98 / 16.82 GB). The top 5 memory consumers are:

PID    MEM      COMMAND
4758   7.3GB    ray_ImpalaTrainer:train()
5113   4.5GB    ray_PolicyEvaluator
4753   4.5GB    ray_PolicyEvaluator:apply()
4761   4.49GB   ray_PolicyEvaluator:apply()
4756   4.49GB   ray_PolicyEvaluator:apply()

In addition, ~4.24 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.
```
How can I fix this?
I’m also surprised by how much memory each worker uses. In my own implementation I made sure the workers never hold on to much memory. Why on earth are the individual workers’ RAM usages slowly growing all the time?
Note 1: I DID, of course, try reducing object_store_memory to something smaller, but eventually the workers’ RAM usage grows too large and it crashes anyway.
Note 2: For the above output I was using the rllib train -f pong-speedrun/pong-impala-fast.yaml example, except I adapted it for 8 workers.
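For anyone hitting the same error, here is a minimal sketch of setting the two limits the message refers to when starting Ray from a script (both parameters take bytes; the values below are illustrative, not recommendations):

```python
import ray

# Illustrative caps: 1 GB for the shared-memory object store and
# 500 MB for Redis. Both parameters are specified in bytes.
ray.init(
    object_store_memory=1024 * 1024 * 1024,
    redis_max_memory=500 * 1024 * 1024,
)
```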
Issue Analytics
- Created 4 years ago
- Comments: 12 (6 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That’s actually expected unless you restrict the object store size. It seems you started off with 7 GB of memory already used, which explains why you run out (by default, Ray assumes it can allocate 100% of the machine’s memory to itself). Can you reproduce with the object store memory limited to, say, 500 MB?
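For reference, one way to try that suggestion in a script, assuming a Ray version from this era and the IMPALA/Pong settings from the yaml above (the config is abbreviated; the env name and worker count are taken from the experiment, everything else is omitted):

```python
import ray
from ray import tune

# Hard-cap the shared-memory object store at 500 MB, as suggested.
ray.init(object_store_memory=500 * 1024 * 1024)

# Abbreviated stand-in for pong-impala-fast.yaml: same trainer and
# worker count; all other hyperparameters are left at their defaults.
tune.run(
    "IMPALA",
    config={
        "env": "PongNoFrameskip-v4",
        "num_workers": 8,
    },
)
```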
Also note that RLlib uses rllib.utils.ray_get_and_free() to optimize its internal memory usage by explicitly freeing objects once they have been retrieved, which isn’t done in this example (see the sketch below).
Eric
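To illustrate the ray_get_and_free() point above, here is a rough sketch of swapping it in for a plain ray.get() in a user-side sampling loop. This assumes the Ray versions contemporary with this issue (~0.7/0.8), where the helper lived in ray.rllib.utils.memory; it was removed in later releases once distributed reference counting made explicit frees unnecessary:

```python
import ray
# Helper from Ray ~0.7/0.8; removed in later releases.
from ray.rllib.utils.memory import ray_get_and_free

ray.init(object_store_memory=500 * 1024 * 1024)

@ray.remote
def collect_rollout():
    # Stand-in for a PolicyEvaluator sample step: returns a large batch.
    return [0.0] * 1_000_000

for _ in range(100):
    futures = [collect_rollout.remote() for _ in range(8)]
    # A bare ray.get(futures) leaves the results pinned in the object
    # store until the driver's ObjectIDs are garbage-collected;
    # ray_get_and_free fetches the values and frees the entries eagerly.
    batches = ray_get_and_free(futures)
```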