[rllib] Slowly running out of memory in eager + tracing
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Ray installed from (source or binary): source
- Ray version: 0.6.5
- Python version: 3.6.8
- Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000
Describe the problem
The Agent process slowly grows in memory until it runs out. The same thing happens with APPO. (It takes ~10M steps with the Atari command line above, but with my own env, which has a larger observation space and itself consumes a lot of RAM, it happens faster.)
Memory usage starts at around ~32GB (out of 64GB) and then slowly grows to 64GB over 10M steps, at which point the run crashes.
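For reference, below is a rough Python equivalent of the reproduction command, written against what I believe were the Ray 0.6.x-era APIs (`ray.init(object_store_memory=...)` and `ray.tune.run_experiments`); the `buffer_size` override is only an illustration of bounding the replay actors' memory and is not part of the original report.

```python
# Hypothetical sketch of the CLI reproduction, assuming Ray 0.6.x-era APIs.
import ray
from ray import tune

# Mirror --ray-object-store-memory 10000000000 (10 GB plasma object store).
ray.init(object_store_memory=10 * 10**9)

tune.run_experiments({
    "apex-breakout": {
        "run": "APEX",
        "env": "BreakoutNoFrameskip-v4",
        "config": {
            # Illustrative only: a smaller replay buffer bounds how much the
            # ReplayActor processes can grow (APEX defaults to a larger one).
            "buffer_size": 500000,
        },
    },
})
```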
Source code / logs
2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
result = ray.get(trial_future[0])
File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
raise e
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
result = Trainable.train(self)
File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
result = self._train()
File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
self.optimizer.step()
File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
sample_timesteps, train_timesteps = self._step()
File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:
PID MEM COMMAND
44603 34.89GB ray_ApexAgent:train()
44591 12.94GB ray_ReplayActor:add_batch()
44612 12.91GB ray_ReplayActor:add_batch()
44617 12.83GB ray_ReplayActor:add_batch()
44632 12.83GB ray_ReplayActor:add_batch()
In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.
The above numbers can’t be real, as I only have 64GB on this machine. They match what `top` shows in the RES column, but I think RES also includes the SHR (shared) memory, which was around 10GB for each of these processes, so the actual numbers are probably ~24GB for the agent and ~3GB for each replay actor.
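As a sanity check on those numbers, here is a small psutil-based sketch (my own addition, not from the report) that breaks each Ray worker's memory down into resident, shared, and unique portions; USS is the footprint with shared pages excluded, which is what the RES-vs-SHR question above comes down to.

```python
# Sketch: break down Ray worker memory into RSS / shared / USS with psutil.
# memory_full_info() reads /proc/<pid>/smaps on Linux, so it may need to run
# as the same user (or root) as the Ray processes.
import psutil

def report(proc):
    info = proc.memory_full_info()
    print("{:>8} {:30s} rss={:5.2f}GB shared={:5.2f}GB uss={:5.2f}GB".format(
        proc.pid, proc.name()[:30],
        info.rss / 1e9, info.shared / 1e9, info.uss / 1e9))

for proc in psutil.process_iter():
    try:
        if proc.name().startswith("ray"):  # e.g. ray_ApexAgent, ray_ReplayActor
            report(proc)
    except psutil.Error:
        pass  # process exited or access denied
```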
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
That script runs correctly. Sorry for my carelessness: I calculated the memory consumption for my case and found that the batch size is simply too large, so using this much memory is expected. The real problem is insufficient memory.
@SamShowalter Hi, for TF 2.x the memory leak may be due to the varying batch size. Please read my workaround.
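To make the varying-batch-size remark concrete: in TF 2.x a `tf.function` is retraced for every new input shape it encounters, and every trace keeps its own concrete function and graph alive, so feeding batches of changing length in eager + tracing mode can look exactly like a slow leak. The sketch below shows the usual mitigation, fixing the traced signature (or, equivalently, padding batches to a constant size); it is my illustration of the general technique, not necessarily the commenter's actual workaround.

```python
# Sketch: avoid per-batch-size retracing by pinning the input signature.
import tensorflow as tf

# shape=[None, 84, 84, 4] is just an example (Atari-style observations);
# the leading None lets every batch size reuse the same trace.
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 84, 84, 4],
                                            dtype=tf.float32)])
def forward(obs):
    return tf.reduce_mean(obs, axis=[1, 2, 3])

# Without input_signature, these two calls would produce two separate traces
# (and two retained graphs); with it, they share one concrete function.
forward(tf.zeros([32, 84, 84, 4]))
forward(tf.zeros([17, 84, 84, 4]))
```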