
[rllib] Slowly running out of memory in eager + tracing

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.5
  • Python version: 3.6.8
  • Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000 (a rough Python equivalent is sketched below)
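For reference, here is a rough Python equivalent of the CLI invocation above. This is a sketch only; it assumes a Ray version of that era where `ray.init(object_store_memory=...)` and `tune.run_experiments(...)` are available, and the experiment name is arbitrary:

```python
# Sketch: roughly equivalent to the `rllib train` command above.
import ray
from ray import tune

# 10 GB object store, matching --ray-object-store-memory 10000000000
ray.init(object_store_memory=10_000_000_000)

tune.run_experiments({
    "apex-breakout": {
        "run": "APEX",
        "config": {"env": "BreakoutNoFrameskip-v4"},
    },
})
```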

Describe the problem

The Agent class slowly grows in memory until it runs out. This also happens with APPO. (It takes ~10M steps with the Atari command line above, but with my own environment, which has a larger observation space and itself consumes a lot of RAM, it happens faster.)

Memory usage starts at around 32 GB (out of 64 GB) and then slowly grows to 64 GB over the 10M steps, until the process crashes.
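One way to quantify this kind of slow growth, independently of `top`, is to log the resident memory of the suspect processes at a fixed interval. A minimal sketch using `psutil`; the PID list and interval are placeholders:

```python
# Sketch: periodically log the resident memory (RSS) of given PIDs
# to confirm a slow, monotonic growth over training.
import time
import psutil

PIDS = [44603]        # placeholder: e.g. the trainer process
INTERVAL_S = 300      # one sample every 5 minutes

while True:
    for pid in PIDS:
        try:
            rss_gb = psutil.Process(pid).memory_info().rss / 1e9
            print(f"{time.strftime('%H:%M:%S')} pid={pid} rss={rss_gb:.2f} GB")
        except psutil.NoSuchProcess:
            print(f"pid={pid} exited")
    time.sleep(INTERVAL_S)
```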

Source code / logs


2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
  File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
    raise e
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
    result = Trainable.train(self)
  File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
    self.optimizer.step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
    sample_timesteps, train_timesteps = self._step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
    counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:

PID     MEM     COMMAND
44603   34.89GB ray_ApexAgent:train()
44591   12.94GB ray_ReplayActor:add_batch()
44612   12.91GB ray_ReplayActor:add_batch()
44617   12.83GB ray_ReplayActor:add_batch()
44632   12.83GB ray_ReplayActor:add_batch()

In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

The above numbers can’t be real, as I have only 64 GB on my machine. These are also the numbers shown by ‘top’ in the ‘RES’ column, but I think RES also includes the SHR memory (which was around 10 GB for each of the above processes), so the actual numbers are probably ~24 GB for the agent and ~3 GB for each replay actor.
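This interpretation can be checked directly: `top`’s RES counts pages shared with the Ray object store, whereas USS (unique set size) counts only memory private to the process. A small sketch using `psutil`; the PID is a placeholder taken from the log above, and reading another process’s smaps generally requires it to belong to the same user:

```python
# Sketch: compare a process's RES with its private (USS) memory.
import psutil

pid = 44603                               # placeholder: the ray_ApexAgent process
full = psutil.Process(pid).memory_full_info()
rss_gb = full.rss / 1e9                   # what `top` shows as RES
uss_gb = full.uss / 1e9                   # memory unique to this process
shared_gb = (full.rss - full.uss) / 1e9   # roughly the pages shared with the object store
print(f"RES={rss_gb:.1f} GB  private(USS)={uss_gb:.1f} GB  shared~={shared_gb:.1f} GB")
```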

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 22 (10 by maintainers)

Top GitHub Comments

1 reaction
pengzhenghao commented, Apr 6, 2019

That script runs correctly. Sorry for my carelessness; I calculated the memory consumption for my case and found that the batch size is too large, so consuming this much memory is reasonable. It simply comes down to insufficient memory.
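For context, a back-of-the-envelope estimate shows how quickly a large batch or replay buffer of image observations adds up. The shapes and sizes below are illustrative assumptions, not RLlib defaults:

```python
# Illustrative arithmetic only: estimate replay-buffer memory for image observations.
import numpy as np

obs_shape = (84, 84, 4)                   # stacked greyscale Atari frames, uint8
bytes_per_obs = int(np.prod(obs_shape))   # 1 byte per uint8 element -> 28,224 bytes
buffer_size = 2_000_000                   # transitions held in the replay buffer (assumed)

# Each stored transition keeps at least obs and next_obs (ignoring compression).
total_bytes = buffer_size * bytes_per_obs * 2
print(f"~{total_bytes / 1e9:.1f} GB uncompressed")   # ~112.9 GB
```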

0 reactions
GoingMyWay commented, Jul 25, 2021

@GoingMyWay for the package I am building I was hoping to find a solution that is extensible to all supported agents and not need to extend the policy optimizer for each of them. Is there any identification of the source of this memory leak? Is it specific to PPO and APPO?

@SamShowalter

Hi, for TF 2.x the memory leak may be due to the batch size varying between calls. Please see my workaround.
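A minimal illustration of the retracing behaviour referred to here: in TF 2.x, a `tf.function` traces a new graph for every new input shape, so batches of varying length keep adding traced graphs and their buffers. The sketch below also shows the usual remedy, pinning the input signature with a `None` batch dimension; this is a general TensorFlow technique, not necessarily the exact workaround from the thread:

```python
# Sketch: varying input shapes cause tf.function to retrace and keep extra graphs.
import tensorflow as tf

@tf.function
def forward(x):
    print("tracing for shape", x.shape)    # Python-side print runs only during tracing
    return tf.reduce_sum(x, axis=-1)

for batch in (32, 33, 34):
    forward(tf.zeros([batch, 84, 84, 4]))  # prints three times: three separate traces

# Fixing the input signature with a None batch dimension reuses a single trace:
@tf.function(input_signature=[tf.TensorSpec([None, 84, 84, 4], tf.float32)])
def forward_fixed(x):
    print("tracing once")                  # prints only once
    return tf.reduce_sum(x, axis=-1)

for batch in (32, 33, 34):
    forward_fixed(tf.zeros([batch, 84, 84, 4]))
```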

Read more comments on GitHub.

