
[rllib] Slowly running out of memory in eager + tracing

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.5
  • Python version: 3.6.8
  • Exact command to reproduce: rllib train --run=APEX --env=BreakoutNoFrameskip-v4 --ray-object-store-memory 10000000000 (a rough Python equivalent is sketched below)
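For reference, here is a rough Python equivalent of the CLI invocation above. This is a sketch only; it assumes a Ray version of that era where `ray.init(object_store_memory=...)` and `tune.run_experiments(...)` are available, and the experiment name is arbitrary:

```python
# Sketch: roughly equivalent to the `rllib train` command above.
import ray
from ray import tune

# 10 GB object store, matching --ray-object-store-memory 10000000000
ray.init(object_store_memory=10_000_000_000)

tune.run_experiments({
    "apex-breakout": {
        "run": "APEX",
        "config": {"env": "BreakoutNoFrameskip-v4"},
    },
})
```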

Describe the problem

The Agent class slowly grows in memory until it runs out. This also happens with APPO. (It takes ~10M steps with the Atari command line above, but with my own environment, which has a larger observation space and itself consumes a lot of RAM, it happens faster.)

Memory usage starts at around 32 GB (out of 64 GB) and then slowly grows to 64 GB over the 10M steps, until the process crashes.
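One way to quantify this kind of slow growth, independently of `top`, is to log the resident memory of the suspect processes at a fixed interval. A minimal sketch using `psutil`; the PID list and interval are placeholders:

```python
# Sketch: periodically log the resident memory (RSS) of given PIDs
# to confirm a slow, monotonic growth over training.
import time
import psutil

PIDS = [44603]        # placeholder: e.g. the trainer process
INTERVAL_S = 300      # one sample every 5 minutes

while True:
    for pid in PIDS:
        try:
            rss_gb = psutil.Process(pid).memory_info().rss / 1e9
            print(f"{time.strftime('%H:%M:%S')} pid={pid} rss={rss_gb:.2f} GB")
        except psutil.NoSuchProcess:
            print(f"pid={pid} exited")
    time.sleep(INTERVAL_S)
```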

Source code / logs


2019-03-28 14:59:16,458 ERROR trial_runner.py:460 -- Error processing event.
Traceback (most recent call last):
  File "/home/opher/ray_0.6.5/python/ray/tune/trial_runner.py", line 409, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/opher/ray_0.6.5/python/ray/tune/ray_trial_executor.py", line 314, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/opher/ray_0.6.5/python/ray/worker.py", line 2316, in get
    raise value
ray.exceptions.RayTaskError: ray_ApexAgent:train() (pid=44603, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 316, in train
    raise e
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/agent.py", line 305, in train
    result = Trainable.train(self)
  File "/home/opher/ray_0.6.5/python/ray/tune/trainable.py", line 151, in train
    result = self._train()
  File "/home/opher/ray_0.6.5/python/ray/rllib/agents/dqn/dqn.py", line 261, in _train
    self.optimizer.step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 118, in step
    sample_timesteps, train_timesteps = self._step()
  File "/home/opher/ray_0.6.5/python/ray/rllib/optimizers/async_replay_optimizer.py", line 188, in _step
    counts = ray.get([c[1][1] for c in completed])
ray.exceptions.RayTaskError: ray_PolicyEvaluator:sample_with_count() (pid=44621, host=osrv)
  File "/home/opher/ray_0.6.5/python/ray/memory_monitor.py", line 77, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node osrv is used (64.09 / 67.46 GB). The top 5 memory consumers are:

PID     MEM     COMMAND
44603   34.89GB ray_ApexAgent:train()
44591   12.94GB ray_ReplayActor:add_batch()
44612   12.91GB ray_ReplayActor:add_batch()
44617   12.83GB ray_ReplayActor:add_batch()
44632   12.83GB ray_ReplayActor:add_batch()

In addition, ~10.46 GB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`.

The above numbers can’t be real, as I have only 64 GB on my machine. These are also the numbers shown by ‘top’ in the ‘RES’ column, but I think RES also includes the SHR memory (which was around 10 GB for each of the above processes), so the actual numbers are probably ~24 GB for the agent and ~3 GB for each replay actor.
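This interpretation can be checked directly: `top`’s RES counts pages shared with the Ray object store, whereas USS (unique set size) counts only memory private to the process. A small sketch using `psutil`; the PID is a placeholder taken from the log above, and reading another process’s smaps generally requires it to belong to the same user:

```python
# Sketch: compare a process's RES with its private (USS) memory.
import psutil

pid = 44603                               # placeholder: the ray_ApexAgent process
full = psutil.Process(pid).memory_full_info()
rss_gb = full.rss / 1e9                   # what `top` shows as RES
uss_gb = full.uss / 1e9                   # memory unique to this process
shared_gb = (full.rss - full.uss) / 1e9   # roughly the pages shared with the object store
print(f"RES={rss_gb:.1f} GB  private(USS)={uss_gb:.1f} GB  shared~={shared_gb:.1f} GB")
```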

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 22 (10 by maintainers)

Top GitHub Comments

1 reaction
pengzhenghao commented, Apr 6, 2019

That script runs correctly. Sorry for my carelessness; I calculated the memory consumption for my case and found that the batch size is too large, so consuming this much memory is reasonable. It simply comes down to insufficient memory.
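For context, a back-of-the-envelope estimate shows how quickly a large batch or replay buffer of image observations adds up. The shapes and sizes below are illustrative assumptions, not RLlib defaults:

```python
# Illustrative arithmetic only: estimate replay-buffer memory for image observations.
import numpy as np

obs_shape = (84, 84, 4)                   # stacked greyscale Atari frames, uint8
bytes_per_obs = int(np.prod(obs_shape))   # 1 byte per uint8 element -> 28,224 bytes
buffer_size = 2_000_000                   # transitions held in the replay buffer (assumed)

# Each stored transition keeps at least obs and next_obs (ignoring compression).
total_bytes = buffer_size * bytes_per_obs * 2
print(f"~{total_bytes / 1e9:.1f} GB uncompressed")   # ~112.9 GB
```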

0 reactions
GoingMyWay commented, Jul 25, 2021

@GoingMyWay for the package I am building I was hoping to find a solution that is extensible to all supported agents and not need to extend the policy optimizer for each of them. Is there any identification of the source of this memory leak? Is it specific to PPO and APPO?

@SamShowalter

Hi, for TF 2.x the memory leak may be due to the batch size varying between calls. Please see my workaround.
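A minimal illustration of the retracing behaviour referred to here: in TF 2.x, a `tf.function` traces a new graph for every new input shape, so batches of varying length keep adding traced graphs and their buffers. The sketch below also shows the usual remedy, pinning the input signature with a `None` batch dimension; this is a general TensorFlow technique, not necessarily the exact workaround from the thread:

```python
# Sketch: varying input shapes cause tf.function to retrace and keep extra graphs.
import tensorflow as tf

@tf.function
def forward(x):
    print("tracing for shape", x.shape)    # Python-side print runs only during tracing
    return tf.reduce_sum(x, axis=-1)

for batch in (32, 33, 34):
    forward(tf.zeros([batch, 84, 84, 4]))  # prints three times: three separate traces

# Fixing the input signature with a None batch dimension reuses a single trace:
@tf.function(input_signature=[tf.TensorSpec([None, 84, 84, 4], tf.float32)])
def forward_fixed(x):
    print("tracing once")                  # prints only once
    return tf.reduce_sum(x, axis=-1)

for batch in (32, 33, 34):
    forward_fixed(tf.zeros([batch, 84, 84, 4]))
```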

Read more comments on GitHub.

