
RayOutOfMemoryError is consistently raised for Apex-DQN

See original GitHub issue

System information

  • OS Platform and Distribution: Ubuntu 16.04 Docker container with 11 GB RAM, 9 CPU cores, and no GPUs; the container runs on a Mac.
  • Ray installed from (source or binary): pip install
  • Ray version: 0.7.5 (pip install ray==0.7.5)
  • Tensorflow version: 1.14.0
  • Python version: 3.6.8
  • Exact command to reproduce: python3 test_rllib_apex_dqn.py

Describe the problem

This is the script (test_rllib_apex_dqn.py) I used to train Apex-DQN with 3 workers on the OpenAI Gym CartPole-v0 environment.

import ray
import ray.rllib.agents.dqn.apex as apex
from ray.tune.logger import pretty_print


ray.init(object_store_memory=6000000000, num_cpus=8)
config = apex.APEX_DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 3
agent = apex.ApexTrainer(config=config, env="CartPole-v0")

# Can optionally call agent.restore(path) to load a checkpoint.
MAX_ITERS = 300000000


for i in range(MAX_ITERS):
    result = agent.train()
    print(pretty_print(result))

After roughly 10 minutes of training, it ran out of memory:

2019-10-22 20:17:01,957	INFO trainer.py:414 -- Worker crashed during call to train(). To attempt to continue training without the failed worker, set `'ignore_worker_failures': True`.
Traceback (most recent call last):
  File "profile_ray.py", line 46, in <module>
    result = agent.train()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 417, in train
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 406, in train
    result = Trainable.train(self)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 142, in step
    sample_timesteps, train_timesteps = self._step()
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 213, in _step
    counts = ray_get_and_free([c[1][1] for c in completed])
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
  File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2349, in get
    raise value
ray.exceptions.RayTaskError: ray_RolloutWorker:sample_with_count() (pid=152, host=49232830b44b)
  File "/usr/local/lib/python3.6/dist-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 49232830b44b is used (12.06 / 12.7 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
147	0.34GiB	ray_ReplayActor:update_priorities()
146	0.33GiB	ray_ReplayActor:update_priorities()
148	0.33GiB	ray_ReplayActor:replay()
151	0.33GiB	ray_ReplayActor:replay()
101	0.19GiB	python3 profile_ray.py
153	0.16GiB	ray_RolloutWorker:sample_with_count()
152	0.16GiB	ray_RolloutWorker:sample_with_count()
150	0.16GiB	ray_RolloutWorker:sample_with_count()
125	0.13GiB	/usr/local/lib/python3.6/dist-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:48974
149	0.12GiB	ray_worker

In addition, up to 4.27 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.

The issue reproduces every time the script is run on my local Mac.
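
For reference, the memory monitor's note above about the object store and Redis corresponds to two ray.init parameters. Below is a minimal sketch of how they could be set on a memory-constrained node; the specific values are illustrative, not a recommendation, and lowering them limits the object store and Redis rather than the Python worker heaps where a leak would accumulate.

import ray

# Illustrative limits only: cap the plasma object store and Redis so Ray
# leaves headroom for the replay actors and rollout workers on this node.
ray.init(
    object_store_memory=2 * 1024**3,  # 2 GiB object store
    redis_max_memory=1 * 1024**3,     # 1 GiB cap for Redis
    num_cpus=8,
)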

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
ericl commented, Oct 23, 2019

Good news, I think I found the leak. It happens because we create a new TF assign op in Python on every target-network update, and TensorFlow never garbage-collects these operations.

For reference, I was able to identify the issue by running py-spy dump on the APEX trainer process. It showed stack traces running _extend_graph in TensorFlow, which should never happen after the initial policy creation.

Can you try this out?
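
To make the leak pattern described above concrete, here is an illustrative TF 1.x sketch (not RLlib's actual code): calling tf.assign inside the update function adds fresh ops to the graph on every target update, so the graph only ever grows; building the assign ops once and reusing them avoids this.

import tensorflow as tf

# Leaky pattern: each call adds new assign ops to the default graph, and
# TensorFlow never garbage-collects graph operations.
def update_target_leaky(sess, target_vars, source_vars):
    ops = [tf.assign(t, s) for t, s in zip(target_vars, source_vars)]
    sess.run(ops)

# Fixed pattern: build the assign ops once, then reuse the cached group op.
class TargetUpdater:
    def __init__(self, target_vars, source_vars):
        self._update_op = tf.group(
            *[tf.assign(t, s) for t, s in zip(target_vars, source_vars)])

    def __call__(self, sess):
        sess.run(self._update_op)

The py-spy observation fits this picture: py-spy dump --pid <trainer PID> prints the live stack of a running process, and seeing TensorFlow's _extend_graph there after startup indicates ops are still being added to the graph during training.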

0 reactions
RuofanKong commented, Oct 22, 2019

@ericl Also, even when we use a node with much more memory, memory usage still grows without bound, and training would still hit the memory issue at some point if the maximum number of training iterations is left very large.
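
One way to confirm that kind of unbounded growth is to log the trainer process's resident memory each iteration. A minimal sketch, assuming the agent built in the script above and the third-party psutil package:

import psutil

# Illustrative diagnostic (not part of the original script): track the
# trainer's resident set size alongside training progress.
proc = psutil.Process()  # the current (driver/trainer) process
for i in range(200):
    result = agent.train()
    rss_gib = proc.memory_info().rss / 1024**3
    print("iter {}: reward_mean={:.1f}, trainer RSS={:.2f} GiB".format(
        i, result["episode_reward_mean"], rss_gib))

A steadily climbing RSS across iterations, independent of the object store usage Ray reports, is consistent with the per-update graph growth described above.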
