Ray RayOutOfMemoryError is consistently raised for Apex-DQN
System information
- OS Platform and Distribution: Ubuntu 16.04 Docker container with 11 GB RAM and 9 CPU cores, no GPUs; the Docker container runs on a Mac.
- Ray installed from (source or binary): pip install
- Ray version: 0.7.5 (pip install ray==0.7.5)
- TensorFlow version: 1.14.0
- Python version: 3.6.8
- Exact command to reproduce: python3 test_rllib_apex_dqn.py
Describe the problem
This is the script test_rllib_apex_dqn.py that I used to train Apex-DQN with 3 workers on the OpenAI Gym CartPole environment:
import ray
import ray.rllib.agents.dqn.apex as apex
from ray.tune.logger import pretty_print
ray.init(object_store_memory=6000000000, num_cpus=8)
config = apex.APEX_DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 3
agent = apex.ApexTrainer(config=config, env="CartPole-v0")
# Can optionally call agent.restore(path) to load a checkpoint.
MAX_ITERS = 300000000
for i in range(MAX_ITERS):
result = agent.train()
print(pretty_print(result))
After training for ~10 minutes, it ran out of memory:
2019-10-22 20:17:01,957 INFO trainer.py:414 -- Worker crashed during call to train(). To attempt to continue training without the failed worker, set `'ignore_worker_failures': True`.
Traceback (most recent call last):
File "profile_ray.py", line 46, in <module>
result = agent.train()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 417, in train
raise e
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 406, in train
result = Trainable.train(self)
File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 176, in train
result = self._train()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
fetches = self.optimizer.step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 142, in step
sample_timesteps, train_timesteps = self._step()
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/async_replay_optimizer.py", line 213, in _step
counts = ray_get_and_free([c[1][1] for c in completed])
File "/usr/local/lib/python3.6/dist-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
result = ray.get(object_ids)
File "/usr/local/lib/python3.6/dist-packages/ray/worker.py", line 2349, in get
raise value
ray.exceptions.RayTaskError: ray_RolloutWorker:sample_with_count() (pid=152, host=49232830b44b)
File "/usr/local/lib/python3.6/dist-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node 49232830b44b is used (12.06 / 12.7 GB). The top 10 memory consumers are:
PID MEM COMMAND
147 0.34GiB ray_ReplayActor:update_priorities()
146 0.33GiB ray_ReplayActor:update_priorities()
148 0.33GiB ray_ReplayActor:replay()
151 0.33GiB ray_ReplayActor:replay()
101 0.19GiB python3 profile_ray.py
153 0.16GiB ray_RolloutWorker:sample_with_count()
152 0.16GiB ray_RolloutWorker:sample_with_count()
150 0.16GiB ray_RolloutWorker:sample_with_count()
125 0.13GiB /usr/local/lib/python3.6/dist-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:48974
149 0.12GiB ray_worker
In addition, up to 4.27 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
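The note at the end of the log points at the two knobs it names. As a minimal sketch with purely illustrative sizes (the exact values are assumptions, not a recommendation), both limits can be lowered when starting Ray on a memory-constrained node:
import ray
# Illustrative values only: cap the plasma object store at 2 GB and Redis at 1 GB,
# since Ray otherwise assumes all system memory is available to its workers.
ray.init(
    object_store_memory=2 * 1024**3,
    redis_max_memory=1 * 1024**3,
    num_cpus=8,
)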
The issue is reproducible every time on the local Mac.
Top GitHub Comments
Good news, I think I found the leak. It happens because we create a new TF assign op in Python on each target-network update, and TensorFlow never garbage collects these operations.
For reference, I was able to identify the issue by running py-spy dump on the APEX trainer process. It showed some stack traces running _extend_graph in TensorFlow, which should never happen after the initial policy creation. Can you try this out?
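To illustrate the failure mode described in the comment above (a sketch only, not the actual RLlib code or fix), building a tf.assign op inside the update loop adds a new node to the graph on every call, while building the op once and reusing it keeps the graph size constant:
import tensorflow as tf
weights = tf.Variable(tf.zeros([256, 256]))
target_weights = tf.Variable(tf.zeros([256, 256]))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Leaky pattern: each call adds a fresh assign op to the graph, and
# TensorFlow never garbage collects graph nodes.
def update_target_leaky():
    sess.run(tf.assign(target_weights, weights))
# Fixed pattern: build the assign op once and reuse the same graph node.
update_op = tf.assign(target_weights, weights)
def update_target_fixed():
    sess.run(update_op)
for _ in range(3):
    update_target_leaky()  # graph keeps growing
    update_target_fixed()  # graph size stays constant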
@ericl Also, even if we use a node with much more memory, memory usage still grows without bound, and it would still hit the memory issue at some point during training (if we leave the max training iterations very large).
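For what it's worth, a minimal sketch (assuming psutil is installed; the loop bound and variable names are illustrative) that makes the steady growth visible by logging the driver process RSS after every training iteration; the replay and rollout worker processes can be tracked the same way by PID:
import os
import psutil
import ray
import ray.rllib.agents.dqn.apex as apex
ray.init(object_store_memory=6000000000, num_cpus=8)
config = apex.APEX_DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 3
agent = apex.ApexTrainer(config=config, env="CartPole-v0")
proc = psutil.Process(os.getpid())
for i in range(100):
    agent.train()
    # Report resident set size in GB to see whether it grows without bound.
    print("iter {}: driver RSS = {:.2f} GB".format(i, proc.memory_info().rss / 1e9))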