[rllib] Training interrupted by RayOutOfMemoryError
I'm executing a PPOTrainer with a custom environment I wrote. After some iterations (usually ~2k) the training stops with a RayOutOfMemoryError:
Traceback (most recent call last):
  File "ppo.py", line 37, in <module>
    trainer.train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 418, in train
    raise e
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 407, in train
    result = Trainable.train(self)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 140, in step
    self.num_envs_per_worker, self.train_batch_size)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/rollout.py", line 29, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray_RolloutWorker:sample() (pid=5283, host=UBUNTU)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node UBUNTU is used (14.9 / 15.67 GB). The top 10 memory consumers are:
PID MEM COMMAND
5181 6.83GiB python ppo.py
4889 0.51GiB /home/devid/.vscode/extensions/ms-python.python-2019.11.50794/languageServer.0.5.10/Microsoft.Python
1778 0.24GiB /usr/bin/gnome-shell
5283 0.23GiB ray_RolloutWorker:sample()
5276 0.18GiB ray_worker
5290 0.18GiB ray_worker
5289 0.18GiB ray_worker
5282 0.18GiB ray_worker
5286 0.18GiB ray_worker
In addition, up to 2.12 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
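For reference, both of those parameters are arguments to ray.init() (my actual attempt with them is shown further down); a minimal sketch with placeholder sizes, not recommendations:

    import ray

    # placeholder limits, only to illustrate the parameters named in the error message
    ray.init(
        object_store_memory=2 * 1024 ** 3,  # ~2 GiB for the shared object store
        redis_max_memory=1 * 1024 ** 3,     # ~1 GiB for Redis
    )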
This is how I initialize and run my trainer:
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

import rllib_wrapper.callbacks as cb
from rllib_wrapper.flatland_wrapper import FlatlandEnv
from rllib_wrapper.custom_preprocessor import TreeObsPreprocessor

ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)

trainer = PPOTrainer(env=FlatlandEnv, config={
    "num_workers": 1,
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    },
    "callbacks": {
        "on_episode_end": cb.on_episode_end,
        "on_train_result": cb.on_train_result,
    },
    "log_level": "ERROR"
})

for i in range(100000 + 2):
    trainer.train()
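In case it's useful for debugging, this is a minimal sketch of how I plan to log the driver's memory each iteration to see how quickly it grows (it assumes psutil is installed; the logging is not part of my original script):

    import os
    import psutil

    process = psutil.Process(os.getpid())

    for i in range(100000 + 2):
        trainer.train()
        rss_gb = process.memory_info().rss / 1024 ** 3
        print("iteration {}: driver RSS = {:.2f} GiB".format(i, rss_gb))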
This is my custom environment:
import gym
import numpy as np
from ray import rllib
from flatland.envs.rail_env import RailEnv

class FlatlandEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.env = RailEnv(...)
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 231))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for a in range(len(self.env.agents)):
            if a not in self.agents_done:
                o[a] = obs[a]
                r[a] = rewards[a]
                d[a] = dones[a]
                i[a] = '...'
        d['__all__'] = dones['__all__']
        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)
        return o, r, d, i
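(Side note: I'm not sure whether it's related to the leak, but observation_space above is a plain numpy array rather than a gym space; the more conventional declaration would look roughly like this sketch, mirroring the (1, 231) shape I use above.)

    import gym
    import numpy as np

    # conventional gym-space declaration for the same (1, 231) float observation
    observation_space = gym.spaces.Box(
        low=-np.inf, high=np.inf, shape=(1, 231), dtype=np.float32)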
And this is my preprocessor:
import numpy as np

from ray.rllib.models.preprocessors import Preprocessor
# normalize_observation comes from my observation utilities (see the linked repo)

class TreeObsPreprocessor(Preprocessor):
    def _init_shape(self, obs_space, options):
        self.step_memory = 2  # TODO options["custom_options"]["step_memory"]
        self.tree_depth = 2
        # trailing comma: the returned value is a 1-element shape tuple
        return sum([space.shape[0] for space in obs_space]),

    def transform(self, obs):
        if obs:
            ret = normalize_observation(obs, self.tree_depth, observation_radius=10)
        else:
            ret = np.zeros(231)
        return ret
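One small thing I noticed while poking at this: the zero fallback in transform() hard-codes 231, while _init_shape() computes the size from obs_space. If I understand the Preprocessor base class correctly (it stores the result of _init_shape() in self.shape), the fallback could be kept in sync like this (just a sketch, not what I'm currently running):

    def transform(self, obs):
        if obs:
            ret = normalize_observation(obs, self.tree_depth, observation_radius=10)
        else:
            # self.shape is set by the Preprocessor base class from _init_shape(),
            # so the padding always matches the declared observation size
            ret = np.zeros(self.shape)
        return ret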
The full code is available here, and this is my system information:
OS: Ubuntu 18.04 x86_64
ray: 0.7.6
tensorflow: 2.0.0
python: 3.7.0
I already tried these solutions, but none of them worked:
- Lowering the train_batch_size
- Setting the memory parameters as suggested by the error message:

    ray.init(
        memory=8000000000,
        redis_max_memory=8000000000,
        object_store_memory=8000000000,
    )
I'm not experienced with Ray or RL in general; could you help me understand why this error happens and how to fix it? Thanks in advance!
Top GitHub Comments
@misterdev for me this issue went away on TF 1.15.0 after upgrading from 1.14.0. It’s possible there’s a new leak that got introduced between 1.15.0 and 2.0.0.
I’m just going through and trying to remove things from the algorithm one at a time until I find where the leak is. I’ll let you know if I figure it out.