[rllib] Training interrupted by RayOutOfMemoryError
I'm executing a PPOTrainer with a custom environment I wrote. After some iterations (usually ~2k) the training stops with a RayOutOfMemoryError:
Traceback (most recent call last):
  File "ppo.py", line 37, in <module>
    trainer.train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 418, in train
    raise e
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 407, in train
    result = Trainable.train(self)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/tune/trainable.py", line 176, in train
    result = self._train()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 129, in _train
    fetches = self.optimizer.step()
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 140, in step
    self.num_envs_per_worker, self.train_batch_size)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/optimizers/rollout.py", line 29, in collect_samples
    next_sample = ray_get_and_free(fut_sample)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/rllib/utils/memory.py", line 33, in ray_get_and_free
    result = ray.get(object_ids)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/worker.py", line 2121, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayOutOfMemoryError): ray_RolloutWorker:sample() (pid=5283, host=UBUNTU)
  File "/home/devid/anaconda3/envs/baselines/lib/python3.7/site-packages/ray/memory_monitor.py", line 130, in raise_if_low_memory
    self.error_threshold))
ray.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node UBUNTU is used (14.9 / 15.67 GB). The top 10 memory consumers are:
PID MEM COMMAND
5181 6.83GiB python ppo.py
4889 0.51GiB /home/devid/.vscode/extensions/ms-python.python-2019.11.50794/languageServer.0.5.10/Microsoft.Python
1778 0.24GiB /usr/bin/gnome-shell
5283 0.23GiB ray_RolloutWorker:sample()
5276 0.18GiB ray_worker
5290 0.18GiB ray_worker
5289 0.18GiB ray_worker
5282 0.18GiB ray_worker
5286 0.18GiB ray_worker
In addition, up to 2.12 GiB of shared memory is currently being used by the Ray object store. You can set the object store size with the `object_store_memory` parameter when starting Ray, and the max Redis size with `redis_max_memory`. Note that Ray assumes all system memory is available for use by workers. If your system has other applications running, you should manually set these memory limits to a lower value.
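For reference, both of those parameters are arguments to ray.init() (my actual attempt with them is shown further down); a minimal sketch with placeholder sizes, not recommendations:

    import ray

    # placeholder limits, only to illustrate the parameters named in the error message
    ray.init(
        object_store_memory=2 * 1024 ** 3,  # ~2 GiB for the shared object store
        redis_max_memory=1 * 1024 ** 3,     # ~1 GiB for Redis
    )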
This is how I initialize and run my trainer:
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.models import ModelCatalog

import rllib_wrapper.callbacks as cb
from rllib_wrapper.flatland_wrapper import FlatlandEnv
from rllib_wrapper.custom_preprocessor import TreeObsPreprocessor

ModelCatalog.register_custom_preprocessor("tree_obs_prep", TreeObsPreprocessor)

trainer = PPOTrainer(env=FlatlandEnv, config={
    "num_workers": 1,
    "train_batch_size": 4000,
    "model": {
        "custom_preprocessor": "tree_obs_prep"
    },
    "callbacks": {
        "on_episode_end": cb.on_episode_end,
        "on_train_result": cb.on_train_result,
    },
    "log_level": "ERROR"
})

for i in range(100000 + 2):
    trainer.train()
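In case it's useful for debugging, this is a minimal sketch of how I plan to log the driver's memory each iteration to see how quickly it grows (it assumes psutil is installed; the logging is not part of my original script):

    import os
    import psutil

    process = psutil.Process(os.getpid())

    for i in range(100000 + 2):
        trainer.train()
        rss_gb = process.memory_info().rss / 1024 ** 3
        print("iteration {}: driver RSS = {:.2f} GiB".format(i, rss_gb))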
This is my custom environment:
import gym
import numpy as np
from ray import rllib
from flatland.envs.rail_env import RailEnv

class FlatlandEnv(rllib.env.MultiAgentEnv):
    def __init__(self, env_config):
        self.env = RailEnv(...)
        self.action_space = gym.spaces.Discrete(5)
        self.observation_space = np.zeros((1, 231))

    def reset(self):
        self.agents_done = []
        obs = self.env.reset()
        return obs[0]

    def step(self, action_dict):
        obs, rewards, dones, infos = self.env.step(action_dict)
        d = dict()
        r = dict()
        o = dict()
        i = dict()
        for a in range(len(self.env.agents)):
            if a not in self.agents_done:
                o[a] = obs[a]
                r[a] = rewards[a]
                d[a] = dones[a]
                i[a] = '...'
        d['__all__'] = dones['__all__']
        for agent, done in dones.items():
            if done and agent != '__all__':
                self.agents_done.append(agent)
        return o, r, d, i
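(Side note: I'm not sure whether it's related to the leak, but observation_space above is a plain numpy array rather than a gym space; the more conventional declaration would look roughly like this sketch, mirroring the (1, 231) shape I use above.)

    import gym
    import numpy as np

    # conventional gym-space declaration for the same (1, 231) float observation
    observation_space = gym.spaces.Box(
        low=-np.inf, high=np.inf, shape=(1, 231), dtype=np.float32)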
And this is my preprocessor:
import numpy as np

from ray.rllib.models.preprocessors import Preprocessor
# normalize_observation comes from my observation utilities (see the linked repo)

class TreeObsPreprocessor(Preprocessor):
    def _init_shape(self, obs_space, options):
        self.step_memory = 2  # TODO options["custom_options"]["step_memory"]
        self.tree_depth = 2
        # trailing comma: the returned value is a 1-element shape tuple
        return sum([space.shape[0] for space in obs_space]),

    def transform(self, obs):
        if obs:
            ret = normalize_observation(obs, self.tree_depth, observation_radius=10)
        else:
            ret = np.zeros(231)
        return ret
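One small thing I noticed while poking at this: the zero fallback in transform() hard-codes 231, while _init_shape() computes the size from obs_space. If I understand the Preprocessor base class correctly (it stores the result of _init_shape() in self.shape), the fallback could be kept in sync like this (just a sketch, not what I'm currently running):

    def transform(self, obs):
        if obs:
            ret = normalize_observation(obs, self.tree_depth, observation_radius=10)
        else:
            # self.shape is set by the Preprocessor base class from _init_shape(),
            # so the padding always matches the declared observation size
            ret = np.zeros(self.shape)
        return ret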
The full code is available here, and this is my system information:
OS: Ubuntu 18.04 x86_64
ray: 0.7.6
tensorflow: 2.0.0
python: 3.7.0
I already tried these solutions, but none of them worked:
- Lowering the train_batch_size
- Setting the memory parameters as suggested by the error message:

    ray.init(
        memory=8000000000,
        redis_max_memory=8000000000,
        object_store_memory=8000000000,
    )
I'm not experienced with Ray or RL in general; could you help me understand why this error happens and how to fix it? Thanks in advance!
Top GitHub Comments
@misterdev for me this issue went away on TF 1.15.0 after upgrading from 1.14.0. It’s possible there’s a new leak that got introduced between 1.15.0 and 2.0.0.
I’m just going through and trying to remove things from the algorithm one at a time until I find where the leak is. I’ll let you know if I figure it out.