Leaking Worker Memory ignores cap set by ray.init(memory=cap)

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18 LTS
  • Ray installed from (source or binary): pip installation
  • Ray version: 0.7.6
  • Python version: 3.6.8
  • Exact command to reproduce: -

Describe the problem

I am running 6 parallel workers that collect data by exploring gym environments and save the collected data into TFRecord files. Each worker returns a namedtuple instance with some statistics about the collection. In the main process I initialize Ray as follows:

ray.init(local_mode=self.debug, num_cpus=6, logging_level=logging.DEBUG,
           object_store_memory=object_store_memory, memory=(1024.0 ** 3) * 0.5)

where object_store_memory is my own heuristic, which in practice comes out close to the default value anyway. In a loop of n iterations (where n could in principle be arbitrarily large) I create workers and collect the statistics as follows:

split_stats = ray.get([
    collect.remote(model_representation, self.horizon, self.env_name,
                   self.discount, self.lam, self.tbptt_length, pid)
    for pid in range(self.workers)
])

Now, I have observed that after some number of iterations my script crashes because there is no memory left: the workers take up too much of it. However, if the workers together needed too much memory within a single iteration, I would expect the crash to happen immediately, not after, say, 200 iterations. I therefore monitored memory over time, and indeed the memory used by the workers grows steadily. Why could this happen? The processes should be completely independent of each other, and I would expect a worker's memory to be released once its task finishes. The parameters of collect are, in order: a tuple of a string and a dict of TensorFlow model weights, an integer, a string, a float, another float, an integer, and the process id.
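
For reference, here is a minimal sketch of how such a task could be declared, reconstructed from the call above (the function body and the CollectionStats namedtuple are placeholders, not the original code):

import collections

import ray

# Placeholder for the statistics namedtuple each worker returns.
CollectionStats = collections.namedtuple("CollectionStats", ["pid", "steps"])

@ray.remote
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    # model_representation is a (string, weights-dict) tuple as described above.
    # The real body would roll out the gym environment for `horizon` steps and
    # write the collected transitions to TFRecord files.
    return CollectionStats(pid=pid, steps=horizon)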

I also tried capping the memory with the memory flag in ray.init(), but memory usage simply grows past the cap anyway. I have attached examples of this.
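
As far as the Ray documentation describes, memory requests of this kind (including the memory argument to ray.init()) are used for scheduling rather than enforced as hard per-process limits, which would be consistent with the behaviour above. For completeness, a minimal sketch of attaching a memory request to the task itself (the 512 MiB figure is purely illustrative):

# Request 512 MiB per collect task; like ray.init(memory=...), this guides
# scheduling/admission control and does not impose a hard memory cap.
@ray.remote(memory=512 * 1024 * 1024)
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    ...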

Source code / logs

(Screenshots Selection_136 and Selection_137: the memory usage examples referenced above.)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

5 reactions
ericl commented, Dec 9, 2019

It’s possible somehow there is a leak since Ray re-uses workers… You can force Ray to recycle the workers after a number of calls with @ray.remote(max_calls=5).
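
Applied to the collect task from the question, that suggestion would look roughly like this (a sketch; 5 is simply the value mentioned in the comment and can be tuned):

# Shut down and replace the worker process after every 5 executions of collect,
# so any memory leaked inside the worker (e.g. by TensorFlow) is returned to the OS.
@ray.remote(max_calls=5)
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    ...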

1 reaction
eugenevinitsky commented, Dec 20, 2019

@ericl, @richardliaw is there a way to force worker recycling in rllib or tune? I’m wondering if this would solve the issue I am having with memory growth in rllib.

Read more comments on GitHub >

Top Results From Across the Web

  • Ray Core API — Ray 2.2.0 - the Ray documentation
    This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can…

  • Memory Management — Ray 1.11.0
    Worker heap: memory used by your application (e.g., in Python code or TensorFlow), best measured as the resident set size (RSS) of your…

  • Out-Of-Memory Prevention — Ray 2.2.0
    It periodically checks the memory usage, which includes the worker heap, the object store, and the raylet as described in memory management. If…

  • Help debugging a memory leak in rllib - Ray.io
    ray memory also does not show any obvious leakage at all. All of the workers very slowly (over about 8-12 hours) accumulate memory…

  • Pattern: Using generators to reduce heap memory usage
    The key idea is that for tasks that return multiple objects, we can return them one at a time instead of all at…
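
The last result points at a pattern that may also help here: instead of materializing everything a task produces at once, the task can yield results one at a time. A minimal sketch, assuming a Ray 2.x release where num_returns="dynamic" is available (the task and names below are illustrative, not from the original issue):

import ray

ray.init()

# Yield results one at a time so only a single item needs to live in the
# worker heap / object store at any given moment.
@ray.remote(num_returns="dynamic")
def collect_chunks(num_chunks):
    for i in range(num_chunks):
        yield {"chunk": i}  # e.g. one rollout's worth of statistics

ref_generator = ray.get(collect_chunks.remote(10))
for object_ref in ref_generator:
    chunk = ray.get(object_ref)  # consume each chunk, then let it go out of scope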
