Leaking Worker Memory ignores cap set by ray.init(memory=cap)
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18 LTS
- Ray installed from (source or binary): pip installation
- Ray version: 0.7.6
- Python version: 3.6.8
- Exact command to reproduce: -
Describe the problem
I am running 6 parallel workers that collect data by exploring gym environments and save the collected data into TFRecord files. Each worker returns a namedtuple instance with some statistics about the collection. On the main process I initialize ray as follows:
ray.init(local_mode=self.debug, num_cpus=6, logging_level=logging.DEBUG,
         object_store_memory=object_store_memory, memory=(1024.0 ** 3) * 0.5)
where object_store_memory is my own heuristic, which ends up close to the default value anyway. In a loop of n iterations (where n should, in theory, be able to grow arbitrarily large) I create the workers and collect the statistics as follows:
split_stats = ray.get([collect.remote(model_representation, self.horizon, self.env_name, self.discount, self.lam, self.tbptt_length, pid) for pid in range(self.workers)])
Now, I have observed that after some number of iterations my script crashes because the workers take up too much memory and none is left. However, if the workers together needed too much memory within a single iteration, I would expect the crash to happen immediately, not after, say, 200 iterations. I therefore monitored memory usage, and indeed the memory used by the workers grows over time. Why could this happen? The processes should be completely independent of each other, and I would expect a worker's memory to be released once its task finishes. The parameters of collect are, in order: a tuple of a string and a dict with the TensorFlow model weights, an integer, a string, a float, another float, an integer, and the process id.
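For context, here is a minimal, self-contained sketch of the driver pattern described above, written against the Ray 0.7-era API used in the issue. The constants, the environment name, and the body of collect are placeholders for illustration, not the author's actual code:

import collections
import logging

import ray

NUM_WORKERS = 6
HORIZON = 1000        # placeholder
N_ITERATIONS = 200    # placeholder; in the real script this can grow without bound

Stats = collections.namedtuple("Stats", ["pid", "steps"])

@ray.remote
def collect(model_representation, horizon, env_name, discount, lam,
            tbptt_length, pid):
    # Real code: rebuild the model from (architecture string, weight dict),
    # roll out the gym environment, write TFRecord files, return statistics.
    return Stats(pid=pid, steps=horizon)

ray.init(num_cpus=NUM_WORKERS, logging_level=logging.DEBUG,
         object_store_memory=1024 ** 3,            # placeholder heuristic
         memory=int((1024.0 ** 3) * 0.5))          # the 0.5 GiB cap from the issue

model_representation = ("model_json", {})          # placeholder (str, weight dict)

for _ in range(N_ITERATIONS):
    split_stats = ray.get([
        collect.remote(model_representation, HORIZON, "CartPole-v1",
                       0.99, 0.95, 16, pid)
        for pid in range(NUM_WORKERS)
    ])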
I also tried capping the memory with the memory flag in ray.init(), but memory usage just grows past the cap anyway. I have attached examples of this.
Source code / logs
Issue Analytics
- Created 4 years ago
- Comments: 7 (4 by maintainers)
Top GitHub Comments
It’s possible there is a leak somehow, since Ray re-uses worker processes. You can force Ray to recycle a worker after a fixed number of task executions with
@ray.remote(max_calls=5)
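For illustration, a short sketch of applying this to a task like collect from the issue (the signature follows the description above and is assumed, not the author's exact code); after max_calls executions the worker process exits and a fresh one is started, releasing any memory it had accumulated:

import ray

@ray.remote(max_calls=5)
def collect(model_representation, horizon, env_name, discount, lam,
            tbptt_length, pid):
    ...  # explore the environment, write TFRecord files, return statistics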
@ericl, @richardliaw is there a way to force worker recycling in RLlib or Tune? I’m wondering if this would solve the memory-growth issue I am having with RLlib.