Leaking Worker Memory ignores cap set by ray.init(memory=cap)

See original GitHub issue

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 18 LTS
  • Ray installed from (source or binary): pip installation
  • Ray version: 0.7.6
  • Python version: 3.6.8
  • Exact command to reproduce: -

Describe the problem

I am running 6 parallel workers that collect data by exploring gym environments and save the collected data into TFRecord files. Each worker returns a namedtuple instance with some statistics about the collection. In the main process I initialize Ray as follows:

ray.init(local_mode=self.debug, num_cpus=6, logging_level=logging.DEBUG,
           object_store_memory=object_store_memory, memory=(1024.0 ** 3) * 0.5)

where object_store_memory is my own heuristic, which in practice comes out close to the default value anyway. In a loop of n iterations (where n could in principle be arbitrarily large) I create workers and collect the statistics as follows:

split_stats = ray.get([
    collect.remote(model_representation, self.horizon, self.env_name,
                   self.discount, self.lam, self.tbptt_length, pid)
    for pid in range(self.workers)
])

Now, I have observed that after some number of iterations my script crashes because there is no memory left: the workers take up too much of it. However, if the workers together needed too much memory within a single iteration, I would expect the crash to happen immediately, not after, say, 200 iterations. I therefore monitored memory over time, and indeed the memory used by the workers grows steadily. Why could this happen? The processes should be completely independent of each other, and I would expect a worker's memory to be released once its task finishes. The parameters of collect are, in order: a tuple of a string and a dict of TensorFlow model weights, an integer, a string, a float, another float, an integer, and the process id.
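
For reference, here is a minimal sketch of how such a task could be declared, reconstructed from the call above (the function body and the CollectionStats namedtuple are placeholders, not the original code):

import collections

import ray

# Placeholder for the statistics namedtuple each worker returns.
CollectionStats = collections.namedtuple("CollectionStats", ["pid", "steps"])

@ray.remote
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    # model_representation is a (string, weights-dict) tuple as described above.
    # The real body would roll out the gym environment for `horizon` steps and
    # write the collected transitions to TFRecord files.
    return CollectionStats(pid=pid, steps=horizon)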

I also tried capping the memory with the memory flag in ray.init(), but memory usage simply grows past the cap anyway. I have attached examples of this.
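
As far as the Ray documentation describes, memory requests of this kind (including the memory argument to ray.init()) are used for scheduling rather than enforced as hard per-process limits, which would be consistent with the behaviour above. For completeness, a minimal sketch of attaching a memory request to the task itself (the 512 MiB figure is purely illustrative):

# Request 512 MiB per collect task; like ray.init(memory=...), this guides
# scheduling/admission control and does not impose a hard memory cap.
@ray.remote(memory=512 * 1024 * 1024)
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    ...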

Source code / logs

(Screenshots Selection_136 and Selection_137: the memory usage examples referenced above.)

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

5 reactions
ericl commented, Dec 9, 2019

It’s possible somehow there is a leak since Ray re-uses workers… You can force Ray to recycle the workers after a number of calls with @ray.remote(max_calls=5).
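
Applied to the collect task from the question, that suggestion would look roughly like this (a sketch; 5 is simply the value mentioned in the comment and can be tuned):

# Shut down and replace the worker process after every 5 executions of collect,
# so any memory leaked inside the worker (e.g. by TensorFlow) is returned to the OS.
@ray.remote(max_calls=5)
def collect(model_representation, horizon, env_name, discount, lam, tbptt_length, pid):
    ...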

1 reaction
eugenevinitsky commented, Dec 20, 2019

@ericl, @richardliaw is there a way to force worker recycling in rllib or tune? I’m wondering if this would solve the issue I am having with memory growth in rllib.

Read more comments on GitHub >

Top Results From Across the Web

  • Ray Core API — Ray 2.2.0 - the Ray documentation
    This specifies the maximum number of times that a given worker can execute the given remote function before it must exit (this can…

  • Memory Management — Ray 1.11.0
    Worker heap: memory used by your application (e.g., in Python code or TensorFlow), best measured as the resident set size (RSS) of your…

  • Out-Of-Memory Prevention — Ray 2.2.0
    It periodically checks the memory usage, which includes the worker heap, the object store, and the raylet as described in memory management. If…

  • Help debugging a memory leak in rllib - Ray.io
    ray memory also does not show any obvious leakage at all. All of the workers very slowly (over about 8-12 hours) accumulate memory…

  • Pattern: Using generators to reduce heap memory usage
    The key idea is that for tasks that return multiple objects, we can return them one at a time instead of all at…
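
The last result points at a pattern that may also help here: instead of materializing everything a task produces at once, the task can yield results one at a time. A minimal sketch, assuming a Ray 2.x release where num_returns="dynamic" is available (the task and names below are illustrative, not from the original issue):

import ray

ray.init()

# Yield results one at a time so only a single item needs to live in the
# worker heap / object store at any given moment.
@ray.remote(num_returns="dynamic")
def collect_chunks(num_chunks):
    for i in range(num_chunks):
        yield {"chunk": i}  # e.g. one rollout's worth of statistics

ref_generator = ray.get(collect_chunks.remote(10))
for object_ref in ref_generator:
    chunk = ray.get(object_ref)  # consume each chunk, then let it go out of scope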
