[Bug] Out of Memory Exception on K8s related to `num_cpu`
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core, Ray Clusters
What happened + What you expected to happen
I have a cluster deployed on K8s; it starts with 1 head node and 3 worker nodes.
When I submit a remote task, it scales up and down perfectly.
However, I cannot create Actors without getting an Out of Memory error,
which is baffling because the Actors do nothing (see the code below) and the cluster has 500 Gi of memory and 64 CPUs! And it happens at a very early stage (Actor number 33)!
Logs:
(Actor pid=76, ip=10.244.6.33) Actor 0 Created
(Actor pid=102, ip=10.244.6.33) Actor 1 Created
(Actor pid=128, ip=10.244.6.33) Actor 2 Created
(Actor pid=154, ip=10.244.6.33) Actor 3 Created
(Actor pid=180, ip=10.244.6.33) Actor 4 Created
(Actor pid=206, ip=10.244.6.33) Actor 5 Created
(Actor pid=232, ip=10.244.6.33) Actor 6 Created
(Actor pid=258, ip=10.244.6.33) Actor 7 Created
(Actor pid=284, ip=10.244.6.33) Actor 8 Created
(Actor pid=310, ip=10.244.6.33) Actor 9 Created
(Actor pid=336, ip=10.244.6.33) Actor 10 Created
WARNING: 4 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(Actor pid=362, ip=10.244.6.33) Actor 11 Created
WARNING: 5 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
...
(Actor pid=936, ip=10.244.6.33) Actor 33 Created
WARNING: 27 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Traceback (most recent call last):
File "/tmp/tmp.ywColfOwgD", line 67, in <module>
main()
File "/tmp/tmp.ywColfOwgD", line 54, in main
ray.get(handle.ready.remote())
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/worker.py", line 359, in get
res = self._get(to_get, op_timeout)
File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/worker.py", line 386, in _get
raise err
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::35:Actor.__init__ (pid=962, ip=10.244.6.33)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-worker-6888ddbcd7-24dbv is used (1.92 / 2.0 GB). The top 10 memory consumers are:
PID MEM COMMAND
60 0.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agen
1 0.06GiB /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --num-cpus=1 --address=xxxx
336 0.05GiB ray::Actor
831 0.05GiB ray::Actor
284 0.05GiB ray::Actor
102 0.05GiB ray::Actor
414 0.05GiB ray::Actor
128 0.05GiB ray::Actor
701 0.05GiB ray::Actor
76 0.05GiB ray::Actor
In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object-store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---
Versions / Dependencies
- Ray 1.9.2
- Python 3.9
- OS: Ubuntu image on K8s.
Reproduction script
# Job to submit a Ray program from a pod outside a running Ray cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-test-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: xxx
          image: ray:1.9.2-py39-gpu
          command:
            - bash
            - -ec
            - |
              program_path=$(mktemp)
              printf "%s" "$0" > "$program_path"
              python3 -u "$program_path" "$@"
            - |
              from collections import Counter
              import sys
              import time

              import ray

              """This script is meant to be run from a pod in the same
              Kubernetes namespace as your Ray cluster."""

              @ray.remote(num_cpus=0.01, memory=1024)
              class Actor:
                  def __init__(self, x):
                      self.x = x
                      print("Actor {} Created".format(x))

                  def ready(self):
                      pass

              def main():
                  for x in range(100):
                      print("Iteration {}".format(x + 1))
                      handle = Actor.options(
                          name=str(x + 1), lifetime="detached", namespace="namespace"
                      ).remote(x)
                      ray.get(handle.ready.remote())
                      time.sleep(0.1)

              if __name__ == "__main__":
                  with ray.init("ray://cluster-ray-head:10001"):
                      main()
          resources:
            requests:
              cpu: 100m
              memory: 512Mi
Anything else
Please note the following behaviour (see the sketch after this list):
- Using @ray.remote for the Actor without passing any arguments throws an error (which is very disappointing, as I thought one could create an actor without reserving any CPU ahead of time, and the actor would then take as much as it requires when the time comes).
- Using @ray.remote(num_cpus=0.5, memory=xxxx) does not throw the error, regardless of the memory value used, yet resources still run out quickly!
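For reference, a minimal sketch of the decorator variants being compared; the class body is a stand-in, and the memory value in the last variant is an arbitrary placeholder for the "xxxx" above (Ray's memory option is specified in bytes):

import ray

# Variant reported above to raise an error: no explicit resources.
@ray.remote
class BareActor:
    def ready(self):
        pass

# Variant from the reproduction script: a tiny CPU fraction per actor,
# which lets Ray pack many actors onto a single node.
# memory is in bytes, so 1024 here is only about 1 KiB.
@ray.remote(num_cpus=0.01, memory=1024)
class TinyCpuActor:
    def ready(self):
        pass

# Variant reported not to raise: a larger CPU fraction per actor,
# which bounds how many actors can land on one node.
# 100 MiB is a placeholder, not a value taken from the report.
@ray.remote(num_cpus=0.5, memory=100 * 1024 * 1024)
class HalfCpuActor:
    def ready(self):
        pass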
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
Alright, got it, thanks. So in summary, the solutions/options are as follows:
- Memory as another metric in Ray, alongside the number of CPUs (not available for now).

Basically, this is how scheduling works in Ray now:
Each node has resources; usually they are calculated automatically. Let's imagine your worker node has 8 CPUs.
Each actor then requests CPU. If you request 0.5 CPU, you can schedule up to 16 actors on that node; if you request 0.1, it will be 80.
Ray's default scheduling policy is hybrid: it does not spread actors as much as possible. The default is to keep packing a node until half of its resources are used, then spread.
So let's imagine you have 2 GB of memory and 8 CPUs per node. If you request 0.1 CPU per actor, up to 80 actors could fit in total; since the default policy packs to half before spreading, up to 40 actors will be created on one node.
If each actor uses 50 MB of memory, 50 MB * 40 == 2 GB, which is right at the memory limit, so it can cause the OOM.
If you request 0.5 CPU, only 8 actors will be created per node, which means memory usage won't go above 50 MB * 8 == 400 MB.
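A rough back-of-the-envelope sketch of that arithmetic, assuming the numbers from the example above (8 CPUs per worker node, roughly 50 MB of resident memory per idle actor, and the default hybrid policy packing a node to about half of its CPU resource before spreading):

# Illustration of the hybrid-scheduling arithmetic described above.
# All inputs are assumptions taken from the discussion, not values queried from Ray.

def max_actors_per_node(node_cpus, num_cpus_per_actor, pack_threshold=0.5):
    # With the hybrid policy, Ray keeps packing a node until roughly
    # pack_threshold of its CPU resource is claimed, then spreads.
    return int(round((node_cpus * pack_threshold) / num_cpus_per_actor))

NODE_CPUS = 8          # assumed worker-node CPU count from the example
ACTOR_RSS_MB = 50      # assumed resident memory per idle actor
NODE_MEM_MB = 2048     # the 2 GB pod limit seen in the OOM trace

for num_cpus in (0.01, 0.1, 0.5):
    actors = max_actors_per_node(NODE_CPUS, num_cpus)
    est_mb = actors * ACTOR_RSS_MB
    print("num_cpus={}: up to {} actors on one node, ~{} MB of {} MB".format(
        num_cpus, actors, est_mb, NODE_MEM_MB))

Note that in the failing cluster each worker was started with --num-cpus=1 (see the log above), so with num_cpus=0.01 per actor several dozen actors can still be packed onto a single 2 GB pod before spreading kicks in, which is consistent with the OOM around actor 33.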