
[Bug] Out of Memory Exception on K8s related to `num_cpu`

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core, Ray Clusters

What happened + What you expected to happen

I have a cluster deployed on K8s, it starts with 1 head node and 3 worker nodes.

When I submit a remote “Task”, it scales up and down perfectly.

Nevertheless, I cannot create Actors without getting an Out of Memory error, which is bizarre because the Actors do nothing (see the code below) and the cluster has 500Gi of memory and 64 CPUs! And it happens at an early stage (around Actor number 33)!

Logs:

(Actor pid=76, ip=10.244.6.33) Actor 0 Created
(Actor pid=102, ip=10.244.6.33) Actor 1 Created
(Actor pid=128, ip=10.244.6.33) Actor 2 Created
(Actor pid=154, ip=10.244.6.33) Actor 3 Created
(Actor pid=180, ip=10.244.6.33) Actor 4 Created
(Actor pid=206, ip=10.244.6.33) Actor 5 Created
(Actor pid=232, ip=10.244.6.33) Actor 6 Created
(Actor pid=258, ip=10.244.6.33) Actor 7 Created
(Actor pid=284, ip=10.244.6.33) Actor 8 Created
(Actor pid=310, ip=10.244.6.33) Actor 9 Created
(Actor pid=336, ip=10.244.6.33) Actor 10 Created
WARNING: 4 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
(Actor pid=362, ip=10.244.6.33) Actor 11 Created
WARNING: 5 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
.
.
.
.
.
(Actor pid=936, ip=10.244.6.33) Actor 33 Created
WARNING: 27 PYTHON worker processes have been started on node: 1aa9a939fe504a0c28cde84db4ae4126a4756c44ecf51e0e83b74fde with address: 10.244.6.33. This could be a result of using a large number of actors, or due to tasks blocked in ray.get() calls (see https://github.com/ray-project/ray/issues/3644 for some discussion of workarounds).
Traceback (most recent call last):
  File "/tmp/tmp.ywColfOwgD", line 67, in <module>
    main()
  File "/tmp/tmp.ywColfOwgD", line 54, in main
    ray.get(handle.ready.remote())
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/worker.py", line 359, in get
    res = self._get(to_get, op_timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/util/client/worker.py", line 386, in _get
    raise err
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::35:Actor.__init__ (pid=962, ip=10.244.6.33)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-worker-6888ddbcd7-24dbv is used (1.92 / 2.0 GB). The top 10 memory consumers are:

PID	MEM	COMMAND
60	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.9/site-packages/ray/dashboard/agen
1	0.06GiB	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/ray start --num-cpus=1 --address=xxxx
336	0.05GiB	ray::Actor
831	0.05GiB	ray::Actor
284	0.05GiB	ray::Actor
102	0.05GiB	ray::Actor
414	0.05GiB	ray::Actor
128	0.05GiB	ray::Actor
701	0.05GiB	ray::Actor
76	0.05GiB	ray::Actor

In addition, up to 0.03 GiB of shared memory is currently being used by the Ray object-store.
---
--- Tip: Use the `ray memory` command to list active objects in the cluster.
--- To disable OOM exceptions, set RAY_DISABLE_MEMORY_MONITOR=1.
---

Versions / Dependencies

  • Ray 1.9.2
  • Python 3.9
  • OS: Ubuntu image on K8s.

Reproduction script

# Job to submit a Ray program from a pod outside a running Ray cluster.
apiVersion: batch/v1
kind: Job
metadata:
  name: ray-test-job
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: xxx
          image: ray:1.9.2-py39-gpu
          command:
          - bash
          - -ec
          - |
            program_path=$(mktemp)
            printf "%s" "$0" > "$program_path"
            python3 -u "$program_path" "$@"
          - |
            from collections import Counter
            import sys
            import time
            import ray
            """ This script is meant to be run from a pod in the same Kubernetes namespace as your Ray cluster.
            """
            @ray.remote(num_cpus=0.01, memory=1024)
            class Actor:
              def __init__(self, x):
                self.x = x
                print("Actor {} Created".format(x))

              def ready(self):
                pass

            def main():
                for x in range(100):
                    handle = Actor.options(name=str(x + 1), lifetime="detached", namespace="namespace").remote(x)
                    ray.get(handle.ready.remote())
                    time.sleep(0.1)


            if __name__ == "__main__":
                with ray.init("ray://cluster-ray-head:10001"):
                  main()

          resources:
            requests:
              cpu: 100m
              memory: 512Mi

Anything else

Please note the following behaviour:

  1. Using @ray.remote on the Actor without passing any arguments throws an error (which is very disappointing, as I thought one could create an Actor without reserving any CPU ahead of time, and the Actor would simply take as much as it requires when the time comes).
  2. Using @ray.remote(num_cpus=0.5, memory=xxxx) does not throw the error, regardless of the memory value used, yet resources still run out quickly (see the sketch after this list).
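
For context on point 2: in Ray, the memory argument to @ray.remote is a scheduling reservation expressed in bytes, not a usage limit, so memory=1024 reserves only about 1 KiB per Actor and the scheduler keeps packing Actors onto a node long after its RAM is spoken for. A minimal sketch of the difference (the class names and the 200 MiB figure are illustrative assumptions, not values taken from this issue):

import ray

# Assumed real footprint of one Actor at runtime (illustrative only).
ACTOR_MEM_BYTES = 200 * 1024 * 1024

# memory=1024 reserves ~1 KiB for scheduling, so a 2 GB node admits far more
# of these Actors than it can actually hold in RAM.
@ray.remote(num_cpus=0.01, memory=1024)
class TinyReservation:
    def ready(self):
        pass

# Reserving something close to the real usage makes the scheduler stop packing
# before the node runs out of memory (a 2 GB node fits at most ~10 of these).
@ray.remote(num_cpus=0.01, memory=ACTOR_MEM_BYTES)
class RealisticReservation:
    def ready(self):
        pass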

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 18 (6 by maintainers)

Top GitHub Comments

2 reactions
John-Almardeny commented, Feb 4, 2022

Alright, got it, thanks.

So, in summary, the solutions/options are as follows:

  1. The optimal future solution is for Ray to treat memory as another scheduling metric alongside the number of CPUs (not available for now).
  2. Set the threshold of the HPA memory metric below 50%, so the cluster scales up as early as possible, i.e. once less than half of the memory of a worker node is reserved.
  3. If you need to keep each worker node at 1 CPU or so, noticeably increase the memory allocated to that node, because the two work hand in hand. The required amount can be estimated with simple calculations (see the sketch after this list).
  4. Keep a reasonable memory allocation per node, but increase the CPU reserved by each Actor, so you don't end up with too many Actors on each worker node.
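
A rough version of the "simple calculations" from points 3 and 4, as a sketch; the function name and the example inputs below are placeholders, not measurements from this cluster:

# Hypothetical sizing helper; all inputs are illustrative.
def required_node_memory_gb(node_cpus, cpus_per_actor, mem_per_actor_mb,
                            packing_fraction=0.5):
    """Estimate the RAM a worker node needs for the Actors Ray may pack onto it.

    packing_fraction reflects Ray's default hybrid policy of packing a node up
    to roughly half of its resources before spilling to other nodes.
    """
    max_actors = round(node_cpus / cpus_per_actor * packing_fraction)
    return max_actors * mem_per_actor_mb / 1024

# Example: a 1-CPU worker node, Actors requesting 0.01 CPU and using ~50 MB each.
print(required_node_memory_gb(node_cpus=1, cpus_per_actor=0.01,
                              mem_per_actor_mb=50))   # ~2.44 GB needed
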
1 reaction
rkooo567 commented, Feb 4, 2022

Basically, this is how scheduling works in Ray today.

Each node has resources, usually calculated automatically. Let's imagine your worker node has 8 CPUs.

Each actor then requests CPU. If you request 0.5 CPU, you can schedule up to 16 actors; if you request 0.1, it is 80.

Ray's default scheduling policy is hybrid: it doesn't spread actors as much as possible. By default it tries to pack a node until about half of its resources are used before spreading to other nodes.

So let's imagine a node with 2 GB of memory and 8 CPUs. If each actor requests 0.1 CPU, up to 80 actors could fit by CPU alone; since the default policy packs to about half before spreading, you will create up to 40 actors on that node.

If each actor uses 50 MB of memory, 50 MB * 40 == 2 GB, which is right at the memory limit, so it can cause the OOM.

If you instead request 0.5 CPU, only 8 actors will be created per node, which means the memory usage won't go above 50 MB * 8 == 400 MB.
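
Writing that arithmetic out (a sketch using the same illustrative numbers as the comment above, which need not match this cluster exactly):

# Illustrative numbers from the explanation above, not measured values.
node_cpus = 8
node_mem_gb = 2.0
mem_per_actor_gb = 0.05      # ~50 MB per idle ray::Actor process

for cpus_per_actor in (0.1, 0.5):
    max_by_cpu = node_cpus / cpus_per_actor    # 80 or 16 Actors by CPU alone
    packed = max_by_cpu * 0.5                  # hybrid policy packs ~half first
    mem_needed = packed * mem_per_actor_gb
    print(f"num_cpus={cpus_per_actor}: ~{packed:.0f} actors, "
          f"~{mem_needed:.2f} GB of {node_mem_gb} GB")
# num_cpus=0.1 -> ~40 actors, ~2.00 GB of 2.0 GB  (at the limit, OOM risk)
# num_cpus=0.5 -> ~8 actors,  ~0.40 GB of 2.0 GB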
