
[Bug] Resource demand backlog caps out at 10+ in multi-node K8s cluster


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

My Setup:

I have deployed a Ray cluster on Kubernetes with EKS. The cluster has a worker type with 15 CPUs that is always on (min_workers: 1). The other worker types have 7, 15, and 30 CPUs.

I’ve also configured upscaling_speed=9999, AUTOSCALER_MAX_LAUNCH_BATCH=9999, and AUTOSCALER_MAX_CONCURRENT_LAUNCHES=9999, as recommended in https://github.com/ray-project/ray/issues/21683#issuecomment-1018015823
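
For reference, this is roughly how those settings are wired up in my deployment. This is a hedged sketch: the env vars match the operator pod description further down, but the RayCluster fields follow my recollection of the legacy cluster.ray.io/v1 CRD, the 7-CPU pod type name is illustrative, and the pod specs are elided.

# RayCluster custom resource (legacy cluster.ray.io/v1 Python operator; field names may differ by version)
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: mycluster
spec:
  maxWorkers: 400                  # illustrative upper bound
  upscalingSpeed: 9999             # scale up as aggressively as possible
  idleTimeoutMinutes: 5
  headPodType: head-node
  podTypes:
    - name: wkr-15cpu30g-ondemand  # the always-on 15-CPU worker type (name taken from the logs)
      minWorkers: 1
      maxWorkers: 100
      # podConfig: ... (pod template with 15 CPUs, elided)
    - name: wkr-7cpu14g-ondemand   # illustrative name for the 7-CPU worker type
      minWorkers: 0
      maxWorkers: 100
      # podConfig: ... (elided)

# Operator Deployment container env (matches the pod description below)
env:
  - name: AUTOSCALER_MAX_LAUNCH_BATCH
    value: "9999"
  - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
    value: "9999"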

Main Issue:

Whenever I launch 200 tasks, the autoscaler launches only a single 7-CPU worker. After that worker is ready, it launches another single 7-CPU worker. I expect it to launch at least enough workers for 185 CPUs in the first iteration.

In the Ray Operator Pod Logs, I see {'CPU': 1.0}: 1+ pending tasks/actors. I would expect this to read something like “185+ pending tasks/actors”.

However, I can successfully use ray.autoscaler.sdk.request_resources to request 200 CPUs all at once.
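
For comparison, this is roughly what the working call looks like. A minimal sketch: request_resources is part of the public autoscaler SDK, and the address is the same placeholder used in the reproduction script below.

import ray
from ray.autoscaler.sdk import request_resources

# Connect to the cluster via Ray Client (same address as the reproduction script).
ray.init("ray://mycluster.internal:10001")

# Ask the autoscaler for 200 CPUs up front; this does trigger a full scale-up,
# unlike the per-task demand described above.
request_resources(num_cpus=200)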


Ray Operator Pod Logs: kubectl logs -f po/ray-operator-987dc99b9-47v92 | grep Resources -A 10

Resources
---------------------------------------------------------------
Usage:
 0.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 (no resource demands)
py38-cu112,karpenter:2022-02-04 10:47:45,337    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.6050498485565186, '10.16.126.217': 0.6049869060516357}\n - NodeIdleSeconds: Min=0 Mean=38219 Max=76437\n - ResourceUsage: 0.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:45,338    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 4.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:51,278    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.5087563991546631, '10.16.126.217': 0.5086848735809326}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:51,289    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 10.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:57,323    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.4549856185913086, '10.16.126.217': 0.4549136161804199}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 10.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:57,324    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:48:03,348    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.54921555519104, '10.16.126.217': 0.5491447448730469}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 15.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:48:03,350    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors

Ray Operator Pod Description:

Name:         ray-operator-987dc99b9-47v92
Namespace:    karpenter
Priority:     0
Node:         ip-10-16-65-175.us-west-2.compute.internal/10.16.65.175
Start Time:   Thu, 03 Feb 2022 13:05:01 -0600
Labels:       cluster.ray.io/component=operator
              pod-template-hash=987dc99b9
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.16.68.93
IPs:
  IP:           10.16.68.93
Controlled By:  ReplicaSet/ray-operator-987dc99b9
Containers:
  ray:
    Container ID:  docker://f78dc8efee6c1f8483438e325c275a281c310341a1f304f4811aa8c80d263903
    Image:         rayproject/ray:bbc64e
    Image ID:      docker-pullable://rayproject/ray@sha256:6a0520c449555a31418ed79188827553e43276db6abc1bc4341edd16dd832eb3
    Port:          <none>
    Host Port:     <none>
    Command:
      ray-operator
    State:          Running
      Started:      Thu, 03 Feb 2022 13:05:03 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:                1
      ephemeral-storage:  1Gi
      memory:             1Gi
    Environment:
      AUTOSCALER_MAX_NUM_FAILURES:         inf
      AUTOSCALER_MAX_LAUNCH_BATCH:         9999
      AUTOSCALER_MAX_CONCURRENT_LAUNCHES:  9999
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wq2kl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wq2kl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Versions / Dependencies

  • Ray Operator Image: rayproject/ray:bbc64e
  • Ray Client Version: 1.9.2
  • Python: 3.8
  • Kubernetes: 1.21 (EKS)

Reproduction script

from pprint import pprint
import time

import ray

# Connect to the cluster via Ray Client.
ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    # Each task holds one CPU for 30 seconds.
    time.sleep(30)

pprint(ray.cluster_resources())

# Submit 200 one-CPU tasks; the autoscaler should scale out to run them.
results = ray.get([task.remote() for _ in range(200)])

Anything else

This issue occurs every time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 48 (48 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Feb 24, 2022

You need to wait for the workers to come up before it can be reproduced.

1 reaction
ericl commented, Feb 23, 2022

@iycheng after a bit of fiddling I was able to reproduce on a K8s cluster: https://console.anyscale.com/o/anyscale-internal/clusters/ses_ts49ySqkJ1bSxYD7tXuhjWmH

I think the main thing is you need a large number of worker nodes (wait for 20 nodes to start). In that condition, when I ran the following script I saw that the CPUs in the cluster took a long time to become utilized, though eventually they did (and also eventually the demands increased from 1->9k):

import ray
import time

@ray.remote
def foo():
    # Hold one CPU long enough that demand stays pending.
    time.sleep(999)

ray.get([foo.remote() for _ in range(10000)])
