
[Bug] Resource demand backlog caps out at 10+ in multi-node K8s cluster


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

My Setup:

I have deployed a Ray cluster on Kubernetes with EKS. The cluster has a worker type with 15 CPUs that is always on (min_workers: 1). The other worker types have 7, 15, and 30 CPUs.

I’ve also configured upscaling_speed=9999, AUTOSCALER_MAX_LAUNCH_BATCH=9999, and AUTOSCALER_MAX_CONCURRENT_LAUNCHES=9999, as recommended in https://github.com/ray-project/ray/issues/21683#issuecomment-1018015823
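
For reference, this is roughly how those settings are wired up in my deployment. This is a hedged sketch: the env vars match the operator pod description further down, but the RayCluster fields follow my recollection of the legacy cluster.ray.io/v1 CRD, the 7-CPU pod type name is illustrative, and the pod specs are elided.

# RayCluster custom resource (legacy cluster.ray.io/v1 Python operator; field names may differ by version)
apiVersion: cluster.ray.io/v1
kind: RayCluster
metadata:
  name: mycluster
spec:
  maxWorkers: 400                  # illustrative upper bound
  upscalingSpeed: 9999             # scale up as aggressively as possible
  idleTimeoutMinutes: 5
  headPodType: head-node
  podTypes:
    - name: wkr-15cpu30g-ondemand  # the always-on 15-CPU worker type (name taken from the logs)
      minWorkers: 1
      maxWorkers: 100
      # podConfig: ... (pod template with 15 CPUs, elided)
    - name: wkr-7cpu14g-ondemand   # illustrative name for the 7-CPU worker type
      minWorkers: 0
      maxWorkers: 100
      # podConfig: ... (elided)

# Operator Deployment container env (matches the pod description below)
env:
  - name: AUTOSCALER_MAX_LAUNCH_BATCH
    value: "9999"
  - name: AUTOSCALER_MAX_CONCURRENT_LAUNCHES
    value: "9999"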

Main Issue:

Whenever I launch 200 tasks, the autoscaler launches only a single 7-CPU worker. After that worker is ready, it launches another single 7-CPU worker. I expect it to launch at least enough workers for 185 CPUs in the first iteration.

In the Ray Operator Pod Logs, I see {'CPU': 1.0}: 1+ pending tasks/actors. I would expect this to read something like “185+ pending tasks/actors”.

However, I can successfully use ray.autoscaler.sdk.request_resources to request 200 CPUs all at once.
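
For comparison, this is roughly what the working call looks like. A minimal sketch: request_resources is part of the public autoscaler SDK, and the address is the same placeholder used in the reproduction script below.

import ray
from ray.autoscaler.sdk import request_resources

# Connect to the cluster via Ray Client (same address as the reproduction script).
ray.init("ray://mycluster.internal:10001")

# Ask the autoscaler for 200 CPUs up front; this does trigger a full scale-up,
# unlike the per-task demand described above.
request_resources(num_cpus=200)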


Ray Operator Pod Logs: kubectl logs -f po/ray-operator-987dc99b9-47v92 | grep Resources -A 10

Resources
---------------------------------------------------------------
Usage:
 0.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 (no resource demands)
py38-cu112,karpenter:2022-02-04 10:47:45,337    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.6050498485565186, '10.16.126.217': 0.6049869060516357}\n - NodeIdleSeconds: Min=0 Mean=38219 Max=76437\n - ResourceUsage: 0.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:45,338    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 4.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:51,278    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.5087563991546631, '10.16.126.217': 0.5086848735809326}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:51,289    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 10.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:57,323    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.4549856185913086, '10.16.126.217': 0.4549136161804199}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 10.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:57,324    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:48:03,348    DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.54921555519104, '10.16.126.217': 0.5491447448730469}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 15.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:48:03,350    DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
 15.0/15.0 CPU
 0.00/25.900 GiB memory
 0.00/10.249 GiB object_store_memory

Demands:
 {'CPU': 1.0}: 1+ pending tasks/actors

Ray Operator Pod Description:

Name:         ray-operator-987dc99b9-47v92
Namespace:    karpenter
Priority:     0
Node:         ip-10-16-65-175.us-west-2.compute.internal/10.16.65.175
Start Time:   Thu, 03 Feb 2022 13:05:01 -0600
Labels:       cluster.ray.io/component=operator
              pod-template-hash=987dc99b9
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Running
IP:           10.16.68.93
IPs:
  IP:           10.16.68.93
Controlled By:  ReplicaSet/ray-operator-987dc99b9
Containers:
  ray:
    Container ID:  docker://f78dc8efee6c1f8483438e325c275a281c310341a1f304f4811aa8c80d263903
    Image:         rayproject/ray:bbc64e
    Image ID:      docker-pullable://rayproject/ray@sha256:6a0520c449555a31418ed79188827553e43276db6abc1bc4341edd16dd832eb3
    Port:          <none>
    Host Port:     <none>
    Command:
      ray-operator
    State:          Running
      Started:      Thu, 03 Feb 2022 13:05:03 -0600
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:                1
      ephemeral-storage:  1Gi
      memory:             1Gi
    Environment:
      AUTOSCALER_MAX_NUM_FAILURES:         inf
      AUTOSCALER_MAX_LAUNCH_BATCH:         9999
      AUTOSCALER_MAX_CONCURRENT_LAUNCHES:  9999
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wq2kl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wq2kl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

Versions / Dependencies

  • Ray Operator Image: rayproject/ray:bbc64e
  • Ray Client Version: 1.9.2
  • Python: 3.8
  • Kubernetes: 1.21 (EKS)

Reproduction script

from pprint import pprint
import time

import ray

# Connect to the cluster via Ray Client.
ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    # Each task holds one CPU for 30 seconds.
    time.sleep(30)

pprint(ray.cluster_resources())

# Submit 200 one-CPU tasks; the autoscaler should scale out to run them.
results = ray.get([task.remote() for _ in range(200)])

Anything else

This issue occurs every time.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 48 (48 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Feb 24, 2022

You need to wait for the workers to come up before it can be reproduced.

1 reaction
ericl commented, Feb 23, 2022

@iycheng after a bit of fiddling I was able to reproduce on a K8s cluster: https://console.anyscale.com/o/anyscale-internal/clusters/ses_ts49ySqkJ1bSxYD7tXuhjWmH

I think the main thing is you need a large number of worker nodes (wait for 20 nodes to start). In that condition, when I ran the following script I saw that the CPUs in the cluster took a long time to become utilized, though eventually they did (and also eventually the demands increased from 1->9k):

import ray
import time

@ray.remote
def foo():
    # Hold one CPU long enough that demand stays pending.
    time.sleep(999)

ray.get([foo.remote() for _ in range(10000)])
