[Bug] Resource demand backlog caps out at 10+ in multi-node K8s cluster
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
My Setup:
I have deployed a Ray cluster on Kubernetes with EKS. The cluster has a 15-CPU worker type that is always on (min_workers: 1); the other worker types have 7, 15, and 30 CPUs.
I have also configured upscaling_speed=9999, AUTOSCALER_MAX_LAUNCH_BATCH=9999, and AUTOSCALER_MAX_CONCURRENT_LAUNCHES=9999, as recommended here: https://github.com/ray-project/ray/issues/21683#issuecomment-1018015823
Main Issue:
Whenever I launch 200 tasks, the autoscaler launches only a single 7-CPU worker. Once that worker is ready, it launches another single 7-CPU worker. I expect it to launch enough workers for at least 185 CPUs in the first iteration (200 one-CPU tasks minus the 15 CPUs of the always-on worker).
In the Ray Operator pod logs, I see {'CPU': 1.0}: 1+ pending tasks/actors. I would expect this to read something like "185+ pending tasks/actors".
However, I can successfully use ray.autoscaler.sdk.request_resources to request 200 CPUs all at once.
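For reference, that workaround looks roughly like the sketch below. ray.autoscaler.sdk.request_resources is the public autoscaler SDK call; the cluster address is a placeholder and num_cpus=200 simply matches the size of the task batch.

import ray
from ray.autoscaler.sdk import request_resources

# Placeholder address; replace with your own Ray client endpoint.
ray.init("ray://mycluster.internal:10001")

# Ask the autoscaler to provision capacity for 200 CPUs directly.
# Unlike the pending-task backlog above, this scales the cluster up in one step.
request_resources(num_cpus=200)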
Ray Operator Pod Logs:
kubectl logs -f po/ray-operator-987dc99b9-47v92 | grep Resources -A 10
Resources
---------------------------------------------------------------
Usage:
0.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
(no resource demands)
py38-cu112,karpenter:2022-02-04 10:47:45,337 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.6050498485565186, '10.16.126.217': 0.6049869060516357}\n - NodeIdleSeconds: Min=0 Mean=38219 Max=76437\n - ResourceUsage: 0.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:45,338 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
4.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:51,278 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.5087563991546631, '10.16.126.217': 0.5086848735809326}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 4.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:51,289 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
10.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:47:57,323 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.4549856185913086, '10.16.126.217': 0.4549136161804199}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 10.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:47:57,324 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
15.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
py38-cu112,karpenter:2022-02-04 10:48:03,348 DEBUG gcs_utils.py:245 -- internal_kv_put b'__autoscaling_status_legacy' b"Cluster status: 1 nodes\n - MostDelayedHeartbeats: {'10.16.115.16': 0.54921555519104, '10.16.126.217': 0.5491447448730469}\n - NodeIdleSeconds: Min=0 Mean=0 Max=0\n - ResourceUsage: 15.0/15.0 CPU, 0.0 GiB/25.9 GiB memory, 0.0 GiB/10.25 GiB object_store_memory\n - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0\nWorker node types:\n - wkr-15cpu30g-ondemand: 1" True None
py38-cu112,karpenter:2022-02-04 10:48:03,350 DEBUG legacy_info_string.py:26 -- Cluster status: 1 nodes
--
Resources
---------------------------------------------------------------
Usage:
15.0/15.0 CPU
0.00/25.900 GiB memory
0.00/10.249 GiB object_store_memory
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
Ray Pod Description:
Name: ray-operator-987dc99b9-47v92
Namespace: karpenter
Priority: 0
Node: ip-10-16-65-175.us-west-2.compute.internal/10.16.65.175
Start Time: Thu, 03 Feb 2022 13:05:01 -0600
Labels: cluster.ray.io/component=operator
pod-template-hash=987dc99b9
Annotations: kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.16.68.93
IPs:
IP: 10.16.68.93
Controlled By: ReplicaSet/ray-operator-987dc99b9
Containers:
ray:
Container ID: docker://f78dc8efee6c1f8483438e325c275a281c310341a1f304f4811aa8c80d263903
Image: rayproject/ray:bbc64e
Image ID: docker-pullable://rayproject/ray@sha256:6a0520c449555a31418ed79188827553e43276db6abc1bc4341edd16dd832eb3
Port: <none>
Host Port: <none>
Command:
ray-operator
State: Running
Started: Thu, 03 Feb 2022 13:05:03 -0600
Ready: True
Restart Count: 0
Limits:
cpu: 1
memory: 2Gi
Requests:
cpu: 1
ephemeral-storage: 1Gi
memory: 1Gi
Environment:
AUTOSCALER_MAX_NUM_FAILURES: inf
AUTOSCALER_MAX_LAUNCH_BATCH: 9999
AUTOSCALER_MAX_CONCURRENT_LAUNCHES: 9999
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wq2kl (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
kube-api-access-wq2kl:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
Versions / Dependencies
Ray Operator Image: rayproject/ray:bbc64e
Ray Client Version: 1.9.2
Python: 3.8
Kubernetes: 1.21 (EKS)
Reproduction script
from pprint import pprint

import ray

ray.init("ray://mycluster.internal:10001")

@ray.remote
def task():
    import time
    time.sleep(30)

pprint(ray.cluster_resources())

results = ray.get([task.remote() for _ in range(200)])
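Not part of the original report, but a minimal sketch of how the slow ramp-up can be watched from a second driver connected to the same cluster (or before the blocking ray.get above). It uses only the public ray.cluster_resources() and ray.available_resources() APIs; the poll interval and round count are arbitrary.

import time
import ray

def watch_cpu_usage(interval_s=5, rounds=24):
    # Poll total vs. free CPUs to see how quickly new workers join and get used.
    for _ in range(rounds):
        total = ray.cluster_resources().get("CPU", 0.0)
        free = ray.available_resources().get("CPU", 0.0)
        print(f"CPU in use: {total - free:.0f}/{total:.0f}")
        time.sleep(interval_s)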
Anything else
This issue occurs every time.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
You need to wait for the workers to come up before it can be reproduced.
@iycheng after a bit of fiddling I was able to reproduce on a K8s cluster: https://console.anyscale.com/o/anyscale-internal/clusters/ses_ts49ySqkJ1bSxYD7tXuhjWmH
I think the main thing is you need a large number of worker nodes (wait for 20 nodes to start). In that condition, when I ran the following script I saw that the CPUs in the cluster took a long time to become utilized, though eventually they did (and also eventually the demands increased from 1->9k):