[Bug] Infeasible placement group is not scheduled after the cluster is auto-scaled
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core, Ray Clusters
What happened + What you expected to happen
- When creating a single placement group (STRICT_PACK) for a node on EC2, it never becomes ready, even though the node comes up (as seen in ray monitor). Looking at the placement_group_table, the state is INFEASIBLE. Once the node is up, it is possible to remove and re-add the placement group, and it then becomes ready. This only happens with a single node/placement group; with two or more nodes it works. However, placement_group_table then says ‘scheduling_attempt’: 3 once they are up, so something still seems a bit fishy.
- I haven’t been able to get much out of the logs. The autoscaler just keeps saying:
2021-12-08 02:08:11,690 WARNING resource_demand_scheduler.py:731 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f603e2de7cd8965a3e9e30f6357dbb7b': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
even though the instance is up.
Versions / Dependencies
Ray 1.9.0 and 1.8.0 give the same results; Python 3.8.10 and 3.7.7 also give the same results.
Reproduction script
from ray.util.placement_group import placement_group, placement_group_table, remove_placement_group
import ray

ray.init('auto')

bundle = {"CPU": 4, "GPU": 1}
pg = placement_group(
    [bundle],
    name="worker_1",
    strategy="STRICT_PACK",
)
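Continuing from that script, this is roughly how we observe the state and apply the remove/re-add workaround described above (the timeout and polling here are illustrative, not part of the original report):

# Wait for the placement group; with a single node this never resolves.
ready, _ = ray.wait([pg.ready()], timeout=600)
if not ready:
    # placement_group_table(pg) reports the state; we see INFEASIBLE here
    # even after ray monitor shows the autoscaled worker as up.
    print(placement_group_table(pg))

    # Workaround: once the node is up, remove and re-create the placement group.
    remove_placement_group(pg)
    pg = placement_group([bundle], name="worker_1", strategy="STRICT_PACK")
    ray.get(pg.ready())  # becomes ready after the re-add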
Anything else
This happens every time we have tried so far. Here is a minimal YAML for reproduction:
cluster_name: ray-test
max_workers: 2

docker:
  image: "rayproject/ray:1.9.0"
  container_name: "ray_docker"
  pull_before_run: True

provider:
  type: aws
  region: eu-central-1
  availability_zone: eu-central-1a,eu-central-1b
  use_internal_ips: True
  cache_stopped_nodes: False  # If not present, the default is True.

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/<key>

available_node_types:
  gpu_ondemand:
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0bdfa178b5c1fb9e6  # Deep Learning AMI (Ubuntu 18.04)
      KeyName: <key-name>
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 150
      SubnetIds:
        - <subnet1>
        - <subnet2>
      IamInstanceProfile:
        Arn: <ARN>
    # Autoscaler will auto fill the CPU/GPU resources below.
    resources:
      CPU: 4
      GPU: 1
    max_workers: 15
  cpu_ondemand:
    node_config:
      InstanceType: m5.large
      SubnetIds:
        - <subnet1>
        - <subnet2>
      ImageId: ami-0bdfa178b5c1fb9e6  # Deep Learning AMI (Ubuntu 18.04)
      KeyName: <key-name>
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 150
    max_workers: 0
    resources:
      CPU: 2

head_node_type: cpu_ondemand
worker_default_node_type: gpu_ondemand
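For completeness, a small check run from the head node (illustrative, not part of the original report) can confirm that the autoscaled GPU worker has joined the cluster while the placement group is still not scheduled:

import pprint
import time

import ray
from ray.util.placement_group import placement_group_table

ray.init("auto")

# Wait for the autoscaled g4dn.xlarge worker to register its GPU with the cluster.
while ray.cluster_resources().get("GPU", 0) < 1:
    time.sleep(10)

# The GPU node is now part of the cluster...
print(ray.cluster_resources())
print(ray.available_resources())

# ...yet, per the report above, the placement group created by the
# reproduction script still shows up as infeasible in this table.
pprint.pprint(placement_group_table())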
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
@noe-mis This issue should be fixed by https://github.com/ray-project/ray/pull/21993.
@dirkweissenborn I believe your issue is the same one as https://github.com/ray-project/ray/issues/21920. Let’s follow up there.
@efokschaner I believe your issue is equivalent to https://github.com/ray-project/ray/issues/18309. I will make a fix for this issue within the next week.
I made a smaller repro, though it’s unclear whether I’m hitting the same issue as others here, since my Tune trial won’t start even if I re-launch it after scaling.
Here’s what my repro looks like: I’m on Ray 1.9.2, with my RLlib run set to one worker and 4 CPUs per worker.
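The actual training script isn’t reproduced here, but a run along these lines (purely illustrative; the environment, algorithm, and stopping criteria below are placeholders) exercises the same 1-worker / 4-CPUs-per-worker setup:

import ray
from ray import tune

ray.init("auto")

# Illustrative only: a single RLlib rollout worker with 4 CPUs per worker,
# i.e. a 4 + 1 CPU request once the trainer process itself is counted.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "framework": "torch",
        "num_workers": 1,
        "num_cpus_per_worker": 4,
    },
    stop={"training_iteration": 1},
)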
This results in the trial not starting in the Tune output, and ray monitor shows the cluster state (both outputs omitted here). I also grabbed the following state from Ray in case it helps:
ray.available_resources()
ray.cluster_resources()
ray.util.placement_group_table()
That NO_RESOURCES state is a bit surprising given that I definitely have 4 + 1 CPUs available. Let me know if there’s any other information I can provide, or any tips on debugging this further.