[Bug] Infeasible placement group is not scheduled after the cluster is auto-scaled
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core, Ray Clusters
What happened + What you expected to happen
- When creating a single placement group (STRICT_PACK) for a node on EC2, it never becomes ready, even though the node comes up (as seen in ray monitor). Looking at the placement_group_table, the state is INFEASIBLE. Once the node is up, it is possible to remove and re-add the placement group, and it then becomes ready. This only happens with a single node/placement group; with two or more nodes it works. However, placement_group_table then says ‘scheduling_attempt’: 3 once they are up, so something still seems a bit fishy.
- I haven’t been able to get much out of the logs. The autoscaler just keeps saying:
2021-12-08 02:08:11,690 WARNING resource_demand_scheduler.py:731 -- The autoscaler could not find a node type to satisfy the request: [{'bundle_group_f603e2de7cd8965a3e9e30f6357dbb7b': 0.001}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
even though the instance is up.
Versions / Dependencies
Ray 1.9.0 and 1.8.0 give the same results; Python 3.8.10 and 3.7.7 also give the same results.
Reproduction script
from ray.util.placement_group import placement_group, placement_group_table, remove_placement_group
import ray

ray.init('auto')

bundle = {"CPU": 4, "GPU": 1}
pg = placement_group(
    [bundle],
    name="worker_1",
    strategy="STRICT_PACK",
)
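Continuing from that script, this is roughly how we observe the state and apply the remove/re-add workaround described above (the timeout and polling here are illustrative, not part of the original report):

# Wait for the placement group; with a single node this never resolves.
ready, _ = ray.wait([pg.ready()], timeout=600)
if not ready:
    # placement_group_table(pg) reports the state; we see INFEASIBLE here
    # even after ray monitor shows the autoscaled worker as up.
    print(placement_group_table(pg))

    # Workaround: once the node is up, remove and re-create the placement group.
    remove_placement_group(pg)
    pg = placement_group([bundle], name="worker_1", strategy="STRICT_PACK")
    ray.get(pg.ready())  # becomes ready after the re-add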
Anything else
This happens every time we have tried so far. Here is a minimal YAML for reproduction:
cluster_name: ray-test
max_workers: 2

docker:
  image: "rayproject/ray:1.9.0"
  container_name: "ray_docker"
  pull_before_run: True

provider:
  type: aws
  region: eu-central-1
  availability_zone: eu-central-1a,eu-central-1b
  use_internal_ips: True
  cache_stopped_nodes: False  # If not present, the default is True.

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/<key>

available_node_types:
  gpu_ondemand:
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0bdfa178b5c1fb9e6  # Deep Learning AMI (Ubuntu 18.04)
      KeyName: <key-name>
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 150
      SubnetIds:
        - <subnet1>
        - <subnet2>
      IamInstanceProfile:
        Arn: <ARN>
    # Autoscaler will auto fill the CPU/GPU resources below.
    resources:
      CPU: 4
      GPU: 1
    max_workers: 15
  cpu_ondemand:
    node_config:
      InstanceType: m5.large
      SubnetIds:
        - <subnet1>
        - <subnet2>
      ImageId: ami-0bdfa178b5c1fb9e6  # Deep Learning AMI (Ubuntu 18.04)
      KeyName: <key-name>
      BlockDeviceMappings:
        - DeviceName: /dev/sda1
          Ebs:
            VolumeSize: 150
    max_workers: 0
    resources:
      CPU: 2

head_node_type: cpu_ondemand
worker_default_node_type: gpu_ondemand
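For completeness, a small check run from the head node (illustrative, not part of the original report) can confirm that the autoscaled GPU worker has joined the cluster while the placement group is still not scheduled:

import pprint
import time

import ray
from ray.util.placement_group import placement_group_table

ray.init("auto")

# Wait for the autoscaled g4dn.xlarge worker to register its GPU with the cluster.
while ray.cluster_resources().get("GPU", 0) < 1:
    time.sleep(10)

# The GPU node is now part of the cluster...
print(ray.cluster_resources())
print(ray.available_resources())

# ...yet, per the report above, the placement group created by the
# reproduction script still shows up as infeasible in this table.
pprint.pprint(placement_group_table())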
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
@noe-mis This issue should be fixed by https://github.com/ray-project/ray/pull/21993.
@dirkweissenborn I believe your issue is the same one as https://github.com/ray-project/ray/issues/21920. Let’s follow up there.
@efokschaner I believe your issue is equivalent to https://github.com/ray-project/ray/issues/18309. I will make a fix for this issue within the next week.
I made a smaller repro, though it’s unclear whether I’m hitting the same issue as others here, since my Tune trial won’t start even if I re-launch it after scaling.
Here’s what my repro looks like: I’m on Ray 1.9.2, with my RLlib run set to one worker and 4 CPUs per worker.
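The actual training script isn’t reproduced here, but a run along these lines (purely illustrative; the environment, algorithm, and stopping criteria below are placeholders) exercises the same 1-worker / 4-CPUs-per-worker setup:

import ray
from ray import tune

ray.init("auto")

# Illustrative only: a single RLlib rollout worker with 4 CPUs per worker,
# i.e. a 4 + 1 CPU request once the trainer process itself is counted.
tune.run(
    "PPO",
    config={
        "env": "CartPole-v0",
        "framework": "torch",
        "num_workers": 1,
        "num_cpus_per_worker": 4,
    },
    stop={"training_iteration": 1},
)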
This results in the trial not starting in the Tune output, and ray monitor shows the cluster state (both outputs omitted here). I also grabbed the following state from Ray in case it helps:
ray.available_resources()
ray.cluster_resources()
ray.util.placement_group_table()
That NO_RESOURCES state is a bit surprising given that I definitely have 4 + 1 CPUs available. Let me know if there’s any other information I can provide, or any tips on debugging this further.