
[Bug] 1.9.0 PlacementGroup scheduling


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

This is not a contribution.

Since Tune requires placement groups as of 1.9.0, we’ve noticed that cloud jobs that combine custom resources with PlacementGroupFactory() fail to schedule all of their actors and hang indefinitely. Note that it is possible this bug was present before 1.9.0, as we were not using placement groups everywhere before.

In our logs we see warnings like this:

2022-01-26 01:00:53,234    WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffffef9a401d7616b4423d86795701000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_9ec05ccf6e0b6dc761548df83a8f300f: 1.000000}, {worker_group_9ec05ccf6e0b6dc761548df83a8f300f: 1.000000}, {memory_group_9ec05ccf6e0b6dc761548df83a8f300f: 61440000.000000 GiB}
Available resources on this node: {30.000000/36.000000 CPU, 2007361660.009766 GiB/2335041660.009766 GiB memory, 1000732139.990234 GiB/1000732139.990234 GiB object_store_memory, 204800000.000000 GiB/327680000.000000 GiB memory_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 worker1, 1.000000/1.000000 node:172.29.255.62, 1000.000000/1000.000000 bundle_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 0.000000/6.000000 worker_group_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 0.000000/6.000000 CPU_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 20480000.000000 GiB/20480000.000000 GiB memory_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 994.000000/1000.000000 worker, 61440000.000000 GiB/61440000.000000 GiB memory_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 6000.000000/6000.000000 bundle_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_38_9ec05ccf6e0b6dc761548df83a8f300f}
 In total there are 0 pending tasks and 4 pending actors on this node.

Versions / Dependencies

ray==1.9.0

Reproduction script

We are able to reproduce this with a cluster consisting of two c5.9xlarge nodes.

Below we have included the pg_bug.py script used to reproduce the bug, along with a description of our cluster setup. If all actors are scheduled properly, the script raises an Exception almost immediately; instead, it hangs indefinitely.

# pg_bug.py
import time

import ray
from ray import tune


@ray.remote
class Worker:
    def work(self):
        pass


# Memory specs of two types of workers
mem_per_worker = 400 * 1024**2  # 400 MB
mem_per_hm_worker = 3 * mem_per_worker


class Trainable(tune.Trainable):
    def setup(self, config):
        # Regular and high-memory ("hm") workers both claim the custom "worker" resource.
        worker_cls = Worker.options(memory=mem_per_worker, resources={'worker': 1})
        hm_worker_cls = Worker.options(memory=mem_per_hm_worker, resources={'worker': 1})
        self.workers = [worker_cls.remote() for _ in range(config['num_workers'])]
        self.hm_workers = [hm_worker_cls.remote() for _ in range(config['num_hm_workers'])]

    def step(self):
        ray.get([worker.work.remote() for worker in self.hm_workers + self.workers])
        raise Exception("If this exception is raised, scheduling did not hang!")

    @classmethod
    def default_resource_request(cls, config):
        # One bundle for the Trainable itself, then one bundle per regular
        # worker and one per high-memory worker.
        bundles = (
            [{'CPU': 1}]
            + [{'CPU': 1, 'memory': mem_per_worker, 'worker': 1}] * config['num_workers']
            + [{'CPU': 1, 'memory': mem_per_hm_worker, 'worker': 1}] * config['num_hm_workers']
        )
        return tune.PlacementGroupFactory(bundles, strategy='PACK')


if __name__ == "__main__":
    ray.init(address="auto")
    while ray.cluster_resources()['CPU'] < 72:
        time.sleep(5)  # Wait for the other node to come online.
    print("STARTING TEST")
    tune.run(
        Trainable,
        stop={'training_iteration': 1000000},
        config={
            'num_workers': 36,
            'num_hm_workers': 5
        },
    )

On the head node, we run

ray start --head --num-cpus=`nproc --all` --resources='{"worker": 1000}' && python3 pg_bug.py

On the other node, we run

ray start --address=$REDIS_SERVER_ADDRESS --num-cpus=`nproc --all` --resources='{"worker": 1000}' 

where REDIS_SERVER_ADDRESS is set to <head node hostname>:6379. We’ve omitted details specific to launching jobs and setting up the cluster on our internal infrastructure, but hopefully the above instructions are sufficient to reproduce the issue.

Anything else

In this setup, each c5.9xlarge instance has 36 CPUs and 72 GB of RAM. Even accounting for object store memory, there should be plenty of resources left, in terms of both CPU and memory, to run this job. Also note that while the use of the custom resource “worker” seems unnecessary in this case, we found it is a necessary part of the reproduction.
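
As a rough sanity check of that claim, here is a back-of-the-envelope calculation (a sketch using the values from pg_bug.py above; the capacity figures are approximate and line up with the ray status output quoted in the comments below):

# Back-of-the-envelope check: total placement group demand vs. cluster capacity.
# Values come from pg_bug.py; capacity figures are approximate.
GiB = 1024**3
mem_per_worker = 400 * 1024**2        # 400 MB
mem_per_hm_worker = 3 * mem_per_worker
num_workers, num_hm_workers = 36, 5

bundles = ([{'CPU': 1}]
           + [{'CPU': 1, 'memory': mem_per_worker, 'worker': 1}] * num_workers
           + [{'CPU': 1, 'memory': mem_per_hm_worker, 'worker': 1}] * num_hm_workers)

total_cpu = sum(b.get('CPU', 0) for b in bundles)
total_mem = sum(b.get('memory', 0) for b in bundles)
total_worker = sum(b.get('worker', 0) for b in bundles)
print(f"PG demand: {total_cpu} CPU, {total_mem / GiB:.2f} GiB memory, {total_worker} 'worker'")
# -> PG demand: 42 CPU, 19.92 GiB memory, 41 'worker'
# The two c5.9xlarge nodes expose 72 CPUs, roughly 83 GiB of Ray memory, and 2000 'worker',
# so the placement group itself fits comfortably.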

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 3
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

2 reactions
soundway commented, Feb 2, 2022

I think we’ll have to create a custom resource hm_worker and use it exclusively for the hm_workers, so that regular workers cannot be scheduled into the hm_worker bundles.

So basically, with a PG, if any actor’s resource requirement is a subset of another’s, then we run the risk of scheduling failures unless we create the actors in a very specific order (in this case, if we create the hm_workers first, things should work). While I understand that users can work around this by adding a unique custom resource for each type of actor, and that there’s probably no easy fix on the developer’s end, this behavior is not immediately intuitive, and it creates a problem that doesn’t exist when PGs are not used. A more informative error message would be useful here.
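
For what it’s worth, the relevant changes to pg_bug.py for that workaround would look roughly like this (a sketch, assuming the nodes are started with an additional custom resource named hm_worker; the resource name and amounts here are illustrative, not part of the original reproduction):

# Sketch of the workaround: give high-memory workers their own custom resource
# ("hm_worker") so they can only be placed in the bundles reserved for them.
# Assumes: ray start ... --resources='{"worker": 1000, "hm_worker": 1000}'
hm_worker_cls = Worker.options(
    memory=mem_per_hm_worker, resources={'hm_worker': 1})  # was {'worker': 1}

bundles = ([{'CPU': 1}]
           + [{'CPU': 1, 'memory': mem_per_worker, 'worker': 1}] * config['num_workers']
           + [{'CPU': 1, 'memory': mem_per_hm_worker, 'hm_worker': 1}] * config['num_hm_workers'])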

1 reaction
kiralaz commented, Jan 27, 2022

Sure, here’s the output from ray status on my end:

$ ray status
======== Autoscaler status: 2022-01-27 16:52:21.209356 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_49fb86a57d5b87c285bdaa6bfdefd4086f807b949ec8f4bc0c29512f
 1 node_16527f5ec08e4623f4d625129df2682cf21567663dc34aff7b3e521d
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 38.0/72.0 CPU (38.0 used of 42.0 reserved in placement groups)
 15.23/82.733 GiB memory (15.23 used of 19.92 GiB reserved in placement groups)
 0.00/38.185 GiB object_store_memory
 37.0/2000.0 worker (37.0 used of 41.0 reserved in placement groups)

Demands:
 {'CPU': 1.0, 'memory': 1258291200.0, 'worker': 1.0}: 4+ pending tasks/actors (4+ using placement groups)

I guess it seems that 4 of the hm_workers somehow aren’t being scheduled?
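
In case it helps with further digging, a small sketch like this (using only public Ray APIs, nothing specific to the bug itself) dumps what Ray reports about resources and placement group state from a driver attached to the cluster:

# Sketch: inspect cluster resources and placement group state from a driver
# attached to the existing cluster.
import pprint

import ray
from ray.util import placement_group_table

ray.init(address="auto")
pprint.pprint(ray.cluster_resources())    # total resources, including the custom 'worker'
pprint.pprint(ray.available_resources())  # what is currently free
pprint.pprint(placement_group_table())    # bundles and state of each placement group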
