[Bug] 1.9.0 PlacementGroup scheduling
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
This is not a contribution.
With the requirement of using PlacementGroup in 1.9.0 for Tune, we’ve noticed that cloud jobs run with custom resources and PlacementGroupFactory() fail to schedule all the actors and just hang indefinitely. Note that it is possible this bug was present before 1.9.0, as we were not using placement groups everywhere before.
In our logs, we see warnings like this:
2022-01-26 01:00:53,234 WARNING worker.py:1245 -- The actor or task with ID ffffffffffffffffef9a401d7616b4423d86795701000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_9ec05ccf6e0b6dc761548df83a8f300f: 1.000000}, {worker_group_9ec05ccf6e0b6dc761548df83a8f300f: 1.000000}, {memory_group_9ec05ccf6e0b6dc761548df83a8f300f: 61440000.000000 GiB}
Available resources on this node: {30.000000/36.000000 CPU, 2007361660.009766 GiB/2335041660.009766 GiB memory, 1000732139.990234 GiB/1000732139.990234 GiB object_store_memory, 204800000.000000 GiB/327680000.000000 GiB memory_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 worker1, 1.000000/1.000000 node:172.29.255.62, 1000.000000/1000.000000 bundle_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 0.000000/6.000000 worker_group_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_37_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_39_9ec05ccf6e0b6dc761548df83a8f300f, 0.000000/6.000000 CPU_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1000.000000/1000.000000 bundle_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 20480000.000000 GiB/20480000.000000 GiB memory_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_38_9ec05ccf6e0b6dc761548df83a8f300f, 994.000000/1000.000000 worker, 61440000.000000 GiB/61440000.000000 GiB memory_group_40_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 worker_group_36_9ec05ccf6e0b6dc761548df83a8f300f, 6000.000000/6000.000000 bundle_group_9ec05ccf6e0b6dc761548df83a8f300f, 1.000000/1.000000 CPU_group_41_9ec05ccf6e0b6dc761548df83a8f300f, 61440000.000000 GiB/61440000.000000 GiB memory_group_38_9ec05ccf6e0b6dc761548df83a8f300f}
In total there are 0 pending tasks and 4 pending actors on this node.
Versions / Dependencies
ray==1.9.0
Reproduction script
We are able to reproduce this with a cluster consisting of two c5.9xlarge nodes.
Below we have included a code snippet for pg_bug.py used to reproduce the bug and a description of our cluster setup.
The script is expected to raise the Exception in step() very quickly if all actors are scheduled properly. Instead, it hangs indefinitely.
# pg_bug.py
import time

import ray
from ray import tune


@ray.remote
class Worker:
    def work(self):
        pass


# Memory specs of two types of workers
mem_per_worker = 400 * 1024**2  # 400 MB
mem_per_hm_worker = 3 * mem_per_worker


class Trainable(tune.Trainable):
    def setup(self, config):
        worker_cls = Worker.options(memory=mem_per_worker, resources={'worker': 1})
        hm_worker_cls = Worker.options(memory=mem_per_hm_worker, resources={'worker': 1})
        self.workers = [worker_cls.remote() for _ in range(config['num_workers'])]
        self.hm_workers = [hm_worker_cls.remote() for _ in range(config['num_hm_workers'])]

    def step(self):
        ray.get([worker.work.remote() for worker in self.hm_workers + self.workers])
        raise Exception("If this exception is raised, scheduling did not hang!")

    @classmethod
    def default_resource_request(cls, config):
        bundles = ([{
            'CPU': 1
        }] + [{
            'CPU': 1,
            'memory': mem_per_worker,
            'worker': 1
        }] * config['num_workers'] + [{
            'CPU': 1,
            'memory': mem_per_hm_worker,
            'worker': 1
        }] * config['num_hm_workers'])
        return tune.PlacementGroupFactory(bundles, strategy='PACK')


if __name__ == "__main__":
    ray.init(address="auto")
    while ray.cluster_resources()['CPU'] < 72:
        time.sleep(5)  # Wait for the other node to come online.
    print("STARTING TEST")
    tune.run(
        Trainable,
        stop={'training_iteration': 1000000},
        config={
            'num_workers': 36,
            'num_hm_workers': 5
        },
    )
On the head node, we run
ray start --head --num-cpus=`nproc --all` --resources='{"worker": 1000}' && python3 pg_bug.py
On the other node, we run
ray start --address=$REDIS_SERVER_ADDRESS --num-cpus=`nproc --all` --resources='{"worker": 1000}'
where REDIS_SERVER_ADDRESS is set to <head node hostname>:6379. We’ve omitted details specific to launching jobs and setting up the cluster on our internal infrastructure, but hopefully the above instructions are sufficient to reproduce the issue.
Anything else
In this setup, each c5.9xlarge instance has 36 CPUs and 72 GB of RAM. Even accounting for object store memory, there should be plenty of resources left, in terms of both CPU and memory, to run this job. Also note that while the use of the custom resource “worker” seems unnecessary in this case, we found it to be a necessary part of the reproduction.
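To make the arithmetic explicit, here is a rough tally of what a single trial’s placement group requests, using the config values from the script above (this snippet is illustrative only and is not part of the repro):

# Rough tally of one trial's placement group demand (illustrative only).
mem_per_worker = 400 * 1024**2          # 400 MiB per regular worker
mem_per_hm_worker = 3 * mem_per_worker  # 1200 MiB per high-memory worker
num_workers, num_hm_workers = 36, 5

total_cpus = 1 + num_workers + num_hm_workers   # 42 CPUs (vs. 72 in the cluster)
total_memory = num_workers * mem_per_worker + num_hm_workers * mem_per_hm_worker
print(total_cpus, total_memory / 1024**3)       # 42 CPUs, ~19.9 GiB of memory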
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
I think we’ll have to create a custom resource hm_worker, and then use it exclusively for the hm_workers, so that regular workers will not be scheduled on the hm_worker bundles.

So basically, with PG, if any actor’s resource requirement is a subset of another’s, then we run the risk of scheduling failures unless we create actors in a very specific order (in this case, if we create the hm_workers first, things should work). While I understand that users could also get around this by adding a unique custom resource for each type of actor, and that there’s probably not an easy way on the developer’s end to fix it, this behavior is not immediately intuitive, and it creates a problem that doesn’t exist when PG is not used. A more informative error message could be useful here.
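For reference, a minimal sketch of that workaround, assuming each node is also started with an hm_worker custom resource (the resource name and the --resources value below are illustrative, not something verified end to end):

# Start each node with e.g. --resources='{"worker": 1000, "hm_worker": 1000}'
# (assumed setup), then request 'hm_worker' only from the high-memory actors
# and their bundles, so regular workers cannot occupy those bundles.
hm_worker_cls = Worker.options(
    memory=mem_per_hm_worker, resources={'hm_worker': 1})

bundles = ([{'CPU': 1}]
           + [{'CPU': 1, 'memory': mem_per_worker, 'worker': 1}] * config['num_workers']
           + [{'CPU': 1, 'memory': mem_per_hm_worker, 'hm_worker': 1}] * config['num_hm_workers'])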
Sure, here’s the output from ray status on my end.

I guess it seems there are 4 of the hm_workers that somehow aren’t being scheduled?