
Avoiding the `pending and cannot currently be scheduled` warning

See original GitHub issue

Problem

It’s often easier to “fire and forget” when starting actors before resources are available (and let Ray autoscale to meet demand). However, we often see this notorious warning message:

2020-01-13 18:15:09,447	WARNING worker.py:1062 -- The actor or task with ID ffffffffffffffff45b95b1c0100 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {node:172.22.225.108: 1.000000}, {CPU: 2.000000}, {memory: 16.113281 GiB}, {GPU: 1.000000}, {object_store_memory: 5.566406 GiB}. In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
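For context, here is a minimal sketch of the fire-and-forget pattern that produces this warning (the `Worker` class and the 2-CPU local cluster are illustrative assumptions, not from the issue):

import ray

ray.init(num_cpus=2)

@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "ok"

# Requesting more actors than there are CPUs: the extra actors stay
# pending, and Ray logs the warning above until the autoscaler adds
# nodes or existing actors release their resources.
workers = [Worker.remote() for _ in range(4)]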

You might say, “OK, you can solve that with placement groups.” Unfortunately, this introduces complexity in several ways:

The user ends up needing to manage the creation of the placement group

Ideally, you have this:

actors = []
# start N actors, one placement group each (Actor is a @ray.remote actor class)
for _ in range(N):
    pg = new_pg()  # create a placement group (pseudocode)
    actors += [Actor.options(placement_group=pg).remote()]

But in order to avoid the warning message:

# start N placement groups up front
groups = []
for _ in range(N):
    pg = new_pg()
    groups += [(pg, pg.ready())]

# only start an actor once its placement group is actually ready
for pg, ready in groups:
    ray.wait([ready])
    actors += [Actor.options(placement_group=pg).remote()]
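For reference, the same wait-then-schedule workaround written against Ray’s public placement group API looks roughly like this (a sketch assuming Ray 1.x-style `.options(placement_group=...)`; newer Ray versions use a scheduling-strategy argument instead, and the `Worker` actor and 2-CPU cluster are illustrative):

import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=2)

@ray.remote(num_cpus=1)
class Worker:
    pass

# Create the bundles up front, block until they are actually placed, and
# only then start the actors, so nothing is ever reported as pending.
pgs = [placement_group([{"CPU": 1}]) for _ in range(2)]
ray.get([pg.ready() for pg in pgs])
workers = [Worker.options(placement_group=pg).remote() for pg in pgs]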

If you want to reuse the placement group, you have to introduce new state into your application-level scheduling

Ideal:

# PG reuse: hand a finished actor's placement group straight to a new actor
done, _ = ray.wait(actors)
for old_actor in done:
    old_actor.stop.remote()
    pg = old_actor.get_pg()  # pseudocode: look up the actor's placement group
    actors += [Actor.options(placement_group=pg).remote()]

But in order to avoid the warning message:

# PG reuse: extra "stopping" state, so a replacement actor is only scheduled
# once the old actor has actually released its placement group
stopping = []  # (stop ref, actor) pairs still shutting down
while True:
    done, _ = ray.wait(actors, timeout=0.1)
    for actor in done:
        actors.remove(actor)
        stopping += [(actor.stop.remote(), actor)]
    if stopping:
        finished, _ = ray.wait([ref for ref, _ in stopping], timeout=0.1)
        for ref, old_actor in list(stopping):
            if ref not in finished:
                continue
            stopping.remove((ref, old_actor))
            pg = old_actor.get_pg()  # pseudocode: look up the actor's placement group
            actors += [Actor.options(placement_group=pg).remote()]

Proposed Solution

Suppress the warning entirely for placement groups (just assume users know what they’re doing in that case).

cc @ericl @edoakes @wuisawesome @krfricke

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 29 (29 by maintainers)

Top GitHub Comments

1 reaction
richardliaw commented, Feb 17, 2021

@pcmoritz does Eric’s above explanation make sense? If so, could you please +1?

It’d be good to get closure on this thread before we drag it on forever.

1 reaction
ericl commented, Feb 12, 2021

Suppress the warning entirely for placement groups (just assume users know what they’re doing in that case).

Sounds good, and pretty easy (just ignore if there is a placement group resource).
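For illustration only (this is not Ray’s actual internal code), the check described above might look roughly like this; the `_group_` substring used to detect placement group bundle resources is an assumption about Ray’s resource naming scheme:

def should_warn_about_pending(required_resources: dict) -> bool:
    # If the pending actor/task requests any placement group bundle
    # resource, assume the user is managing placement explicitly and
    # suppress the "pending and cannot currently be scheduled" warning.
    uses_pg_resource = any("_group_" in name for name in required_resources)
    return not uses_pg_resource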
