
[Core] [Bug] Recursive ray.util.multiprocessing.Pool deadlock


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Using ray.util.multiprocessing.Pool recursively generates a deadlock. Creating Ray actors recursively does not get stuck. I assume a solution based on the same principle as this PR could solve the issue: https://github.com/ray-project/ray/pull/1920
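For comparison, here is a minimal sketch of the recursive actor pattern that, as described above, does not get stuck. The class and method names are illustrative (not from the report) and it assumes the default behavior that an actor occupies no CPU after creation:

import ray


@ray.remote
class Leaf:
    def work(self, x):
        return x * x


@ray.remote
class Parent:
    def work(self, x):
        # Creating a nested actor here does not hang: by default an actor
        # holds num_cpus=0 once created, so the single CPU stays available
        # for scheduling the child.
        leaf = Leaf.remote()
        return ray.get(leaf.work.remote(x))


if __name__ == '__main__':
    ray.init(num_cpus=1)
    try:
        parent = Parent.remote()
        print(ray.get(parent.work.remote(3)))  # 9
    finally:
        ray.shutdown()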

The reproduction script below hangs with the following output:

2022-01-09 15:20:50,690 INFO services.py:1265 -- View the Ray dashboard at omitted
2022-01-09 15:21:10,746 WARNING worker.py:1215 -- The actor or task with ID ffffffffffffffffe7315b869e1f68be487f8f1301000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU: 1.000000}
Available resources on this node: {0.000000/1.000000 CPU, 31.779781 GiB/31.779781 GiB memory, 1.000000/1.000000 GPU, 15.889890 GiB/15.889890 GiB object_store_memory, 1.000000/1.000000 node:172.17.0.2, 1.000000/1.000000 accelerator_type:RTX}
In total there are 0 pending tasks and 1 pending actors on this node.

Versions / Dependencies

Ray version 1.9.1

Reproduction script

import numpy as np
import ray
from ray.util.multiprocessing import Pool


def poolit_a(idx):
    # Inner level: each call opens its own Ray-backed pool.
    with Pool(ray_address='auto') as pool:
        return list(pool.map(np.sqrt, np.arange(0, 2, 1)))


def poolit_b():
    # Outer level: maps poolit_a over a range, nesting a Pool inside a Pool.
    with Pool(ray_address='auto') as pool:
        return list(pool.map(poolit_a, range(2, 4, 1)))


if __name__ == '__main__':
    try:
        ray.init(num_cpus=1)
        print(poolit_b())
    finally:
        ray.shutdown()
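
As a hedged workaround sketch (not part of the original report): nested @ray.remote tasks avoid the hang because a task that blocks in ray.get releases its CPU while waiting, which is the behavior added by the PR linked above. The function names below are illustrative:

import numpy as np
import ray


@ray.remote
def inner(x):
    return float(np.sqrt(x))


@ray.remote
def outer(idx):
    # While this task blocks in ray.get, its CPU is released, so the
    # inner tasks can still be scheduled even with num_cpus=1.
    return ray.get([inner.remote(x) for x in np.arange(0, 2, 1)])


if __name__ == '__main__':
    ray.init(num_cpus=1)
    try:
        print(ray.get([outer.remote(i) for i in range(2, 4, 1)]))
    finally:
        ray.shutdown()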

Anything else

The deadlock occurs on every run.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (7 by maintainers)

Top GitHub Comments

1 reaction
yogeveran commented, Feb 2, 2022

Design-wise:

I think the Pool actor resource allocation should be defined as an argument to the Pool constructor rather than statically, with a default of 0 for compatibility with the Python multiprocessing pool.

But if someone sets it to more than 0, we are back to the same deadlock issue. The issue can be solved if an actor that is waiting for one of its children to finish frees any CPU/GPU (not memory) resources it is not using. The release of resources can only happen if all of the children (checked separately, not summed) have equal or greater resource requirements.

This way we know the parent can resume running as soon as any of its children finishes.
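
To make the rule concrete, here is a small illustrative sketch in plain Python (not Ray internals; the function and argument names are made up for this example):

def can_release(parent_free, children_requests):
    # The parent may release its unused resources only if every waiting
    # child (checked separately, not summed) requests at least as much of
    # each resource, so whichever child finishes first frees enough for
    # the parent to resume.
    return all(
        all(child.get(res, 0.0) >= amount for res, amount in parent_free.items())
        for child in children_requests
    )


# Parent could free 1 CPU; both children ask for at least 1 CPU -> releasable.
print(can_release({'CPU': 1.0}, [{'CPU': 1.0}, {'CPU': 2.0}]))  # True
# One child only asks for 0.5 CPU -> its completion would not unblock the parent.
print(can_release({'CPU': 1.0}, [{'CPU': 0.5}, {'CPU': 2.0}]))  # False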

Practically: your suggestion to just set the value to 0 solves the bug. My design suggestion can be converted to a feature request, which is less urgent.

0 reactions
czgdp1807 commented, Feb 2, 2022

The issue can be solved if an actor that is waiting for one of its children to finish frees any CPU/GPU (not memory) resources it is not using. The release of resources can only happen if all of the children (checked separately, not summed) have equal or greater resource requirements.

Yes, theoretically that seems correct to me. In fact, I would like to know whether there is an internal C++ API in Ray that maintains parent-child relationships between tasks. That would lead to a very simple, no-risk solution to this problem.
