[Core] [Bug] Recursive ray.util.multiprocessing.Pool deadlock
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
Using ray.util.multiprocessing.Pool recursively (a Pool task that itself creates a Pool) deadlocks. Creating Ray actors recursively does not get stuck. I assume a solution based on the same principle as this PR could fix the issue: https://github.com/ray-project/ray/pull/1920
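By contrast, nested plain Ray tasks finish even on a single CPU, because a worker blocked in ray.get temporarily releases its CPU so the child task can be scheduled (the behavior the PR above introduced). A minimal sketch of my own, not from the issue:

import ray

@ray.remote
def leaf(x):
    return x * x

@ray.remote
def parent(x):
    # While blocked in ray.get, this worker releases its CPU,
    # so leaf can be scheduled even with num_cpus=1.
    return ray.get(leaf.remote(x))

if __name__ == '__main__':
    ray.init(num_cpus=1)
    print(ray.get(parent.remote(3)))  # prints 9; no deadlock
    ray.shutdown()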
The warning printed while the run hangs is shown below; the full code is in the Reproduction script section:
2022-01-09 15:20:50,690 INFO services.py:1265 -- View the Ray dashboard at omitted
2022-01-09 15:21:10,746 WARNING worker.py:1215 -- The actor or task with ID ffffffffffffffffe7315b869e1f68be487f8f1301000000 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install. Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU: 1.000000}
Available resources on this node: {0.000000/1.000000 CPU, 31.779781 GiB/31.779781 GiB memory, 1.000000/1.000000 GPU, 15.889890 GiB/15.889890 GiB object_store_memory, 1.000000/1.000000 node:172.17.0.2, 1.000000/1.000000 accelerator_type:RTX}
In total there are 0 pending tasks and 1 pending actors on this node.
Versions / Dependencies
ray version 1.9.1
Reproduction script
import numpy as np
import ray
from ray.util.multiprocessing import Pool

def poolit_a(idx):
    # Inner pool: its actor needs a CPU, but the outer pool's actor
    # is already holding the cluster's only CPU, so scheduling hangs.
    with Pool(ray_address='auto') as pool:
        return list(pool.map(np.sqrt, np.arange(0, 2, 1)))

def poolit_b():
    with Pool(ray_address='auto') as pool:
        return list(pool.map(poolit_a, range(2, 4, 1)))

if __name__ == '__main__':
    try:
        ray.init(num_cpus=1)
        print(poolit_b())
    finally:
        ray.shutdown()
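As a workaround (a sketch of my own, not taken from the thread; sqrt_task and poolit_a_no_inner_pool are hypothetical names), the inner Pool can be avoided by dispatching the leaf work as zero-CPU remote tasks, which never compete with the outer pool's actor for the single CPU:

import numpy as np
import ray

@ray.remote(num_cpus=0)
def sqrt_task(x):
    # Requires no CPU slot, so it can run even while the pool actor
    # holds the cluster's only CPU.
    return float(np.sqrt(x))

def poolit_a_no_inner_pool(idx):
    # Same work as poolit_a above, without creating a nested Pool.
    return ray.get([sqrt_task.remote(x) for x in np.arange(0, 2, 1)])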
Anything else
The deadlock occurs on every run.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments: 10 (7 by maintainers)
Top GitHub Comments
Design wise:
I think the resource allocation of the Pool's actors should be defined as an argument to the Pool constructor rather than statically, with a default of 0 CPUs for compatibility with Python's multiprocessing.Pool.
But if someone sets it above 0, we are back to the same deadlock. That case could be solved as follows: when an actor is waiting for one of its children to finish, it can temporarily release any CPU/GPU (not memory) resources it is not using. The release is only safe if every child (individually, not in sum) has resource requirements equal to or greater than the parent's.
This way we know the parent can resume running as soon as any one of its children finishes.
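To make that rule concrete, a small sketch (a hypothetical helper of my own, not part of Ray):

def parent_can_release(parent_res, children_res):
    # Safe to release the parent's CPU/GPU while it waits only if each
    # child (separately, not the sum) requires equal or greater amounts,
    # so any one child finishing frees enough for the parent to resume.
    return all(
        child.get(res, 0.0) >= amount
        for child in children_res
        for res, amount in parent_res.items()
    )

# Both children request >= the parent's 1 CPU, so releasing is safe.
print(parent_can_release({'CPU': 1.0}, [{'CPU': 1.0}, {'CPU': 2.0}]))  # True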
Practically: your suggestion to just set the value to 0 fixes the bug. My design suggestion can be converted into a feature request, which is less urgent.
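For illustration, what the proposed constructor argument might look like (the ray_remote_args parameter is hypothetical here and does not exist on Pool in ray 1.9.1):

from ray.util.multiprocessing import Pool

# Hypothetical API per the proposal above: the pool actors' resource
# requirements are passed at construction time and default to 0 CPUs,
# matching Python's multiprocessing.Pool semantics.
pool = Pool(processes=2, ray_remote_args={'num_cpus': 0})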
Yes, theoretically that seems correct to me. In fact, I would like to know whether there is any internal C++ API in Ray that maintains parent-child relationships between tasks. That would lead to a very simple, low-risk solution to this problem.