[Core] [Bug] No timeout or deadlock on scheduling job in remote cluster
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Clusters
What happened + What you expected to happen
Sometimes calling my_function.remote(args)
never returns.
I used the Python faulthandler
module to get a stack trace of the frozen process, and it looks like there is a deadlock or a missing timeout on a network call:
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 281 in _async_send
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 363 in ReleaseObject
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 532 in _release_server
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 526 in call_release
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 118 in call_release
File "/usr/lib/python3.9/queue.py", line 133 in put
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 287 in _async_send
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 368 in Schedule
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 500 in _call_schedule_for_task
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 459 in call_remote
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 106 in call_remote
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 380 in remote
File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 130 in _remote
File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 173 in client_mode_convert_function
File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 222 in _remote
File "/usr/local/lib/python3.9/dist-packages/ray/util/tracing/tracing_helper.py", line 295 in _invocation_remote_span
File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 180 in remote
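For reference, the trace above was obtained with faulthandler. A minimal sketch of how to inspect a hung process this way (the choice of SIGUSR1 is arbitrary, and signal registration is Unix-only):

```python
import faulthandler
import signal

# Dump the stack of every thread when the process receives SIGUSR1,
# so a hung process can be inspected from outside with `kill -USR1 <pid>`.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Alternatively, dump all thread stacks immediately (to stderr by default).
faulthandler.dump_traceback(all_threads=True)
```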
In general, it would be great to have timeouts on all Ray functions that deal with the network. That would make recovery possible in client code.
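Until such timeouts exist in Ray itself, the client can at least regain control by running the blocking call in a worker thread and bounding the wait. This is only a sketch, not Ray-specific: `call_with_timeout` and the timeout values are illustrative, and a truly deadlocked worker thread keeps running in the background; this only lets the caller recover.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=1)

def call_with_timeout(fn, *args, timeout=30.0, **kwargs):
    """Run a potentially blocking call in a worker thread and give up
    after `timeout` seconds. The worker thread is not killed; the caller
    merely stops waiting for it."""
    future = _executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        raise RuntimeError(f"call did not finish within {timeout}s")

# Usage with Ray would look like:
#   ref = call_with_timeout(my_function.remote, args, timeout=10.0)
```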
Versions / Dependencies
Ray 1.8, Debian stable based Docker image with Python 3.9
Reproduction script
I did not find a way to reliably reproduce this, but it can be triggered by any function.
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Reactions:1
- Comments:8 (5 by maintainers)
Top GitHub Comments
@jakub-valenta do you have a simple script that we could use to reproduce this?
Should this be bumped to p0?