Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Core] [Bug] No timeout or deadlock on scheduling job in remote cluster

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

Sometimes calling my_function.remote(args) never returns.
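
For context, the call pattern is roughly the following minimal sketch (the cluster address, function body, and argument are placeholders, not taken from the original report); the hang happens while the task is being submitted through the Ray Client, before an ObjectRef is returned:

  import ray

  # Connect to the remote cluster through the Ray Client (ray:// scheme,
  # default port 10001). The address here is a placeholder.
  ray.init("ray://head-node.example.com:10001")

  @ray.remote
  def my_function(x):
      return x * 2

  # This call occasionally never returns; the stack trace below shows the
  # submission blocked inside the Ray Client's data channel.
  ref = my_function.remote(21)
  print(ray.get(ref))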

I used Python's faulthandler module to get a stack trace of the frozen process, and it looks like there is a deadlock or a missing timeout on a network call:

  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 281 in _async_send
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 363 in ReleaseObject
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 532 in _release_server
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 526 in call_release
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 118 in call_release
  File "/usr/lib/python3.9/queue.py", line 133 in put
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 287 in _async_send
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/dataclient.py", line 368 in Schedule
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 500 in _call_schedule_for_task
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/worker.py", line 459 in call_remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/api.py", line 106 in call_remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 380 in remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/client/common.py", line 130 in _remote
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 173 in client_mode_convert_function
  File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 222 in _remote
  File "/usr/local/lib/python3.9/dist-packages/ray/util/tracing/tracing_helper.py", line 295 in _invocation_remote_span
  File "/usr/local/lib/python3.9/dist-packages/ray/remote_function.py", line 180 in remote

Generally, it would be great to have timeouts on all Ray functions that deal with the network; that would make recovery possible in client code.
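
Until such timeouts exist, one possible client-side workaround is to run the submission in a worker thread and bound the wait on the caller's side. The sketch below is a generic pattern with illustrative names, not a Ray API; note that the stuck worker thread itself cannot be cancelled, so the timeout only lets the caller recover:

  from concurrent.futures import ThreadPoolExecutor, TimeoutError

  _executor = ThreadPoolExecutor(max_workers=1)

  def remote_with_timeout(remote_fn, *args, timeout_s=30):
      # Submit remote_fn.remote(*args) in a background thread and wait at
      # most timeout_s seconds for the ObjectRef to come back.
      future = _executor.submit(lambda: remote_fn.remote(*args))
      try:
          return future.result(timeout=timeout_s)
      except TimeoutError:
          # The submission is still blocked inside the Ray Client; the caller
          # can retry or reconnect, but the background thread stays stuck.
          raise RuntimeError("task submission timed out") from None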

Versions / Dependencies

Ray 1.8, running in a Debian stable-based Docker image with Python 3.9.

Reproduction script

I did not find a way to reliably reproduce this, but it can be triggered by any remote function call.
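
In the absence of a deterministic reproducer, something like the stress loop below may eventually trigger the hang; this is only a hypothetical sketch (the address and task body are placeholders), not a confirmed reproduction:

  import ray

  ray.init("ray://head-node.example.com:10001")

  @ray.remote
  def noop(i):
      return i

  # Any submission can hang according to the report, so just submit a large
  # number of trivial tasks through the Ray Client and wait on each result.
  for i in range(100000):
      ray.get(noop.remote(i))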

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

1 reaction
scv119 commented, Jan 9, 2022

@jakub-valenta do you have a simple script that we could use to reproduce this?

1 reaction
scv119 commented, Jan 9, 2022

Should this be bumped to p0?

Read more comments on GitHub >

Top Results From Across the Web

Resolve blocking problem caused by lock escalation - SQL ...
This article describes how to determine whether lock escalation is causing blocking and how to resolve the problem.

Troubleshooting Hazelcast cluster management - Pega Support
Hazelcast logs clog the server space at a rapid pace, triggering error every few milliseconds: com.hazelcast.core.OperationTimeoutException ...

Tivoli Workload Scheduler: Troubleshooting Guide - IBM
Submitted job is not running on a dynamic agent ... Workload Scheduler logs (from the Maestro, Unison, Netman, Cluster, and Altinst catalogs) ...

Troubleshoot Dataflow errors - Google Cloud
These errors typically occur when some of your running Dataflow jobs use the same temp_location to stage temporary job files created when the ...

Configuration | Apache Flink
On session clusters, the provided configuration will only be used for configuring execution parameters, e.g. configuration parameters affecting the job, not ...

