[Bug] [Serve] Ray hangs on API methods
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Serve
What happened + What you expected to happen
After connecting to Ray and Ray Serve on a remote Ray cluster (running on k8s), running a job, and then waiting for a little while, subsequent `serve`/`ray` methods seem to block indefinitely.
Versions / Dependencies
ray[serve]==1.9.0
Python 3.7.12
Reproduction script
Repro script with experiment results commented (Note: must edit remote cluster URL):
import logging
import time
import ray
from ray import serve
from tqdm import tqdm
logger = logging.getLogger("ray")
def init_ray(use_remote: bool = True, verbose: bool = True):
    logger.info("Entering init_ray")
    if ray.is_initialized():
        logger.info("Ray is initialized")
        # NOTE: If you put `ray.shutdown()` here and remove the return, the script will also hang on that.
        return
    # Default so that `address` is always defined; with None, ray.init uses its default behavior.
    address = None
    if use_remote:
        # This should be a remote Ray cluster, connected to via the Ray Client
        address = "ray://<your Ray client URL>:10001"
    logger.info("Running ray.init")
    ray.init(address=address, namespace="serve", log_to_driver=verbose)
    # Start Ray Serve for model serving
    # Bind on 0.0.0.0 to expose the HTTP server on external IPs.
    logger.info("Running serve.start")
    serve.start(detached=True, http_options={"host": "0.0.0.0"})
DEPLOYMENT_NAME = "DeployClass"
ray_autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_num_ongoing_requests_per_replica": 5,
}
@serve.deployment(
    name=DEPLOYMENT_NAME,
    version="v1",  # required for autoscaling at the moment
    max_concurrent_queries=10,
    _autoscaling_config=ray_autoscaling_config,
)
class DeployClass:
    def f(self, i: int):
        logger.info(f"Handling {i}")
        time.sleep(2)
        return i
def deploy_deployment():
    try:
        # NOTE: This is the line it stalls on! The first `serve.` line
        logger.info("Trying to get existing deployment")
        return serve.get_deployment(DEPLOYMENT_NAME)
    except KeyError:
        logger.info("DeployClass is not currently deployed, deploying...")
        DeployClass.deploy()
        return DeployClass
inputs = list(range(10))
for i in range(5):
    logger.info("Starting ray init")
    init_ray(True, True)
    logger.info("Deploying deployment")
    deployment = deploy_deployment()
    logger.info("Getting handle")
    handle = deployment.get_handle()
    logger.info("Making method calls")
    futures = [handle.f.remote(i) for i in inputs]
    logger.info("Getting results")
    results = ray.get(futures)
    logger.info(f"Results: {results}")
    # simulate doing lots of other work...
    # Confirmed to not work:
    # 1) 10m (waited 5m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 2) 2m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 3) 1m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 4) 30s (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # Confirmed to work sometimes:
    # 5) 15s (worked 2x, then stalled out on iteration #3)
    # 6) 30s (worked 1x, then stalled out on iteration #2)
    logger.info("Waiting for a while...")
    for minute in tqdm(range(1)):
        logger.info(f"Waiting a minute (already waited {minute})")
        time.sleep(60)
Anything else
This happens every time for certain wait periods. See the `Confirmed to work`/`Confirmed to not work` experiments at the bottom of the repro script.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments:17 (17 by maintainers)
This issue is also reproducible on a laptop. The key is to force the use of the Ray Client by starting a head node with `ray start --head`, then using `ray://127.0.0.1:10001` as the address. The symptom on a laptop is identical to that on a remote cluster.
PR is up: #21104
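When reproducing locally, it can be useful to first rule out a plain connectivity problem, so that a stall is not confused with the Ray Client server being unreachable. The following is a minimal sketch using only the standard library; `ray_client_port_open` is a hypothetical helper, and the default host and port are taken from the address above. A successful TCP connection only shows that something is listening on the port, not that the Ray Client server is healthy.

```python
import socket


def ray_client_port_open(host="127.0.0.1", port=10001, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    Hypothetical pre-flight check: it does not speak the Ray Client
    protocol, it only tests that the port accepts connections.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused connections and timeouts
        return False
```

If this returns `False` after `ray start --head`, the problem is likely in the cluster startup rather than the hang described in this issue.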