[Bug] [Serve] Ray hangs on API methods

Search before asking

I searched the issues and found no similar issues.

Ray Component

Ray Serve

What happened + What you expected to happen

After connecting to Ray and Ray Serve on a remote Ray cluster (running on k8s), running a job, and then waiting for a little while, subsequent Serve/Ray API calls seem to block indefinitely.

Versions / Dependencies

ray[serve]==1.9.0
Python 3.7.12

Reproduction script

Repro script, with experiment results recorded in comments (note: you must edit the remote cluster URL):

import logging
import time

import ray
from ray import serve
from tqdm import tqdm

logger = logging.getLogger("ray")


def init_ray(use_remote: bool = True, verbose: bool = True):
    logger.info("Entering init_ray")
    if ray.is_initialized():
        logger.info("Ray is initialized")
        # NOTE: If you put `ray.shutdown()` here and remove the return, the script will also hang on that.
        return

    if use_remote:
        # This should be a remote ray cluster connected to with the Ray Client
        address = "ray://<your Ray client URL>:10001"
        logger.info("Running ray.init")
        ray.init(address=address, namespace="serve", log_to_driver=verbose)

        # Start Ray Serve for model serving
        # Bind on 0.0.0.0 to expose the HTTP server on external IPs.
        logger.info("Running serve.start")
        serve.start(detached=True, http_options={"host": "0.0.0.0"})


DEPLOYMENT_NAME = "DeployClass"
ray_autoscaling_config = {
    "min_replicas": 1,
    "max_replicas": 100,
    "target_num_ongoing_requests_per_replica": 5,
}


@serve.deployment(
    name=DEPLOYMENT_NAME,
    version="v1",  # required for autoscaling at the moment
    max_concurrent_queries=10,
    _autoscaling_config=ray_autoscaling_config,
)
class DeployClass:
    def f(self, i: int):
        logger.info(f"Handling {i}")
        time.sleep(2)
        return i


def deploy_deployment():
    try:
        # NOTE: This is the line it stalls on! The first `serve.` line
        logger.info("Trying to get existing deployment")
        return serve.get_deployment(DEPLOYMENT_NAME)
    except KeyError:
        logger.info("DeployClass is not currently deployed, deploying...")
        DeployClass.deploy()
        return DeployClass


inputs = list(range(10))

for i in range(5):
    logger.info("Starting ray init")
    init_ray(True, True)
    logger.info("Deploying deployment")
    deployment = deploy_deployment()
    logger.info("Getting handle")
    handle = deployment.get_handle()

    logger.info("Making method calls")
    futures = [handle.f.remote(i) for i in inputs]
    logger.info("Getting results")
    results = ray.get(futures)
    logger.info(f"Results: {results}")

    # simulate doing lots of other work...
    # Confirmed to not work:
    # 1) 10m (waited 5m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 2) 2m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 3) 1m (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # 4) 30s (waited 10m on serve.get_deployment before interrupting). Also saw
    #    `Polling request timed out` error on `listen_for_changes`
    # Confirmed to work sometimes:
    # 5) 15s (worked 2x, then stalled out on iteration #3)
    # 6) 30s (worked 1x, then stalled out on iteration #2)
    logger.info("Waiting for a while...")
    for minute in tqdm(range(1)):
        logger.info(f"Waiting a minute (already waited {minute})")
        time.sleep(60)

Anything else

The hang reproduces every time for certain wait periods. See the "Confirmed to work sometimes" and "Confirmed to not work" experiments in the comments at the bottom of the repro script.
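Because the hang shows up as a call that never returns, the repro is easier to run unattended if the suspect call is bounded by a timeout. The following is a minimal sketch using only the standard library. The names `call_with_timeout` and `CallTimedOut` are hypothetical helpers, not part of Ray or Ray Serve, and it is an assumption that Ray Client calls behave the same way when issued from a worker thread.

```python
import threading
from typing import Any, Callable


class CallTimedOut(Exception):
    """Raised when the wrapped call does not finish within the timeout."""


def call_with_timeout(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Hypothetical helper for diagnosing hangs. It does not cancel the
    blocked call; it only stops waiting for it.
    """
    outcome = {}

    def runner() -> None:
        # Capture either the return value or the exception so the
        # caller sees the same result it would have seen inline.
        try:
            outcome["value"] = fn()
        except BaseException as exc:
            outcome["error"] = exc

    # A daemon thread does not keep the interpreter alive if it stays blocked.
    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        raise CallTimedOut(f"call did not return within {timeout_s} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

In the repro script, the stalling line could then be wrapped as `call_with_timeout(lambda: serve.get_deployment(DEPLOYMENT_NAME), 60)`. A `KeyError` from `serve.get_deployment` still propagates to the existing `except KeyError` branch, while a hang surfaces as `CallTimedOut` instead of blocking the loop.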

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Issue Analytics

State:
Created 2 years ago
Comments: 17 (17 by maintainers)

Top GitHub Comments

4 reactions

jiaodong commented, Dec 13, 2021

This issue is also reproducible on a laptop. The key is to force the use of Ray Client by starting a local head node:

ray start --head

Then use ray://127.0.0.1:10001 as the address. The symptom on a laptop is identical to that on the remote cluster.

2 reactions

simon-mo commented, Dec 15, 2021

PR is up #21104
