
[Serve] Issue starting Serve with Ray Client on K8s

See original GitHub issue

Tried to use Ray Client + Operator support. I don’t think the bug is related to the K8s operator; it’s probably some weird interaction between Serve, the client, and Docker.

Using the py36 image, calling serve.start() after ray.util.connect(...) prints:

(pid=329) 2021-02-10 21:25:21,900       INFO http_state.py:70 -- Starting HTTP proxy with name 'XgvdVP:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.57.184-0' on node 'node:192.168.57.184-0' listening on '127.0.0.1:8000'
(pid=332) INFO:     Started server process [332]
Got Error from data channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
        debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>
Traceback (most recent call last):
  File "serve_app.py", line 9, in <module>
    client = serve.start()
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 639, in start
    return Client(controller, controller_name, detached=detached)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 112, in __init__
    self._http_config = ray.get(controller.get_http_config.remote())
  File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/api.py", line 35, in get
    return self.worker.get(vals, timeout=timeout)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in get
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 87, in _data_main
    raise e
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 62, in _data_main
    for response in resp_stream:
  File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
    return self._next()
  File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 706, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
        debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>

    out = [self._get(x, timeout) for x in to_get]
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in <listcomp>
    out = [self._get(x, timeout) for x in to_get]
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 172, in _get
    data = self.data_client.GetObject(req)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 121, in GetObject
    resp = self._blocking_send(datareq)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 106, in _blocking_send
    f"cannot send request {req}: data channel shutting down")
ConnectionError: cannot send request req_id: 5
get {
  id: "W\375\307a\355\361\332{p\025\200o\036{D\316\315\255M6\002\000\000\000\001\000\000\000"
}
: data channel shutting down
Exception ignored in: <bound method Client.__del__ of <ray.serve.api.Client object at 0x7faa602812e8>>
Traceback (most recent call last):
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 144, in __del__
    self.shutdown()
  File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 157, in shutdown
    if (not self._shutdown) and ray.is_initialized():
  File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/__init__.py", line 120, in __getattr__
    raise Exception("Ray Client is not connected. "
Exception: Ray Client is not connected. Please connect by calling `ray.connect`.

(I actually don’t understand this: the HTTP proxy was started??? Yet starting it depends on the very HTTPOptions object that failed to deserialize.)
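For context (not from the report): a "Cloudpickle Error: Unknown type" usually means the process deserializing the response cannot resolve the class by its module path. A minimal stdlib-pickle analogue of that failure class, using a stand-in rather than Ray’s actual HTTPOptions:

```python
import pickle

# Stand-in for a class the receiving side cannot look up by name:
# the class calls itself "HTTPOptions" but is bound to a different
# variable, so pickle's by-reference class lookup fails.
Opts = type("HTTPOptions", (), {"host": "127.0.0.1", "port": 8000})

err = None
try:
    pickle.dumps(Opts())
except pickle.PicklingError as e:
    err = e

print("serialization failed:", err is not None)
```

In the real setup the mismatch is between environments (the class exists on one side of the gRPC channel but can’t be resolved on the other), not a renamed binding, but the symptom is the same: serialization of the response dies mid-stream.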

On py37, an even wilder error appears:

(pid=279) 2021-02-10 18:33:47,086       INFO http_state.py:70 -- Starting HTTP proxy with name 'kXHKDS:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.45.111-0' on node 'node:192.168.45.111-0' listening on '127.0.0.1:8000'
(pid=298) INFO:     Started server process [298]
(pid=279) 2021-02-10 18:33:47,689       INFO controller.py:190 -- Deleting endpoint 'endpoint'
(raylet) Fatal Python error: initfsencoding: Unable to get the locale encoding
(raylet) ModuleNotFoundError: No module named 'encodings'
(raylet) 
(raylet) Current thread 0x00007fa860126740 (most recent call first):
The actor or task with ID ffffffffffffffff7dad32b7e02117e675ea54dc03000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {2.000000/4.000000 CPU, 0.244141 GiB/0.244141 GiB memory, 0.048828 GiB/0.048828 GiB object_store_memory, 0.980000/1.000000 node:192.168.45.111}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

I’m following this guide https://ray--14016.org.readthedocs.build/en/14016/cluster/kubernetes.html#using-ray-client-to-connect-from-outside-the-kubernetes-cluster and only changing the image tags to :nightly and :nightly-py36.

And the script is just:

import ray
from ray import serve

ray.init(address="auto")
client = serve.start()
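For anyone reproducing this, the guide’s flow (paraphrased; the service name below is a placeholder that depends on your operator config) is to expose the head node’s Ray Client port, which defaults to 10001 and matches the peer address 127.0.0.1:10001 in the trace above, then run the script locally:

```shell
# Forward the Ray Client server port from the head node service
# (service name is hypothetical; use the one your operator created).
kubectl port-forward service/example-cluster-ray-head 10001:10001 &

# Run the reproduction script against the forwarded port.
python serve_app.py
```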

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

1 reaction
AmeerHajAli commented, Mar 29, 2021

@edoakes, can the Serve team help add such a test?

0 reactions
simon-mo commented, Apr 26, 2021

Ok let’s close it. We will re-open it or open a new one if errors come up.

