[Serve] Issue starting Serve with Ray Client on K8s
I tried to use Ray Client with the K8s operator support. I don't think the bug is related to the K8s operator; it's probably an issue in how Serve, Ray Client, and Docker interact.
Using the py36 image, calling serve.start() after ray.util.connect(...) prints:
(pid=329) 2021-02-10 21:25:21,900 INFO http_state.py:70 -- Starting HTTP proxy with name 'XgvdVP:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.57.184-0' on node 'node:192.168.57.184-0' listening on '127.0.0.1:8000'
(pid=332) INFO: Started server process [332]
Got Error from data channel -- shutting down: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>
Traceback (most recent call last):
File "serve_app.py", line 9, in <module>
client = serve.start()
File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 639, in start
return Client(controller, controller_name, detached=detached)
File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 112, in __init__
self._http_config = ray.get(controller.get_http_config.remote())
File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/api.py", line 35, in get
return self.worker.get(vals, timeout=timeout)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in get
Exception in thread Thread-6:
Traceback (most recent call last):
File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/Users/simonmo/miniconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 87, in _data_main
raise e
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 62, in _data_main
for response in resp_stream:
File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/Users/simonmo/miniconda3/lib/python3.6/site-packages/grpc/_channel.py", line 706, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>"
debug_error_string = "{"created":"@1613021122.279057000","description":"Error received from peer ipv4:127.0.0.1:10001","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Exception iterating responses: Cloudpickle Error: Unknown type <class 'ray.serve.config.HTTPOptions'>","grpc_status":2}"
>
out = [self._get(x, timeout) for x in to_get]
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 164, in <listcomp>
out = [self._get(x, timeout) for x in to_get]
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/worker.py", line 172, in _get
data = self.data_client.GetObject(req)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 121, in GetObject
resp = self._blocking_send(datareq)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/dataclient.py", line 106, in _blocking_send
f"cannot send request {req}: data channel shutting down")
ConnectionError: cannot send request req_id: 5
get {
id: "W\375\307a\355\361\332{p\025\200o\036{D\316\315\255M6\002\000\000\000\001\000\000\000"
}
: data channel shutting down
Exception ignored in: <bound method Client.__del__ of <ray.serve.api.Client object at 0x7faa602812e8>>
Traceback (most recent call last):
File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 144, in __del__
self.shutdown()
File "/Users/simonmo/Desktop/ray/ray/python/ray/serve/api.py", line 157, in shutdown
if (not self._shutdown) and ray.is_initialized():
File "/Users/simonmo/Desktop/ray/ray/python/ray/_private/client_mode_hook.py", line 46, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File "/Users/simonmo/Desktop/ray/ray/python/ray/util/client/__init__.py", line 120, in __getattr__
raise Exception("Ray Client is not connected. "
Exception: Ray Client is not connected. Please connect by calling `ray.connect`.
(I don't actually understand this: the HTTP proxy was started, yet starting it depends on the HTTPOptions object.)
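One reading consistent with the traceback is that the proxy starts inside the cluster, where HTTPOptions never has to cross the client connection, and the failure only appears when `get_http_config` sends that object back to the client. The toy model below sketches that split using only the standard library. It is not Ray's implementation: `StrictPickler`, `ToyController`, `send_to_client`, and the stand-in `HTTPOptions` class are all hypothetical names made up for illustration.

```python
import io
import pickle


class HTTPOptions:
    """Hypothetical stand-in for ray.serve.config.HTTPOptions."""

    def __init__(self, host="127.0.0.1", port=8000):
        self.host = host
        self.port = port


class StrictPickler(pickle.Pickler):
    """Toy pickler that refuses one type, mimicking an 'Unknown type' error."""

    def persistent_id(self, obj):
        # pickle calls this hook for every object it is about to serialize.
        if isinstance(obj, HTTPOptions):
            raise pickle.PicklingError(f"Unknown type {type(obj)}")
        return None  # fall back to normal pickling


class ToyController:
    """Lives on the 'server'; uses HTTPOptions without serializing it."""

    def __init__(self):
        self.http_options = HTTPOptions()
        self.proxy_started = False

    def start_proxy(self):
        # In-process use of the options object: no pickling involved.
        self.proxy_started = True

    def get_http_config(self):
        return self.http_options


def send_to_client(value) -> bytes:
    """Serialize a return value as if shipping it back to the client."""
    buffer = io.BytesIO()
    StrictPickler(buffer).dump(value)
    return buffer.getvalue()


def run_demo():
    controller = ToyController()
    controller.start_proxy()  # succeeds, like the "Starting HTTP proxy" log
    try:
        send_to_client(controller.get_http_config())
        error = None
    except pickle.PicklingError as exc:
        error = str(exc)  # fails only on the way back to the client
    return controller.proxy_started, error
```

In this model the proxy is already running by the time the return value is rejected, which matches the order of the log lines above.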
On Py37, an even wilder error appears:
(pid=279) 2021-02-10 18:33:47,086 INFO http_state.py:70 -- Starting HTTP proxy with name 'kXHKDS:SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-node:192.168.45.111-0' on node 'node:192.168.45.111-0' listening on '127.0.0.1:8000'
(pid=298) INFO: Started server process [298]
(pid=279) 2021-02-10 18:33:47,689 INFO controller.py:190 -- Deleting endpoint 'endpoint'
(raylet) Fatal Python error: initfsencoding: Unable to get the locale encoding
(raylet) ModuleNotFoundError: No module named 'encodings'
(raylet)
(raylet) Current thread 0x00007fa860126740 (most recent call first):
The actor or task with ID ffffffffffffffff7dad32b7e02117e675ea54dc03000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {2.000000/4.000000 CPU, 0.244141 GiB/0.244141 GiB memory, 0.048828 GiB/0.048828 GiB object_store_memory, 0.980000/1.000000 node:192.168.45.111}
. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
I'm following this guide: https://ray--14016.org.readthedocs.build/en/14016/cluster/kubernetes.html#using-ray-client-to-connect-from-outside-the-kubernetes-cluster and only changing the image tag to :nightly and nightly-py36.
The script is just:
import ray
from ray import serve
ray.init(address="auto")
client = serve.start()
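Since the tracebacks come from a local Python 3.6 install while the image tag varies, one thing that may be worth ruling out is a Python version mismatch between the client and the image. This is only a guess, not a confirmed cause. The helper below is a minimal sketch with hypothetical function names; it assumes the image tag ends in a `pyXY` suffix when it pins a Python version, and returns None for tags such as `nightly` that do not say.

```python
import re
import sys
from typing import Optional


def local_python_tag(version_info=sys.version_info) -> str:
    """Return a tag such as 'py36' for the interpreter running the client."""
    return f"py{version_info[0]}{version_info[1]}"


def image_python_tag(image_tag: str) -> Optional[str]:
    """Extract a trailing 'pyXY' suffix from an image tag, if it has one."""
    match = re.search(r"py\d{2,3}$", image_tag)
    return match.group(0) if match else None


def same_python(image_tag: str, version_info=sys.version_info) -> Optional[bool]:
    """True or False when the tag names a Python version, None when it does not."""
    suffix = image_python_tag(image_tag)
    if suffix is None:
        return None  # the tag alone does not reveal the Python version
    return suffix == local_python_tag(version_info)
```

Calling `same_python("nightly-py36")` from the client environment gives a quick yes or no; for an unsuffixed tag the Python version would have to be checked inside the container instead.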
Issue Analytics
- State:
- Created 3 years ago
- Comments:11 (11 by maintainers)
@edoakes, can the Serve team help add such a test?
Ok let’s close it. We will re-open it or open a new one if errors come up.