Ray 1.2.0 fails to connect to Ray cluster on K8s (running master)
Working through the instructions given on https://docs.ray.io/en/master/cluster/kubernetes.html, I’m running into the following issue on this step:
Then open a new shell and try out a sample program:
$ python ray/doc/kubernetes/example_scripts/run_local_example.py
The program in this example uses ray.util.connect("127.0.0.1:10001") to connect to the Ray cluster.
Traceback (most recent call last):
  File "ray/doc/kubernetes/example_scripts/run_local_example.py", line 57, in <module>
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client_connect.py", line 26, in connect
    conn_str, secure=secure, metadata=metadata, connection_retries=3)
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client/__init__.py", line 57, in connect
    connection_retries=connection_retries)
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client/worker.py", line 120, in __init__
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
I’ve confirmed that the service is available:
$ microk8s.kubectl -n ray get services
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   10.152.183.99   <none>        10001/TCP,8265/TCP,8000/TCP   16m
And I’ve enabled port forwarding:
$ microk8s.kubectl -n ray port-forward service/example-cluster-ray-head 10001:10001
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
Handling connection for 10001
Handling connection for 10001
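To separate a plain networking problem from a Ray client problem, the forwarded port can be probed at the TCP level before calling ray.util.connect. The following is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper name, not part of Ray. A successful probe only shows that something accepts TCP connections on the port, not that the Ray client server behind it is responding.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 10001 is the port forwarded by the kubectl port-forward command above.
    print(port_is_open("127.0.0.1", 10001))
```

If this reports that the port is open while ray.util.connect still times out, as the "Handling connection" lines above suggest, the problem is more likely above the TCP layer than in the port forwarding itself.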
I ran this all from the current master (cd89f0dc55ae98231aa08e9a0e1c80409e75acf1).
Any help would be appreciated. Thanks.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:2
- Comments:13 (6 by maintainers)
Not a Ray maintainer, but for what it’s worth, Kubernetes namespaces have different network interfaces, so if the service is exposed through a different namespace, port forwarding does not work. That’s pure K8s though, nothing to do with Ray AFAIK.
With respect to this issue, however, I don’t think we should close it. It is certainly unintended behaviour that the older version of the client (1.2.0) cannot connect to a newer server (what will become 1.3.0), and fails silently at that. At the very least the docs need to be updated to point this out, but hopefully the server is modified to be compatible with the older client. Hence we should reopen, and maybe rephrase the title to “Ray 1.2.0 fails to connect to Ray cluster on K8s (running master)”.
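For anyone hitting this in the meantime, a pre-flight check along these lines could at least make a version mismatch explicit instead of letting it surface as a timeout. This is only a sketch: `same_ray_version` and `describe_versions` are hypothetical helpers, not part of Ray, and it assumes the client and the cluster need to run the same Ray version, which is what this thread suggests rather than documented behaviour. How the cluster's version is obtained is left open; the "1.3.0" value below is a placeholder.

```python
def same_ray_version(client_version: str, server_version: str) -> bool:
    """Return True when both version strings are identical after trimming whitespace."""
    return client_version.strip() == server_version.strip()


def describe_versions(client_version: str, server_version: str) -> str:
    """Build a human-readable message about the client/cluster version pair."""
    if same_ray_version(client_version, server_version):
        return f"versions match: {client_version.strip()}"
    return (
        f"version mismatch: client {client_version.strip()} "
        f"vs cluster {server_version.strip()}"
    )


if __name__ == "__main__":
    try:
        import ray  # only needed to read the locally installed client version

        # "1.3.0" is a placeholder for whatever version the cluster reports.
        print(describe_versions(ray.__version__, "1.3.0"))
    except ImportError:
        print("ray is not installed in this environment")
```

Exact string comparison is deliberately strict here; a looser rule (for example, comparing only major and minor components) would be a different assumption about compatibility.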
Any fix planned for the section that doesn’t work?