Ray 1.2.0 fails to connect to Ray cluster on K8s (running master)

Working through the instructions given on https://docs.ray.io/en/master/cluster/kubernetes.html, I’m running into the following issue on this step:

Then open a new shell and try out a sample program:

$ python ray/doc/kubernetes/example_scripts/run_local_example.py
The program in this example uses ray.util.connect(127.0.0.1:10001) to connect to the Ray cluster.
Traceback (most recent call last):
  File "ray/doc/kubernetes/example_scripts/run_local_example.py", line 57, in <module>
    ray.util.connect(f"127.0.0.1:{LOCAL_PORT}")
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client_connect.py", line 26, in connect
    conn_str, secure=secure, metadata=metadata, connection_retries=3)
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client/__init__.py", line 57, in connect
    connection_retries=connection_retries)
  File "/home/stus/miniconda3/envs/fctk/lib/python3.7/site-packages/ray/util/client/worker.py", line 120, in __init__
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout

I’ve confirmed that the service is available:

$  microk8s.kubectl -n ray get services
NAME                       TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
example-cluster-ray-head   ClusterIP   10.152.183.99   <none>        10001/TCP,8265/TCP,8000/TCP   16m

And I’ve enabled port forwarding:

$  microk8s.kubectl -n ray port-forward service/example-cluster-ray-head 10001:10001
Forwarding from 127.0.0.1:10001 -> 10001
Forwarding from [::1]:10001 -> 10001
Handling connection for 10001
Handling connection for 10001
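
The "Handling connection" lines suggest the forward is accepting TCP connections, so the timeout may come from the Ray client handshake rather than from networking. A minimal sketch for checking TCP reachability on its own, assuming the forwarded address 127.0.0.1:10001 from the output above; `tcp_port_open` is a hypothetical helper and not part of Ray:

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds.

    Hypothetical helper: it only tests that something accepts the
    connection, not that a compatible Ray client server is behind it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address taken from the port-forward output in this issue.
    print(tcp_port_open("127.0.0.1", 10001))
```

If this reports the port as open while `ray.util.connect` still times out, the problem is more likely above the TCP layer, for example a client/server version mismatch.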

I ran this all from the current master (cd89f0dc55ae98231aa08e9a0e1c80409e75acf1).

Any help would be appreciated. Thanks.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 13 (6 by maintainers)

Top GitHub Comments

1 reaction
tbabej commented, Apr 1, 2021

Not a ray maintainer, but for what it’s worth the namespaces have different network interfaces, so if the service is exposed through a different namespace, port forwarding does not work. That’s pure K8s though, nothing to do with Ray AFAIK.

With respect to this issue however, I don’t think we should close it. It is certainly unintended behaviour that the older version of the client (1.2.0) cannot connect to a newer server (what will become 1.3.0) and fails silently at that. At the very least the docs need to be updated to point this out, but hopefully the server is modified to be compatible with the older client. Hence we should reopen, and maybe rephrase the title to “Ray 1.2.0 fails to connect to Ray cluster on K8s (running master)”
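
The mismatch described here (a 1.2.0 client against a server built from master) can be caught before connecting by comparing version strings. A rough sketch, assuming the client version is read from `ray.__version__` and the server version is obtained separately (for example, by printing `ray.__version__` inside the head pod); `release_tuple` and `same_release` are hypothetical helpers, and the version strings in the example are illustrative only:

```python
def release_tuple(version: str) -> tuple:
    """Extract the leading numeric (major, minor, patch) parts of a version.

    Hypothetical helper: stops at the first component that does not
    start with a digit, so suffixes like ".dev0" are ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def same_release(client_version: str, server_version: str) -> bool:
    """Return True when both versions share the same numeric release."""
    return release_tuple(client_version) == release_tuple(server_version)


if __name__ == "__main__":
    # Illustrative strings; substitute the real client and server versions.
    print(same_release("1.2.0", "1.3.0"))
```

A check like this only flags a difference in release numbers. Whether two given releases are actually wire-compatible is something only the Ray maintainers can confirm.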

1 reaction
ssiegel95 commented, Mar 23, 2021

Any fix planned for the section that doesn’t work?
