KubeCluster(scheduler_service_wait_timeout) parameter is not working.
See original GitHub issueHi,
Thank you for providing Dask functionality on Kubernetes.
I am encountering a problem with the KubeCluster()
method whilst deploying in mode=remote, by passing a scheduler_pod_template of type kubernetes.client.v1pod, and configuring Dask with the scheduler service of type LoadBalancer
.
This issue refers to the scheduler_service_wait_timeout
parameter in particular. When setting the value to 300 (as in three hundred seconds and type of int), the KubeCluster()
method still times out after 30s and I get an error, even if my scheduler Pod and Service are running without issue. Regardless of which value I pass to the parameter, I timeout after 30 seconds. This timeout triggers a cleanup and kills the Pod and Service in my EKS cluster accordingly.
I’ve read through the code, and I do not see where this timeout issue actually occurs within the dask-kubernetes
package, as it appears to be defined in the dask.distributed
(in partciular in distributed/deploy/spec.py
) package specifically.
What happened:
I get a Timeout error after 30 seconds.
What you expected to happen:
I (hopefully) do not get a Timeout error since provisioning an ELB in EKS should normally take roughly 120 seconds. Ideally, I get no errors, but in this case, proof that the service timeout parameter works is a valid proof of having solved the issue.
Minimal Complete Verifiable Example:
from kubernetes import client as k8sclient
from dask_kubernetes import KubeCluster, KubeConfig
scheduler_pod = k8sclient.V1Pod(
metadata=k8sclient.V1ObjectMeta(annotations={}),
spec=k8sclient.V1PodSpec(
containers=[
k8sclient.V1Container(
name="scheduler",
image="daskdev/dask:latest",
args=[
"dask-scheduler"
])],
tolerations=[
k8sclient.V1Toleration(
effect="NoSchedule",
operator="Equal",
key="nodeType",
value="scheduler")]
)
)
dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})
auth = KubeConfig(config_file="~/.kube/config")
cluster = KubeCluster(pod_template="worker.yaml",
namespace='dask',
deploy_mode="remote",
n_workers=0,
scheduler_service_wait_timeout=300,
scheduler_pod_template=scheduler_pod)
Anything else we need to know?:
My EKS Cluster is Private. The Jupyter Lab machines are in a different subnet from the EKS cluster, but in the same VPC. Deployment mode is set to remote with KubeCluster()
.
We do not want to use Port-Forwarding from the Jupyter Lab machines.
Environment:
- Dask version: Docker image is
daskdev/dask:latest
- Python version: 3.8.11
- Operating System: Ubuntu 20.04.2 LTS
- Install method (conda, pip, source): Conda
Issue Analytics
- State:
- Created 2 years ago
- Comments:22 (8 by maintainers)
Top GitHub Comments
@jacobtomlinson I can now confirm that I was able to get the
KubeCluster()
method to work. Further testing on functionality is now taking place. I had to change some network firewall rules to make this work in the end and avoid theTimeOutError
.Many thanks for the help!
Your template isn’t quite right.
annotations
isn’t a top-level key, it should be under themetadata
key. So it should probably look like this.