
KubeCluster(scheduler_service_wait_timeout) parameter is not working.


Hi,

Thank you for providing Dask functionality on Kubernetes.

I am encountering a problem with KubeCluster() when deploying with deploy_mode="remote", passing a scheduler_pod_template of type kubernetes.client.V1Pod, and configuring the Dask scheduler Service as type LoadBalancer.

This issue concerns the scheduler_service_wait_timeout parameter in particular. When I set it to 300 (an int, i.e. three hundred seconds), KubeCluster() still times out after 30 seconds and raises an error, even though my scheduler Pod and Service are running without issue. Whatever value I pass to the parameter, the timeout fires after 30 seconds, and the resulting cleanup kills the Pod and Service in my EKS cluster.

I've read through the code and cannot see where this timeout actually originates within the dask-kubernetes package; it appears to be raised from the dask.distributed package instead (specifically in distributed/deploy/spec.py).
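
If the hard 30-second limit is coming from distributed rather than dask-kubernetes, one plausible culprit is the comm connect timeout, which defaults to 30s (the distributed.comm.timeouts.connect config key). As a diagnostic, here is a minimal sketch that raises it before building the cluster; that this is the timeout firing here is my assumption, not something the traceback confirms:

import dask
import distributed  # importing distributed registers its default config keys

# Assumption: the 30 s failure is distributed's comm connect timeout
# (default "30s"), not the scheduler-service wait itself.
dask.config.set({"distributed.comm.timeouts.connect": "300s"})

print(dask.config.get("distributed.comm.timeouts.connect"))  # -> "300s"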

What happened:

I get a Timeout error after 30 seconds.

What you expected to happen:

I do not get a timeout error: provisioning an ELB in EKS normally takes roughly 120 seconds, which fits comfortably within the 300-second window. Ideally there are no errors at all, but at minimum, evidence that the service wait timeout parameter is honoured would confirm the issue is solved.

Minimal Complete Verifiable Example:



import dask
from kubernetes import client as k8sclient
from dask_kubernetes import KubeCluster, KubeConfig

# Scheduler pod template; the toleration lets the pod run on nodes
# tainted with nodeType=scheduler.
scheduler_pod = k8sclient.V1Pod(
                    metadata=k8sclient.V1ObjectMeta(annotations={}),
                    spec=k8sclient.V1PodSpec(
                        containers=[
                            k8sclient.V1Container(
                                name="scheduler",
                                image="daskdev/dask:latest",
                                args=[
                                    "dask-scheduler"
                                ])],
                        tolerations=[
                            k8sclient.V1Toleration(
                                effect="NoSchedule",
                                operator="Equal",
                                key="nodeType",
                                value="scheduler")]
                    )
                )

# Expose the scheduler through a LoadBalancer Service and allow up to
# 300 s for it to become ready.
dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})

auth = KubeConfig(config_file="~/.kube/config")

cluster = KubeCluster(pod_template="worker.yaml",
                      namespace='dask',
                      deploy_mode="remote",
                      n_workers=0,
                      scheduler_service_wait_timeout=300,
                      scheduler_pod_template=scheduler_pod)
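
Independently of KubeCluster, the Service's ELB provisioning can be watched directly with the kubernetes client. A minimal sketch (the dask namespace matches the example above; the dask.org/component=scheduler label is an assumption about how dask-kubernetes tags the scheduler Service, so adjust the selector if it does not match):

import time
from kubernetes import client as k8sclient, config as k8sconfig

k8sconfig.load_kube_config(config_file="~/.kube/config")
v1 = k8sclient.CoreV1Api()

# Poll for up to ~5 minutes; an ELB typically takes ~2 minutes to provision.
for _ in range(30):
    svcs = v1.list_namespaced_service(
        "dask", label_selector="dask.org/component=scheduler"
    )
    for svc in svcs.items:
        ingress = svc.status.load_balancer.ingress
        print(svc.metadata.name, ingress[0].hostname if ingress else "pending")
    if any(s.status.load_balancer.ingress for s in svcs.items):
        break
    time.sleep(10)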

Anything else we need to know?:

My EKS cluster is private. The JupyterLab machines are in a different subnet from the EKS cluster, but in the same VPC. Deployment mode is set to "remote" in KubeCluster().

We do not want to use port-forwarding from the JupyterLab machines.

Environment:

  • Dask version: Docker image is daskdev/dask:latest
  • Python version: 3.8.11
  • Operating System: Ubuntu 20.04.2 LTS
  • Install method (conda, pip, source): Conda

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 22 (8 by maintainers)

Top GitHub Comments

1 reaction
mb-qco commented, Oct 14, 2021

@jacobtomlinson I can now confirm that I was able to get the KubeCluster() method to work; further functional testing is now under way. In the end, I had to change some network firewall rules to make this work and avoid the TimeoutError.

Many thanks for the help!

1 reaction
jacobtomlinson commented, Oct 14, 2021

Your template isn't quite right: annotations isn't a top-level key, it needs to be nested under metadata. So it should probably look like this:

import dask

dask.config.set(
    {
        "kubernetes.scheduler-service-template": {
            "apiVersion": "v1",
            "kind": "Service",
            "metadata": {
                "annotations": {
                    "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
                },
            },
            "spec": {
                "selector": {
                    "dask.org/cluster-name": "",
                    "dask.org/component": "scheduler",
                },
                "ports": [
                    {
                        "name": "comm",
                        "protocol": "TCP",
                        "port": 8786,
                        "targetPort": 8786,
                    },
                    {
                        "name": "dashboard",
                        "protocol": "TCP",
                        "port": 8787,
                        "targetPort": 8787,
                    },
                ],
            },
        }
    }
)
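
A note on the annotation (my reading, not stated in the thread): service.beta.kubernetes.io/aws-load-balancer-internal asks AWS for an internal ELB, which fits the private-EKS setup described above, since the JupyterLab machines sit in the same VPC. The config should be set before KubeCluster() is called so that the generated scheduler Service picks up the template.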