
Tornado Event Loop errors with KubeCluster object


Hi,

Thanks for providing Dask, and associated repositories, packages and the ability to scale in Kubernetes with KubeCluster.

What happened:

I have created an object of type KubeCluster in Jupyter Lab within AWS, where my Jupyter Lab instance is running on a VM in its own subnet, and my EKS cluster is in a different subnet from that VM.

I have allowed communication between these two subnets with a security group rule that permits TCP traffic on port 8786 (this is to let the distributed.client.Client talk to the Scheduler).
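
For reference, the rule is roughly equivalent to this boto3 sketch (the group ID, region, and CIDR below are placeholders, not my actual values):

import boto3

# Placeholder values - substitute the security group attached to the EKS
# nodes / ELB and the CIDR of the Jupyter Lab subnet.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8786,
        "ToPort": 8786,
        "IpRanges": [{
            "CidrIp": "10.0.1.0/24",
            "Description": "Dask client -> scheduler comm",
        }],
    }],
)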

I am able to do things like:


cluster = KubeCluster()
cluster.scale(n)
cluster.adapt()
cluster.close()
cluster.get_logs()

client = Client(cluster)
type(client) # returns distributed.client.Client

to give a few examples.

Here is the error message I get after executing the following example computation:

import numpy as np
import dask.array as da

arr = np.random.random((10000, 10000))
display(arr)

darr = da.from_array(arr, chunks=(100, 100))
display(darr)
display(darr.compute())

Error:

CancelledError: ('array-953ece85e0008dfa81081a2f6612e3da', 16, 20)

tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f28c6a4e190>>, <Task finished name='Task-13282' coro=<Cluster._sync_cluster_info() done, defined at /opt/miniconda3/envs/mattia_env/lib/python3.8/site-packages/distributed/deploy/cluster.py:98> exception=OSError('Timed out during handshake while connecting to tcp://internal-*-region.elb.amazonaws.com:8786 after 200 s')>)
Traceback (most recent call last):
  File "/opt/miniconda3/envs/mattia_env/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

Note that the configuration I use sets the connection timeout to 200 s, which is why the exception reads: OSError('Timed out during handshake while connecting to tcp://internal-*-region.elb.amazonaws.com:8786 after 200 s').

Also, the stack trace associated with this error is very long, since it effectively repeats the error message over and over again until I shut down / restart the kernel in my Jupyter Notebook (I suspect this is just natural given the event loop used).

What you expected to happen:

Ideally, the computation succeeds without any issue. It should be noted that I do get the result of the computation; however, for some reason the client appears unable to check the status or result of a particular task once it is finished.

Also, it is worth mentioning that I get a 401 error when interacting with the KubeCluster object after some (undefined) amount of time passes. I do not have an idle_timeout option set, so this is quite strange.
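
For what it's worth, here is the kind of check I run to see whether the credentials from my kubeconfig are still being accepted when the 401 shows up (a sketch; it only assumes the same kubeconfig context and namespace used in the example below):

from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config(context="cluster-name")
try:
    client.CoreV1Api().list_namespaced_pod(namespace="dask")
except ApiException as e:
    # e.status / e.reason show 401 Unauthorized when the token is no longer accepted
    print(e.status, e.reason)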

Minimal Complete Verifiable Example:

from dask_kubernetes import KubeConfig, KubeCluster
from dask.distributed import Client
import dask, dask.distributed 

auth = KubeConfig(config_file="~/.kube/config", context="cluster-name")

dask.config.set({"distributed.comm.timeouts.connect": 200})
from kubernetes import client as k8sclient

# We need to construct the specification using ObjectMeta, PodSpec, Container
scheduler_pod = k8sclient.V1Pod(
                    kind="Pod",
                    metadata=k8sclient.V1ObjectMeta(annotations={}),
                    spec=k8sclient.V1PodSpec(
                        containers=[
                            k8sclient.V1Container(
                                name="scheduler",
                                image="daskdev/dask:latest",
                                args=[
                                    "dask-scheduler"
                                ])],
                        tolerations=[
                            k8sclient.V1Toleration(
                                effect="NoSchedule",
                                operator="Equal",
                                key="nodeType",
                                value="dask-scheduler")]
                    )
                )

dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})
dask.config.set({"kubernetes.namespace": "dask"})
dask.config.set({"kubernetes.scheduler-template": scheduler_pod})
dask.config.set({"kubernetes.scheduler-service-template": {
                    "apiVersion": "v1",
                    "kind": "Service",
                    "metadata": {
                        "annotations": {
                            "service.beta.kubernetes.io/aws-load-balancer-internal": "true",
                            "service.beta.kubernetes.io/aws-load-balancer-subnets": "subnet-id",
                        },
                    },
                    "spec": {
                        "selector": {
                            "dask.org/cluster-name": "",
                            "dask.org/component": "scheduler",
                        },
                        "ports": [
                            {
                                "name": "comm",
                                "protocol": "TCP",
                                "port": 8786,
                                "targetPort": 8786,
                            },
                            {
                                "name": "dashboard",
                                "protocol": "TCP",
                                "port": 8787,
                                "targetPort": 8787,
                            },
                        ],
                    },
                }
        }
)

cluster = KubeCluster(pod_template="worker.yaml",
                      deploy_mode="remote",
                      n_workers=0,
                      scheduler_pod_template=scheduler_pod)


dclient = Client(cluster)
type(dclient)

import numpy as np
import dask.array as da

arr = np.random.random((10000, 10000))
display(arr)

darr = da.from_array(arr, chunks=(100, 100))
display(darr)
display(darr.compute())
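
After building the cluster I usually run a quick sanity check along these lines (a sketch; it only assumes the cluster and dclient objects created above):

# Both should point at the internal ELB created by the LoadBalancer Service.
print(cluster.scheduler_address)   # tcp://...:8786
print(cluster.dashboard_link)      # http://...:8787/status

# Bring up one worker, round-trip a trivial task and compare package versions
# between client, scheduler and workers; mismatches can also surface as comm errors.
cluster.scale(1)
dclient.wait_for_workers(1)
print(dclient.submit(lambda x: x + 1, 10).result())
dclient.get_versions(check=True)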

Anything else we need to know?:

I use taints to place Dask Schedulers on a unique EKS Node Group, and also taint a unique Node Group for Dask Workers. I thought maybe this could be a reason why communication between the Scheduler and Workers is a bit unstable in this case.

Also, it should be noted that for some reason, my AWS CNI Plugin Pods try to reach out to the Scheduler on a regular basis (there is one of these per Kubernetes worker Node), but the logs indicate a Connection dropped before TCP handshake completed.
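
Here is roughly how I inspect the scheduler Service and its endpoints from the notebook when that happens (a sketch using the kubernetes Python client; the "dask" namespace and "cluster-name" context match the configuration above):

from kubernetes import client, config

config.load_kube_config(context="cluster-name")
v1 = client.CoreV1Api()

# The LoadBalancer Service created for the scheduler should expose 8786/8787
# behind the internal ELB hostname.
for svc in v1.list_namespaced_service(namespace="dask").items:
    ingress = svc.status.load_balancer.ingress or []
    print(svc.metadata.name, svc.spec.type, [i.hostname for i in ingress])

# The Endpoints object should list the scheduler pod IP on the same ports.
for ep in v1.list_namespaced_endpoints(namespace="dask").items:
    print(ep.metadata.name, ep.subsets)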

Furthermore, I use the following worker.yaml file:

# Worker
apiVersion: v1
kind: Pod
spec:
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    name: "dask-worker"
    args: ["dask-worker", 
            $(DASK_SCHEDULER_ADDRESS), 
            --memory-limit, 1GB, 
            --nthreads, '1', 
            --death-timeout, '60']
    env:
    - name: EXTRA_PIP_PACKAGES
      value: git+https://github.com/dask/distributed
    restartPolicy: OnFailure
    resources:
      limits:
        cpu: 1
        memory: 1G
      requests:
        cpu: 1
        memory: 1G
  taints:
  - key: "k8s.dask.org/dedicated"
    value: 'worker'
    operator: "Equal"
    effect: "NoSchedule"
  - key: "k8s.dask.org_dedicated"
    value: 'worker'
    operator: "Equal"
    effect: "NoSchedule"

Does this worker.yaml file look fine? I do not see any issues with it firsthand, and as mentioned, all of the KubeCluster methods seem to work just fine.
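
In case it helps, this is the small check I run against the file; it just parses the YAML and prints where the top-level fields ended up (a sketch, nothing Dask-specific):

import yaml

with open("worker.yaml") as f:
    pod = yaml.safe_load(f)

# The top-level spec keys show whether fields like restartPolicy or the
# taint/toleration block sit at the pod level or inside a container.
print(pod["kind"], sorted(pod["spec"].keys()))
for c in pod["spec"]["containers"]:
    print(c["name"], c.get("resources"))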

Environment:

Running KubeCluster in AWS EKS with Jupyter Notebook / Lab outside of the EKS Cluster.

  • Dask version: 2021.09.1
  • Python version: 3.8.11
  • Operating System: Ubuntu 20.04 LTS
  • Install method (conda, pip, source): Conda

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

2 reactions
jacobtomlinson commented, Oct 18, 2021

Ah right I see, I missed the part where you said after some inactivity. This was fixed in #286 and that’ll be in the next release.

1 reaction
jacobtomlinson commented, Oct 18, 2021

> This seems to have fixed it.
>
> 👍
>
> Any news on when the next release will be regarding the Auth issue after inactivity?

Soon. We are working to unblock the CI in #368, then we can release. For now you can just install from main, which we keep reasonably stable.
