Tornado Event Loop errors with KubeCluster object
Hi,
Thanks for providing Dask and its associated repositories and packages, and for the ability to scale in Kubernetes with KubeCluster.
What happened:
I have created a KubeCluster object in Jupyter Lab within AWS, where my Jupyter Lab instance runs on a VM in its own subnet and my EKS cluster is in a different subnet from that VM. I have allowed communication between these two subnets with a security group rule that permits TCP traffic on port 8786 (this is to let the distributed.client.Client talk to the Scheduler).
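(For reference, a rule of that shape can be created with boto3 roughly as follows; this is a hedged sketch with placeholder security group IDs, not the exact setup described above.)
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs: one group guards the EKS nodes, the other is attached to the Jupyter VM.
ec2.authorize_security_group_ingress(
    GroupId="sg-eks-nodes-placeholder",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8786,
        "ToPort": 8786,
        # Allow traffic from the security group of the Jupyter Lab VM.
        "UserIdGroupPairs": [{"GroupId": "sg-jupyter-vm-placeholder"}],
    }],
)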
I am able to do things like:
cluster = KubeCluster()
cluster.scale(n)
cluster.adapt()
cluster.close()
cluster.get_logs()
client = Client(cluster)
type(client) # returns distributed.client.Client
to give a few examples.
Here is the error message I get after executing the following example computation:
import numpy as np
import dask.array as da
arr = np.random.random((10000, 10000))
display(arr)
darr = da.from_array(arr, chunks=(100, 100))
display(darr)
display(darr.compute())
Error:
CancelledError: ('array-953ece85e0008dfa81081a2f6612e3da', 16, 20)
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <zmq.eventloop.ioloop.ZMQIOLoop object at 0x7f28c6a4e190>>, <Task finished name='Task-13282' coro=<Cluster._sync_cluster_info() done, defined at /opt/miniconda3/envs/mattia_env/lib/python3.8/site-packages/distributed/deploy/cluster.py:98> exception=OSError('Timed out during handshake while connecting to tcp://internal-*-region.elb.amazonaws.com:8786 after 200 s')>)
Traceback (most recent call last):
File "/opt/miniconda3/envs/mattia_env/lib/python3.8/site-packages/distributed/comm/tcp.py", line 198, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
Note that my configuration sets the connection timeout to 200 s, which is why the exception reads OSError('Timed out during handshake while connecting to tcp://internal-*-region.elb.amazonaws.com:8786 after 200 s').
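(That 200 s figure comes from the connect timeout I set via the Dask config, the same setting that appears in the MCVE below:)
import dask

# Raise the comm connect timeout to 200 seconds.
dask.config.set({"distributed.comm.timeouts.connect": 200})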
Also, the stack trace associated with this error is very long, since it effectively repeats the error message over and over until I shut down or restart the kernel in my Jupyter Notebook (I suspect this is natural given the event loop in use).
What you expected to happen:
Ideally, the computation succeeds without any issue. It should be noted that I do get the result of the computation; however, for some reason it appears it is unable to check the status or result of a particular task once it has finished.
Also, it is worth mentioning that I get a 401 error when interacting with the KubeCluster object after some (undefined) amount of time passes. I do not have an idle_timeout option set, so this is quite strange.
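(For reference, no idle timeout is configured anywhere in my setup; if one were, it would be set through the standard distributed config key, as in the sketch below with a purely illustrative value.)
import dask

# Hypothetical example only -- my deployment does NOT set this;
# distributed.scheduler.idle-timeout defaults to no timeout at all.
dask.config.set({"distributed.scheduler.idle-timeout": "1 hour"})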
Minimal Complete Verifiable Example:
from dask_kubernetes import KubeConfig, KubeCluster
from dask.distributed import Client
import dask, dask.distributed
auth = KubeConfig(config_file="~/.kube/config", context="cluster-name")
dask.config.set({"distributed.comm.timeouts.connect": 200})
from kubernetes import client as k8sclient
# We need to construct the specification using ObjectMeta, PodSpec, Container
scheduler_pod = k8sclient.V1Pod(
    kind="Pod",
    metadata=k8sclient.V1ObjectMeta(annotations={}),
    spec=k8sclient.V1PodSpec(
        containers=[
            k8sclient.V1Container(
                name="scheduler",
                image="daskdev/dask:latest",
                args=["dask-scheduler"],
            )
        ],
        tolerations=[
            k8sclient.V1Toleration(
                effect="NoSchedule",
                operator="Equal",
                key="nodeType",
                value="dask-scheduler",
            )
        ],
    ),
)
dask.config.set({"kubernetes.scheduler-service-type": "LoadBalancer"})
dask.config.set({"kubernetes.scheduler-service-wait-timeout": 300})
dask.config.set({"kubernetes.namespace": "dask"})
dask.config.set({"kubernetes.scheduler-template": scheduler_pod})
dask.config.set({"kubernetes.scheduler-service-template": {
"apiVersion": "v1",
"kind": "Service",
"metadata": {
"annotations": {
"service.beta.kubernetes.io/aws-load-balancer-internal": "true",
"service.beta.kubernetes.io/aws-load-balancer-subnets": "subnet-id",
},
},
"spec": {
"selector": {
"dask.org/cluster-name": "",
"dask.org/component": "scheduler",
},
"ports": [
{
"name": "comm",
"protocol": "TCP",
"port": 8786,
"targetPort": 8786,
},
{
"name": "dashboard",
"protocol": "TCP",
"port": 8787,
"targetPort": 8787,
},
],
},
}
}
)
cluster = KubeCluster(pod_template="worker.yaml",
                      deploy_mode="remote",
                      n_workers=0,
                      scheduler_pod_template=scheduler_pod)
dclient = Client(cluster)
type(dclient)
import numpy as np
import dask.array as da
arr = np.random.random((10000, 10000))
display(arr)
darr = da.from_array(arr, chunks=(100, 100))
display(darr)
display(darr.compute())
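(As an extra data point, plain TCP reachability of the scheduler service from the notebook VM can be checked with the sketch below; the hostname is a placeholder for the internal ELB address reported in the error.)
import socket

# Placeholder: substitute the internal ELB hostname of the scheduler service.
scheduler_host = "internal-example.region.elb.amazonaws.com"

# Raises a timeout / OSError if port 8786 is not reachable from this subnet.
with socket.create_connection((scheduler_host, 8786), timeout=10):
    print("TCP connection to port 8786 succeeded")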
Anything else we need to know?:
I use taints to place Dask Schedulers on a unique EKS Node Group, and also taint a unique Node Group for Dask Workers. I thought maybe this could be a reason why communication between the Scheduler and Workers is a bit unstable in this case.
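(For context, the tolerations that let worker pods land on that tainted Node Group can be expressed with the Kubernetes client as in the sketch below; the keys mirror the ones in the worker.yaml further down and are shown purely for illustration.)
from kubernetes import client as k8sclient

# Tolerations matching the worker Node Group taints (same keys as in worker.yaml below).
worker_tolerations = [
    k8sclient.V1Toleration(
        key="k8s.dask.org/dedicated",
        operator="Equal",
        value="worker",
        effect="NoSchedule",
    ),
    k8sclient.V1Toleration(
        key="k8s.dask.org_dedicated",
        operator="Equal",
        value="worker",
        effect="NoSchedule",
    ),
]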
Also, it should be noted that, for some reason, my AWS CNI plugin pods (there is one per Kubernetes worker node) try to reach out to the Scheduler on a regular basis, but their logs indicate Connection dropped before TCP handshake completed.
Furthermore, I use the following worker.yaml file:
# Worker
apiVersion: v1
kind: Pod
spec:
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    name: "dask-worker"
    args: ["dask-worker",
           $(DASK_SCHEDULER_ADDRESS),
           --memory-limit, 1GB,
           --nthreads, '1',
           --death-timeout, '60']
    env:
    - name: EXTRA_PIP_PACKAGES
      value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: 1
        memory: 1G
      requests:
        cpu: 1
        memory: 1G
  restartPolicy: OnFailure
  taints:
  - key: "k8s.dask.org/dedicated"
    value: 'worker'
    operator: "Equal"
    effect: "NoSchedule"
  - key: "k8s.dask.org_dedicated"
    value: 'worker'
    operator: "Equal"
    effect: "NoSchedule"
Does this worker.yaml file look fine? I do not see any issues with it at first glance, and, as mentioned, all of the KubeCluster methods seem to work just fine.
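(As a quick sanity check, a hedged sketch assuming worker.yaml sits in the working directory: parse the template and list its top-level spec keys to spot any field that is not a standard Pod spec field.)
import yaml

# Load the worker template and inspect the pod spec keys.
with open("worker.yaml") as f:
    pod_template = yaml.safe_load(f)

print(pod_template["kind"])          # Pod
print(sorted(pod_template["spec"]))  # e.g. containers, restartPolicy, taints, ...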
Environment:
Running KubeCluster in AWS EKS with Jupyter Notebook / Lab outside of the EKS Cluster.
- Dask version: 2021.09.1
- Python version: 3.8.11
- Operating System: Ubuntu 20.04 LTS
- Install method (conda, pip, source): Conda
Top GitHub Comments
Ah right I see, I missed the part where you said after some inactivity. This was fixed in #286 and that’ll be in the next release.
👍
Soon. We are working to unblock the CI in #368, then we can release. For now you can just install from main, which we keep reasonably stable.