Why does the computation on my GKE Dask cluster fail?
Hello!
The cluster:
gcloud container clusters create "cluster-1" --zone "europe-west1-c" --machine-type "e2-standard-4" --num-nodes "2"
gcloud container clusters get-credentials cluster-1 --zone europe-west1-c
worker.yaml:
apiVersion: v1
kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
    - image: daskdev/dask:latest
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
      name: dask
      env:
        - name: EXTRA_PIP_PACKAGES
          value: git+https://github.com/dask/distributed
      resources:
        limits:
          cpu: 1
          memory: 2G
        requests:
          cpu: 1
          memory: 2G
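Note that the EXTRA_PIP_PACKAGES entry installs distributed from the git main branch inside each pod, so the workers end up running a development build (the `2021.04.0+7.g053f99b8` versions that show up in the warnings later). If the goal is matched versions, a safer variant pins the image tag and drops that env var (the tag shown is illustrative, not prescriptive):

```yaml
containers:
  - image: daskdev/dask:2021.4.0   # illustrative: pin to the same dask/distributed version as the client
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
    name: dask
    # no EXTRA_PIP_PACKAGES here: avoids installing distributed from git into the workers
```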
Deploying the scheduler and two workers:
>>> from dask_kubernetes import KubeCluster
>>> cluster = KubeCluster('worker.yaml')
Creating scheduler pod on cluster. This may take some time.
Forwarding from 127.0.0.1:53648 -> 8786
Forwarding from [::1]:53648 -> 8786
Handling connection for 53648
Handling connection for 53648
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found
+-------------+-----------+-----------------------+---------+
| Package | client | scheduler | workers |
+-------------+-----------+-----------------------+---------+
| distributed | 2021.04.0 | 2021.04.0+7.g053f99b8 | None |
+-------------+-----------+-----------------------+---------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
Handling connection for 53648
>>> cluster.scale(2)
>>> cluster.workers
{0: <Pod Worker: status=Status.running>, 1: <Pod Worker: status=Status.running>}
>>> print(json.dumps(cluster.scheduler_info, indent=2))
{
"type": "Scheduler",
"id": "Scheduler-323084ab-db05-4639-bfb8-803d0d4c92f6",
"address": "tcp://10.0.0.4:8786",
"services": {
"dashboard": 8787
},
"started": 1618243652.8047366,
"workers": {
"tcp://10.0.0.5:34101": {
"type": "Worker",
"id": 1,
"host": "10.0.0.5",
"resources": {},
"local_directory": "/dask-worker-space/worker-u52d79l5",
"name": 1,
"nthreads": 1,
"memory_limit": 1999998976,
"services": {
"dashboard": 35195
},
"nanny": "tcp://10.0.0.5:35723"
},
"tcp://10.0.1.9:37065": {
"type": "Worker",
"id": 0,
"host": "10.0.1.9",
"resources": {},
"local_directory": "/dask-worker-space/worker-tjbrsf1d",
"name": 0,
"nthreads": 1,
"memory_limit": 1999998976,
"services": {
"dashboard": 38733
},
"nanny": "tcp://10.0.1.9:34771"
}
}
}
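As an aside: cluster.scale(2) above fixes the worker count, but the same cluster object also supports adaptive scaling via cluster.adapt. The API is the same on KubeCluster; it is sketched here on a LocalCluster so the snippet is self-contained:

```python
from dask.distributed import Client, LocalCluster

# Start with no workers; adapt() lets the scheduler scale between 1 and 2
# workers based on load instead of a fixed cluster.scale(n).
cluster = LocalCluster(n_workers=0, threads_per_worker=1)
cluster.adapt(minimum=1, maximum=2)
client = Client(cluster)

# Submitting work triggers scale-up; .result() blocks until a worker exists
total = client.submit(sum, range(100)).result()
print(total)  # 4950

client.close()
cluster.close()
```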
Everything seems to be working fine so far:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/dask-raffael-f4ac2841-7ddfn7 1/1 Running 0 5m51s
pod/dask-raffael-f4ac2841-7dgpj8 1/1 Running 0 6m26s
pod/dask-raffael-f4ac2841-7drdth 1/1 Running 0 5m51s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/dask-raffael-f4ac2841-7 ClusterIP 10.3.249.221 <none> 8786/TCP,8787/TCP 5m56s
service/kubernetes ClusterIP 10.3.240.1 <none> 443/TCP 16m
Now I submit a computation to the cluster:
>>> import dask.array as da
>>> from dask.distributed import Client
>>> client = Client(cluster)
Handling connection for 62631
Handling connection for 62631
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found
+-------------+---------------+-----------------------+-----------------------+
| Package | client | scheduler | workers |
+-------------+---------------+-----------------------+-----------------------+
| distributed | 2021.04.0 | 2021.04.0+7.g053f99b8 | 2021.04.0+7.g053f99b8 |
| numpy | 1.20.1 | 1.18.1 | 1.18.1 |
| python | 3.8.8.final.0 | 3.8.0.final.0 | 3.8.0.final.0 |
+-------------+---------------+-----------------------+-----------------------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
>>> x = da.random.random((3000, 3000), chunks=(1000, 1000))
>>> y = x + x.T
>>> y.compute()
Traceback (most recent call last):
Handling connection for 62631
File "<stdin>", line 1, in <module>
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/dask/base.py", line 284, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/dask/base.py", line 566, in compute
results = schedule(dsk, keys, **kwargs)
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 2666, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 1975, in gather
return self.sync(
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
return sync(
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
raise exc.with_traceback(tb)
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
result[0] = yield future
File "/home/raffael/.local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 1841, in _gather
raise exc
concurrent.futures._base.CancelledError: ('add-6fdc66ee13ca592f908534ae532b03a0', 2, 2)
>>> Handling connection for 62631
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found
+-------------+---------------+-----------------------+-----------------------+
| Package | client | scheduler | workers |
+-------------+---------------+-----------------------+-----------------------+
| distributed | 2021.04.0 | 2021.04.0+7.g053f99b8 | 2021.04.0+7.g053f99b8 |
| numpy | 1.20.1 | 1.18.1 | 1.18.1 |
| python | 3.8.8.final.0 | 3.8.0.final.0 | 3.8.0.final.0 |
+-------------+---------------+-----------------------+-----------------------+
warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
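For reference, the identical graph completes without issue on a local cluster, which suggests the problem is environmental rather than in the code itself; a minimal sketch:

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Run the same computation on an in-process cluster to rule out the code itself
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

x = da.random.random((3000, 3000), chunks=(1000, 1000))
y = x + x.T          # symmetric by construction
result = y.compute() # succeeds locally; no CancelledError

client.close()
cluster.close()
```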
As far as I can tell that should work just fine. Any ideas what is going wrong here?
Thanks
Raffael
Issue Analytics
- Created: 2 years ago
- Comments: 5 (2 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Just writing this here as I don't have anything worthy of a new issue.
I followed the quickstart again, installing nothing but dask-kubernetes, and (apart from the version mismatch warnings) it all works fine. Sorry for the confusion; clearly something clicked in the interim and I misattributed why it started working.
@carderne sorry to hear you had trouble getting started.
Running pip install dask-kubernetes should give you everything you need with the right versions. If it doesn't, that's a bug and we need to fix it. If you are having issues with a fresh install, could you please raise a new issue and share the steps you took and the error messages so we can troubleshoot?
The above issue was caused by a different version of distributed being installed in the Docker image than on the client. Given there hasn't been any update for half a year, I'm going to close this issue out.
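For anyone landing here with the same CancelledError: distributed can turn the version-mismatch warning into a hard error, which surfaces this class of problem up front. A minimal sketch; with a real deployment you would pass the scheduler address (e.g. the tcp://10.0.0.4:8786 from the session above), while an in-process cluster keeps the example runnable:

```python
from dask.distributed import Client

# In production: Client("tcp://<scheduler-ip>:8786"); here an in-process
# cluster stands in so the example is self-contained.
client = Client(processes=False)

# check=True raises on a client/scheduler/worker version mismatch
# instead of only emitting a VersionMismatchWarning.
versions = client.get_versions(check=True)
print(sorted(versions))

client.close()
```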