
Why does the computation on my GKE Dask cluster fail?

See original GitHub issue

Hello!

The cluster:

gcloud container clusters create "cluster-1" --zone "europe-west1-c" --machine-type "e2-standard-4" --num-nodes "2"

gcloud container clusters get-credentials cluster-1 --zone europe-west1-c

worker.yaml:

kind: Pod
metadata:
  labels:
    foo: bar
spec:
  restartPolicy: Never
  containers:
  - image: daskdev/dask:latest
    imagePullPolicy: IfNotPresent
    args: [dask-worker, --nthreads, '1', --no-dashboard, --memory-limit, 2GB, --death-timeout, '60']
    name: dask
    env:
      - name: EXTRA_PIP_PACKAGES
        value: git+https://github.com/dask/distributed
    resources:
      limits:
        cpu: 1
        memory: 2G
      requests:
        cpu: 1
        memory: 2G

Deploying the scheduler and two workers:

>>> from dask_kubernetes import KubeCluster
>>> cluster = KubeCluster('worker.yaml')

Creating scheduler pod on cluster. This may take some time.
Forwarding from 127.0.0.1:53648 -> 8786
Forwarding from [::1]:53648 -> 8786
Handling connection for 53648
Handling connection for 53648
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found

+-------------+-----------+-----------------------+---------+
| Package     | client    | scheduler             | workers |
+-------------+-----------+-----------------------+---------+
| distributed | 2021.04.0 | 2021.04.0+7.g053f99b8 | None    |
+-------------+-----------+-----------------------+---------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))
Handling connection for 53648

>>> cluster.scale(2)
>>> cluster.workers
{0: <Pod Worker: status=Status.running>, 1: <Pod Worker: status=Status.running>}

>>> print(json.dumps(cluster.scheduler_info, indent=2))
{
  "type": "Scheduler",
  "id": "Scheduler-323084ab-db05-4639-bfb8-803d0d4c92f6",
  "address": "tcp://10.0.0.4:8786",
  "services": {
    "dashboard": 8787
  },
  "started": 1618243652.8047366,
  "workers": {
    "tcp://10.0.0.5:34101": {
      "type": "Worker",
      "id": 1,
      "host": "10.0.0.5",
      "resources": {},
      "local_directory": "/dask-worker-space/worker-u52d79l5",
      "name": 1,
      "nthreads": 1,
      "memory_limit": 1999998976,
      "services": {
        "dashboard": 35195
      },
      "nanny": "tcp://10.0.0.5:35723"
    },
    "tcp://10.0.1.9:37065": {
      "type": "Worker",
      "id": 0,
      "host": "10.0.1.9",
      "resources": {},
      "local_directory": "/dask-worker-space/worker-tjbrsf1d",
      "name": 0,
      "nthreads": 1,
      "memory_limit": 1999998976,
      "services": {
        "dashboard": 38733
      },
      "nanny": "tcp://10.0.1.9:34771"
    }
  }
}

Everything seems to be working fine so far:

$ kubectl get all
NAME                               READY   STATUS    RESTARTS   AGE
pod/dask-raffael-f4ac2841-7ddfn7   1/1     Running   0          5m51s
pod/dask-raffael-f4ac2841-7dgpj8   1/1     Running   0          6m26s
pod/dask-raffael-f4ac2841-7drdth   1/1     Running   0          5m51s

NAME                              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
service/dask-raffael-f4ac2841-7   ClusterIP   10.3.249.221   <none>        8786/TCP,8787/TCP   5m56s
service/kubernetes                ClusterIP   10.3.240.1     <none>        443/TCP             16m

Now I submit a task to the cluster:

>>> import dask.array as da
>>> from dask.distributed import Client
>>> client = Client(cluster)

Handling connection for 62631
Handling connection for 62631
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found

+-------------+---------------+-----------------------+-----------------------+
| Package     | client        | scheduler             | workers               |
+-------------+---------------+-----------------------+-----------------------+
| distributed | 2021.04.0     | 2021.04.0+7.g053f99b8 | 2021.04.0+7.g053f99b8 |
| numpy       | 1.20.1        | 1.18.1                | 1.18.1                |
| python      | 3.8.8.final.0 | 3.8.0.final.0         | 3.8.0.final.0         |
+-------------+---------------+-----------------------+-----------------------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

>>> x = da.random.random((3000, 3000), chunks=(1000, 1000))
>>> y = x + x.T
>>> y.compute()

Traceback (most recent call last):
Handling connection for 62631
  File "<stdin>", line 1, in <module>
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/dask/base.py", line 284, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/dask/base.py", line 566, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 2666, in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 1975, in gather
    return self.sync(
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 843, in sync
    return sync(
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/home/raffael/.local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py", line 1841, in _gather
    raise exc
concurrent.futures._base.CancelledError: ('add-6fdc66ee13ca592f908534ae532b03a0', 2, 2)
>>> Handling connection for 62631
/home/raffael/miniconda3/envs/dask/lib/python3.8/site-packages/distributed/client.py:1140: VersionMismatchWarning: Mismatched versions found

+-------------+---------------+-----------------------+-----------------------+
| Package     | client        | scheduler             | workers               |
+-------------+---------------+-----------------------+-----------------------+
| distributed | 2021.04.0     | 2021.04.0+7.g053f99b8 | 2021.04.0+7.g053f99b8 |
| numpy       | 1.20.1        | 1.18.1                | 1.18.1                |
| python      | 3.8.8.final.0 | 3.8.0.final.0         | 3.8.0.final.0         |
+-------------+---------------+-----------------------+-----------------------+
  warnings.warn(version_module.VersionMismatchWarning(msg[0]["warning"]))

As far as I can tell that should work just fine. Any ideas what is going wrong here?

Thanks

Raffael
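
The VersionMismatchWarning tables above already show which packages differ between the client and the cluster, which is the most likely lead here. A minimal local sketch of that comparison (version strings copied from the warning table in the transcript):

```python
# Versions copied from the VersionMismatchWarning table in the session
# above: client column vs. scheduler/workers column.
client_versions = {
    "distributed": "2021.04.0",
    "numpy": "1.20.1",
    "python": "3.8.8.final.0",
}
cluster_versions = {
    "distributed": "2021.04.0+7.g053f99b8",
    "numpy": "1.18.1",
    "python": "3.8.0.final.0",
}

# Collect every package whose client version differs from the cluster's.
mismatched = {
    pkg: (client_versions[pkg], cluster_versions[pkg])
    for pkg in client_versions
    if client_versions[pkg] != cluster_versions[pkg]
}
for pkg, (local, remote) in mismatched.items():
    print(f"{pkg}: client={local} cluster={remote}")
```

All three packages differ here, including distributed itself (a released version on the client vs. a git snapshot on the cluster, pulled in by the `EXTRA_PIP_PACKAGES` setting in worker.yaml), which matches the maintainer's eventual diagnosis.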

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

carderne commented on Dec 9, 2021 (1 reaction)

Just writing this here as I don’t have anything worthy of a new issue.

I followed the quickstart again, installing nothing but dask-kubernetes, and (apart from the version mismatch warnings) it all works fine.

Sorry for the confusion, clearly something clicked in the interim and I misattributed why it started working.

jacobtomlinson commented on Dec 9, 2021

@carderne sorry to hear you had trouble getting started.

Running pip install dask-kubernetes should give you everything you need with the right versions. If it doesn’t, that’s a bug and we need to fix it. If you are having issues with a fresh install, could you please raise a new issue and share the steps you took and the error messages so we can troubleshoot?

The above issue was caused by a different version of distributed being installed in the docker image than on the client. Given there hasn’t been any update for half a year I’m going to close this issue out.
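
Given that diagnosis, one way to avoid the mismatch is to pin the worker image to the same release as the client and drop the git install of distributed. A hedged sketch of the relevant worker.yaml fragment (the `2021.4.0` image tag is an assumption — check the available daskdev/dask tags on Docker Hub and match it to the release you have installed locally):

```yaml
  containers:
  - image: daskdev/dask:2021.4.0   # pin to the client's release (assumed tag)
    imagePullPolicy: IfNotPresent
    # No EXTRA_PIP_PACKAGES pointing at git+https://github.com/dask/distributed,
    # so the image's pinned distributed release is used instead of a git snapshot.
```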

