
k8s controller: DaskCluster's replicas lowered, worker pods not deleted


Issue summary

I ended up with active but unused worker pods: they were unusable by the scheduler, they blocked the creation of new pods when a DaskCluster needed to scale up, and they were not removed when the DaskCluster reduced its replica count.

My understanding

  • my_cluster.get_client() returns a dask.distributed.Client instance.
  • The client instance is connected to the associated cluster’s scheduler and gets its information about workers from the scheduler.
  • The scheduler is quick to adapt and scale down, and the DaskCluster resource often quickly follows.
  • The DaskCluster resource can adapt upwards again, but if a worker had previously been scaled away, it may end up unused and unregistered by the scheduler.
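
To make this concrete, here is a minimal sketch (assuming a reachable dask-gateway server; the adapt bounds mirror the 0-5 setup in the report below) of how the client’s view of workers comes from the scheduler rather than from the DaskCluster resource:

    from dask_gateway import Gateway

    gateway = Gateway()                   # assumes a reachable dask-gateway server
    cluster = gateway.new_cluster()       # the k8s controller creates a DaskCluster for this
    cluster.adapt(minimum=0, maximum=5)   # "adapt 0-5", as in the widget

    client = cluster.get_client()         # a dask.distributed.Client connected to the scheduler
    # The worker list below is whatever the scheduler currently knows about;
    # it can disagree with the number of worker pods or with DaskCluster replicas.
    print(len(client.scheduler_info()["workers"]))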

Original report extracted from #246

GatewayCluster widget inconsistent with actual pods

After my images had been pulled and pods could start quickly, I reset my state by deleting clusters etc. Then:

  1. I created a new cluster and chose to adapt 0-5 workers in it.
  2. I ran a job through a client connected to the cluster; it finished in 20 seconds. According to the client I had 5 workers for a while, and then 0 again.
  3. I observed that I still had 5 pods, and my DaskCluster had 5 replicas.
    • The controller is, in my mind, thereby doing its job: it ensures 5 worker replicas.
    • There is something wrong though, because the scheduler knew to delete a worker, but that didn’t lead to a change in the DaskCluster resource, so the controller didn’t remove any pods (a diagnostic sketch for comparing these views follows this list).
  4. I decided to try running my job again: would it add five new pods? It turned out no; it instead errored with a timeout, as if my workers had failed to start fast enough.
  5. I tried adapting to 6/6. That added one worker, and the client observed 1 worker while the DaskCluster resource reported 6 replicas and I saw 6 pods.
  6. I ran my workload, and ended up doing the work with a single worker.
  7. I tried adapting to 0-3 workers, but kept seeing 6 pods, even though the DaskCluster resource was updated to 1 replica. These were the controller’s logs:
    [D 2020-04-14 17:33:11.857 KubeController] Event - MODIFIED cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.857 KubeController] Reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.862 KubeController] Finished reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    
  8. I tried running my workload again, and it seems that my client never registered more than one worker at best.
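
The mismatch in steps 3-7 is between three views of the same cluster: the workers the scheduler reports, the worker pods that exist, and the replica count in the DaskCluster resource. Here is a hedged diagnostic sketch for comparing them, assuming the kubernetes Python client is installed and reusing the cluster name from the logs above; the namespace, label selector, CRD group/version, and object name format are assumptions and may differ per deployment:

    from dask_gateway import Gateway
    from kubernetes import client as k8s, config

    cluster_name = "dask-gateway.c047a173d45247dd81f232ab60d692ca"  # name taken from the logs above
    namespace = "dask-gateway"                                      # assumed namespace

    # View 1: workers the scheduler knows about.
    gateway = Gateway()
    dask_client = gateway.connect(cluster_name).get_client()
    print("scheduler sees:", len(dask_client.scheduler_info()["workers"]), "workers")

    # View 2: worker pods that actually exist (label selector is an assumption).
    config.load_kube_config()
    pods = k8s.CoreV1Api().list_namespaced_pod(
        namespace, label_selector="app.kubernetes.io/component=dask-worker"
    )
    print("worker pods:", len(pods.items))

    # View 3: replicas recorded in the DaskCluster custom resource
    # (CRD group/version and object name format are assumptions).
    dc = k8s.CustomObjectsApi().get_namespaced_custom_object(
        group="gateway.dask.org", version="v1alpha1",
        namespace=namespace, plural="daskclusters", name=cluster_name,
    )
    print("DaskCluster replicas:", dc.get("spec", {}).get("replicas"))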

This may be two separate issues. Hmm… or not?

  1. The controller fails to align correctly with the DaskCluster resource and never successfully deletes the surplus pods that the DaskCluster resource indicates it doesn’t need.
  2. Workers that the scheduler has used once aren’t reused later if they are adapted away, except perhaps a single worker.

Cluster adapt 0-5 can trigger as if it were 1-5

If I have a fresh cluster and press adapt 0-5, it doesn’t create a worker for me, but I have ended up in a state where just going from 0-0 to 0-5 would add back a replica in the DaskCluster resource. I think the scheduler ended up thinking it kept needing one. This is a state I failed to reproduce quickly with a new cluster.
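
A hedged repro sketch, assuming a fresh cluster with no work submitted (so the scheduler should keep requesting 0 workers); if the issue reproduces, a worker or an extra DaskCluster replica appears anyway:

    import time
    from dask_gateway import Gateway

    gateway = Gateway()
    cluster = gateway.new_cluster()
    cluster.adapt(minimum=0, maximum=5)   # no work submitted, so 0 workers are expected
    client = cluster.get_client()

    for _ in range(6):
        time.sleep(10)
        # With nothing scheduled, this should stay at 0; a nonzero count here
        # matches the "0-5 behaves like 1-5" behaviour described above.
        print("workers reported by the scheduler:", len(client.scheduler_info()["workers"]))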

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

consideRatio commented, May 4, 2020 (1 reaction)
$ pip list | grep distributed
distributed                   2.15.2
jcrist commented, May 4, 2020 (1 reaction)

According to https://stackoverflow.com/a/59658670, the restartCount is computed from the number of dead containers that have not yet been cleaned up. If the kubelet cleans up those containers, that number gets reset.
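
For reference, a hedged sketch of reading that restartCount from a pod’s container statuses with the kubernetes Python client (the pod name and namespace are hypothetical):

    from kubernetes import client as k8s, config

    config.load_kube_config()
    pod = k8s.CoreV1Api().read_namespaced_pod(
        name="dask-worker-example", namespace="dask-gateway"  # hypothetical names
    )
    for status in pod.status.container_statuses or []:
        # restartCount is derived from the kubelet's bookkeeping of dead containers,
        # so container garbage collection can reset it.
        print(status.name, "restartCount:", status.restart_count)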

@jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.

No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it; we don’t ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn’t know or do anything with k8s.
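
A small hedged sketch for checking this from the outside: list the pods in the gateway’s namespace and print their labels, which reflect that they are created and owned by dask-gateway’s controller rather than by dask.distributed (the namespace is an assumption; exact label keys vary per deployment):

    from kubernetes import client as k8s, config

    config.load_kube_config()
    for pod in k8s.CoreV1Api().list_namespaced_pod("dask-gateway").items:
        print(pod.metadata.name, pod.metadata.labels)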
