
k8s controller: DaskCluster's replicas lowered, worker pods not deleted


Issue summary

I ended up with active but unused worker pods: they were unusable by the scheduler, they blocked the creation of new pods when a DaskCluster needed to scale up, and they were not removed when the DaskCluster reduced its replica count.

My understanding

  • my_cluster.get_client() returns a dask.distributed.Client instance.
  • The client instance is connected to the associated cluster’s scheduler and gets its information about workers from the scheduler.
  • The scheduler is quick to adapt and scale down, and the DaskCluster resource often quickly follows.
  • The DaskCluster resource can adapt upwards again, but if a worker had previously been scaled away, it may end up unused and unregistered by the scheduler.
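
To make this concrete, here is a minimal sketch (assuming a reachable dask-gateway server; the adapt bounds mirror the 0-5 setup in the report below) of how the client’s view of workers comes from the scheduler rather than from the DaskCluster resource:

    from dask_gateway import Gateway

    gateway = Gateway()                   # assumes a reachable dask-gateway server
    cluster = gateway.new_cluster()       # the k8s controller creates a DaskCluster for this
    cluster.adapt(minimum=0, maximum=5)   # "adapt 0-5", as in the widget

    client = cluster.get_client()         # a dask.distributed.Client connected to the scheduler
    # The worker list below is whatever the scheduler currently knows about;
    # it can disagree with the number of worker pods or with DaskCluster replicas.
    print(len(client.scheduler_info()["workers"]))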

Original report extracted from #246

GatewayCluster widget inconsistent with actual pods

After my images had been pulled and pods could start quickly, I reset my state by deleting clusters etc. Then:

  1. I created a new cluster and chose to adapt 0-5 workers in it.
  2. I ran a job through a client connected to the cluster; it finished in 20 seconds. According to the client I had 5 workers for a while, and then 0 again.
  3. I observed that I still had 5 pods, and my DaskCluster had 5 replicas.
    • The controller is, in my mind, thereby doing its job: it ensures 5 worker replicas.
    • There is something wrong though, because the scheduler knew to delete a worker, but that didn’t lead to a change in the DaskCluster resource, so the controller didn’t remove any pods (a diagnostic sketch for comparing these views follows this list).
  4. I decided to try running my job again: would it add five new pods? It turned out no; it instead errored with a timeout, as if my workers had failed to start fast enough.
  5. I tried adapting to 6/6. That added one worker, and the client observed 1 worker while the DaskCluster resource reported 6 replicas and I saw 6 pods.
  6. I ran my workload, and ended up doing the work with a single worker.
  7. I tried adapting to 0-3 workers, but kept seeing 6 pods, even though the DaskCluster resource was updated to 1 replica. These were the controller’s logs:
    [D 2020-04-14 17:33:11.857 KubeController] Event - MODIFIED cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.857 KubeController] Reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    [I 2020-04-14 17:33:11.862 KubeController] Finished reconciling cluster dask-gateway.c047a173d45247dd81f232ab60d692ca
    
  8. I tried running my workload again, and it seems that my client never registered more than one worker at best.
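
The mismatch in steps 3-7 is between three views of the same cluster: the workers the scheduler reports, the worker pods that exist, and the replica count in the DaskCluster resource. Here is a hedged diagnostic sketch for comparing them, assuming the kubernetes Python client is installed and reusing the cluster name from the logs above; the namespace, label selector, CRD group/version, and object name format are assumptions and may differ per deployment:

    from dask_gateway import Gateway
    from kubernetes import client as k8s, config

    cluster_name = "dask-gateway.c047a173d45247dd81f232ab60d692ca"  # name taken from the logs above
    namespace = "dask-gateway"                                      # assumed namespace

    # View 1: workers the scheduler knows about.
    gateway = Gateway()
    dask_client = gateway.connect(cluster_name).get_client()
    print("scheduler sees:", len(dask_client.scheduler_info()["workers"]), "workers")

    # View 2: worker pods that actually exist (label selector is an assumption).
    config.load_kube_config()
    pods = k8s.CoreV1Api().list_namespaced_pod(
        namespace, label_selector="app.kubernetes.io/component=dask-worker"
    )
    print("worker pods:", len(pods.items))

    # View 3: replicas recorded in the DaskCluster custom resource
    # (CRD group/version and object name format are assumptions).
    dc = k8s.CustomObjectsApi().get_namespaced_custom_object(
        group="gateway.dask.org", version="v1alpha1",
        namespace=namespace, plural="daskclusters", name=cluster_name,
    )
    print("DaskCluster replicas:", dc.get("spec", {}).get("replicas"))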

This may be two separate issues. Hmm… or not?

  1. The controller fails to align correctly with the DaskCluster resource and never successfully deletes the surplus pods that the DaskCluster resource indicates it doesn’t need.
  2. Workers that the scheduler has used once aren’t reused later if they are adapted away, except perhaps a single worker.

Cluster adapt 0-5 can trigger as if it were 1-5

If I have a fresh cluster and press adapt 0-5, it doesn’t create a worker for me, but I have ended up in a state where just going from 0-0 to 0-5 would add back a replica in the DaskCluster resource. I think the scheduler ended up thinking it kept needing one. This is a state I failed to reproduce quickly with a new cluster.
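
A hedged repro sketch, assuming a fresh cluster with no work submitted (so the scheduler should keep requesting 0 workers); if the issue reproduces, a worker or an extra DaskCluster replica appears anyway:

    import time
    from dask_gateway import Gateway

    gateway = Gateway()
    cluster = gateway.new_cluster()
    cluster.adapt(minimum=0, maximum=5)   # no work submitted, so 0 workers are expected
    client = cluster.get_client()

    for _ in range(6):
        time.sleep(10)
        # With nothing scheduled, this should stay at 0; a nonzero count here
        # matches the "0-5 behaves like 1-5" behaviour described above.
        print("workers reported by the scheduler:", len(client.scheduler_info()["workers"]))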

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

consideRatio commented, May 4, 2020 (1 reaction)
$ pip list | grep distributed
distributed                   2.15.2
jcrist commented, May 4, 2020 (1 reaction)

According to https://stackoverflow.com/a/59658670, the restartCount is computed from the number of dead containers that have not yet been cleaned up. If the kubelet cleans up those containers, that number gets reset.
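
For reference, a hedged sketch of reading that restartCount from a pod’s container statuses with the kubernetes Python client (the pod name and namespace are hypothetical):

    from kubernetes import client as k8s, config

    config.load_kube_config()
    pod = k8s.CoreV1Api().read_namespaced_pod(
        name="dask-worker-example", namespace="dask-gateway"  # hypothetical names
    )
    for status in pod.status.container_statuses or []:
        # restartCount is derived from the kubelet's bookkeeping of dead containers,
        # so container garbage collection can reset it.
        print(status.name, "restartCount:", status.restart_count)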

@jcrist, ah, so the scheduler creates the pods, and it is dask.distributed that owns the scheduler? Ah.

No, sorry. dask-gateway creates and manages the scheduler and worker pods, but once a pod is created we only observe it; we don’t ever update a created pod (e.g. for a restart). The k8s pod controller is responsible for handling that. dask.distributed doesn’t know or do anything with k8s.
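
A small hedged sketch for checking this from the outside: list the pods in the gateway’s namespace and print their labels, which reflect that they are created and owned by dask-gateway’s controller rather than by dask.distributed (the namespace is an assumption; exact label keys vary per deployment):

    from kubernetes import client as k8s, config

    config.load_kube_config()
    for pod in k8s.CoreV1Api().list_namespaced_pod("dask-gateway").items:
        print(pod.metadata.name, pod.metadata.labels)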
