
Conda-store pods get evicted on AWS


Describe the bug

It seems that conda-store pods get evicted on AWS on a fresh deployment. This was tested with the latest main commit: https://github.com/Quansight/qhub/commit/e7992115abfe65fd429999d5a4241e4863b2a85d

Output of kubectl describe on the pod:

Events:
  Type     Reason            Age                From                Message
  ----     ------            ----               ----                -------
  Warning  FailedScheduling  51s (x2 over 52s)  default-scheduler   0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
  Normal   TriggeredScaleUp  41s                cluster-autoscaler  pod triggered scale-up: [{eks-20bd5579-b270-ddc9-c256-f021f1d7978b 1->2 (max: 5)}]
  Warning  FailedScheduling  6s (x2 over 6s)    default-scheduler   0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match Pod's node affinity.
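The two signals in these events are the disk-pressure taint and the node-affinity mismatch. Both can be inspected directly with standard kubectl; a quick sketch (node output will vary by cluster):

# List the taints on every node; disk-pressure shows up here
kubectl describe nodes | grep -A3 Taints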

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
aktech commented, Jul 28, 2021

I think you would need a new deployment; if the volume node is already spun up in a conflicting zone, it is very unlikely to be moved after updating to the latest version.
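One way to check for this kind of zone mismatch is to compare the zone labels on the PersistentVolumes against those on the nodes. A sketch, assuming the standard topology labels are set (EKS sets them by default; clusters of this era may use failure-domain.beta.kubernetes.io/zone instead of topology.kubernetes.io/zone):

# Zone of each PersistentVolume vs. zone of each node
kubectl get pv -L topology.kubernetes.io/zone
kubectl get nodes -L topology.kubernetes.io/zone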

0 reactions
iameskild commented, Jul 28, 2021

That makes sense. Using the AWS console to confirm, the Availability Zone for the general node that these pods were running on was us-east-2a, whereas the 50 GB volume mounts are in us-east-2b.
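The same confirmation can be done from the AWS CLI instead of the console; the volume ID below is a placeholder:

# Report the availability zone of an EBS volume
aws ec2 describe-volumes --volume-ids vol-0123456789abcdef0 --query 'Volumes[].AvailabilityZone' --output text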

To get back to a working state, I drained the general node:

kubectl drain ip-10-10-4-189.us-east-2.compute.internal --ignore-daemonsets --delete-emptydir-data --force

I then manually killed any pods that wouldn't be force drained. This puts the node in a "cordoned" state, and a new node should spin up soon after; if you're lucky and the new node is launched in the same AZ as your volume mounts, the drained pods will be rescheduled onto it.
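For completeness, a sketch of those follow-up steps (pod and namespace names here are placeholders):

# Force-delete any pod that blocks the drain
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# Watch for the replacement node and confirm its zone matches the volumes
kubectl get nodes -L topology.kubernetes.io/zone --watch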


