[enhancement] Running a Prefect flow results in pod evictions and potentially a bad state
Describe the bug
We have a `prefect-agent` pod running and have registered a few flows with it. As we have been working to get these flows working as expected (currently also experiencing dask-gateway issues), we have been running into issues with the cluster. As part of the Prefect flow testing, I will manually start a flow run from the Prefect Cloud console, and this action results in many critical pods being evicted and an additional `general` node being created, splitting the qhub cluster pods between these two `general` nodes.
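For context, this is roughly the shape of the flow registration involved; a minimal sketch, assuming Prefect 0.x with a Kubernetes agent, where the flow name, task, image, label, and project name are placeholders rather than our actual values:

```python
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def extract():
    return "hello"

# Placeholder flow; the real flows are registered from a GitHub Action workflow.
with Flow("example-flow") as flow:
    extract()

# The Kubernetes agent only picks up flow runs whose labels match its own, and each
# run it picks up is executed as a `prefect-job` Job that pulls the configured image.
flow.run_config = KubernetesRun(
    image="<aws-account>.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/<flow-image>:<tag>",
    labels=["k8s"],  # placeholder agent label
)

# Storage configuration (how the flow source ends up inside the image) is omitted here.
flow.register(project_name="<project>")
```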
Here is the most accurate timeline I have yet documented:
- Kick off a Prefect flow run from the Prefect Cloud console
  - This flow was registered with the Prefect agent using a GitHub Action workflow
  - The prefect flow image is stored in ECR and is quite large at around ~9 GB
- A `prefect-job` pod spins up on the `general` node and, after a few minutes, the pod fails and falls into a `CrashLoopBackOff` state with the following error messages:

  ```
  Type     Reason     Age                  From                                                Message
  ----     ------     ----                 ----                                                -------
  Normal   Scheduled  9m2s                 default-scheduler                                   Successfully assigned dev/prefect-job-beae9939-znksf to ip-10-10-19-99.us-east-2.compute.internal
  Warning  Failed     6m32s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/kubernetes-server-1.21.2-h77c71de_0.tar.bz2: no space left on device
  Warning  Failed     3m54s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libclang-11.1.0-default_ha53f305_1.tar.bz2: no space left on device
  Warning  Failed     53s                  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libgdal-3.2.1-h38ff51b_7/lib/libgdal.so.28.0.1: no space left on device
  Warning  Failed     53s (x3 over 6m32s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Error: ErrImagePull
  Normal   BackOff    25s (x4 over 6m30s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Back-off pulling image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792"
  ```
- At this point, many of the other pods on the `general` node start being quickly evicted
  - These include the `user-scheduler`, `qhub-traefik-ingress`, `qhub-jupyterhub-ssh` and `proxy` pods
  - There appeared to be a dozen or so evicted `user-scheduler` pods
  - Most of these pods have similar event log messages:

    ```
    Type     Reason               Age    From                                                Message
    ----     ------               ----   ----                                                -------
    Warning  Evicted              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node was low on resource: ephemeral-storage. Container user-scheduler was using 33228Ki, which exceeds its request of 0.
    Normal   Killing              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Stopping container user-scheduler
    Warning  ExceededGracePeriod  8m30s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Container runtime did not kill the pod within specified grace period.
    ```

  - Some of the `user-scheduler` pods have this event message:

    ```
    Type     Reason     Age  From                                                Message
    ----     ------     ---- ----                                                -------
    Normal   Scheduled  16m  default-scheduler                                   Successfully assigned dev/user-scheduler-57959bddf7-m9b6x to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Evicted    16m  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node had condition: [DiskPressure].
    ```
- Then the `conda-store` and `hub` pods get evicted and have trouble coming back online. `conda-store` and `hub` pod event messages:

  ```
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  79s (x52 over 11m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 max node group size reached
  Warning  FailedScheduling   29s (x11 over 11m)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict.
  ```
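All of the events above point at the same root cause: the image pull fills the node's disk, the node picks up a `DiskPressure` condition, and the kubelet starts evicting pods whose ephemeral-storage usage exceeds their (zero) requests. For anyone debugging this, here is a minimal sketch, assuming the official `kubernetes` Python client and kubeconfig access to the cluster, that reports each node's `DiskPressure` condition and allocatable ephemeral storage:

```python
from kubernetes import client, config

# Assumes kubeconfig access; use config.load_incluster_config() when running in-cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable.get("ephemeral-storage", "unknown")
    disk_pressure = next(
        (c.status for c in node.status.conditions if c.type == "DiskPressure"),
        "Unknown",
    )
    print(f"{node.metadata.name}: DiskPressure={disk_pressure}, "
          f"allocatable ephemeral-storage={allocatable}")
```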
After documenting all of the above, I waited a few minutes to see if the cluster would gracefully correct itself, but ultimately had to manually kill all of the evicted pods and manually kill the `prefect-job` job. And to get back down to one `general` node, I drained the new `general` node and force killed any pods that didn't want to be evicted. This resulted in a stable cluster.
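For reference, the manual cleanup of evicted pods can also be scripted; a minimal sketch, again assuming the `kubernetes` Python client and the `dev` namespace seen in the events above (evicted pods linger in the `Failed` phase with reason `Evicted` until deleted):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "dev"  # namespace from the events above
for pod in v1.list_namespaced_pod(namespace).items:
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        print(f"Deleting evicted pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
```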
After doing a bit of digging around and reading online, our hypothesis is that the attached block storage got overwhelmed by the size of the `prefect-job` image and the node was forced to evict many of the other pods. There are currently three block-store volumes attached to the `general` node:
- 100 GB for the conda-store
- 20 GB (unknown purpose)
- 1 GB (unknown purpose)
The next step for us is to reduce the size of the `prefect-job` image; however, we were also wondering if we could (or should) increase the size of the block stores attached to the `general` node.
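A related option, beyond shrinking the image or resizing the volumes, might be to declare ephemeral-storage requests and limits on the `prefect-job` itself so the scheduler accounts for its disk usage up front. This is a minimal sketch, assuming Prefect 0.x's `KubernetesRun` run config and its default job template (where the flow container is named `flow`); the flow name and the 10Gi/15Gi values are placeholders:

```python
from prefect import Flow
from prefect.run_configs import KubernetesRun

# Job template that overrides the flow container's resources. "flow" is the container
# name used by Prefect's default Kubernetes job template; adjust it if the deployment
# uses a custom template. The agent fills in the image, command, and environment.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "flow",
                        "resources": {
                            "requests": {"ephemeral-storage": "10Gi"},  # placeholder sizes
                            "limits": {"ephemeral-storage": "15Gi"},
                        },
                    }
                ],
            }
        }
    },
}

with Flow("example-flow") as flow:  # placeholder flow; tasks elided
    pass

flow.run_config = KubernetesRun(
    image="434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792",
    job_template=job_template,
)
```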
How can we help?
Help us help you.
- What are you trying to achieve?
  - Register a flow with the Prefect agent and have a `prefect-job` spin up without causing trouble for the cluster
- How can we reproduce the problem?
  - Yes, I'd be happy to show how I got to these logs.
Your environment
Describe the environment in which you are experiencing the bug. Include your conda version (use `conda --version`), k8s and any other relevant details.
Here is an example of implementing something like that: https://github.com/Quansight/qhub/pull/604
@viniciusdc we have redeployed a few times and, today, I ran several prefect flows without any issue. Thank you for following up! Closing this issue.