[enhancement] Running a Prefect flow results in pod evictions and potentially a bad state
Describe the bug
We have a `prefect-agent` pod running and have registered a few flows with it. As we have been working to get these flows working as expected (currently also experiencing dask-gateway issues), we have been running into issues with the cluster. As part of the Prefect flow testing, I will manually start a flow run from the Prefect Cloud console, and this action results in many critical pods being evicted and an additional `general` node being created, splitting the qhub cluster pods between these two `general` nodes.
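For context, this is roughly the shape of the flow registration involved; a minimal sketch, assuming Prefect 0.x with a Kubernetes agent, where the flow name, task, image, label, and project name are placeholders rather than our actual values:

```python
from prefect import Flow, task
from prefect.run_configs import KubernetesRun

@task
def extract():
    return "hello"

# Placeholder flow; the real flows are registered from a GitHub Action workflow.
with Flow("example-flow") as flow:
    extract()

# The Kubernetes agent only picks up flow runs whose labels match its own, and each
# run it picks up is executed as a `prefect-job` Job that pulls the configured image.
flow.run_config = KubernetesRun(
    image="<aws-account>.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/<flow-image>:<tag>",
    labels=["k8s"],  # placeholder agent label
)

# Storage configuration (how the flow source ends up inside the image) is omitted here.
flow.register(project_name="<project>")
```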
Here is the most accurate timeline I have yet documented:
- Kick off a Prefect flow run from the Prefect Cloud console
  - This flow was registered with the Prefect agent using a GitHub Action workflow
  - The prefect flow image is stored in ECR and is quite large at around ~9 GB
- A `prefect-job` pod spins up on the `general` node and, after a few minutes, the pod fails and falls into a `CrashLoopBackOff` state with the following error messages:

  ```
  Type     Reason     Age                  From                                                Message
  ----     ------     ----                 ----                                                -------
  Normal   Scheduled  9m2s                 default-scheduler                                   Successfully assigned dev/prefect-job-beae9939-znksf to ip-10-10-19-99.us-east-2.compute.internal
  Warning  Failed     6m32s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/kubernetes-server-1.21.2-h77c71de_0.tar.bz2: no space left on device
  Warning  Failed     3m54s                kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libclang-11.1.0-default_ha53f305_1.tar.bz2: no space left on device
  Warning  Failed     53s                  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Failed to pull image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792": rpc error: code = Unknown desc = failed to register layer: Error processing tar file(exit status 1): write /opt/conda/pkgs/libgdal-3.2.1-h38ff51b_7/lib/libgdal.so.28.0.1: no space left on device
  Warning  Failed     53s (x3 over 6m32s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Error: ErrImagePull
  Normal   BackOff    25s (x4 over 6m30s)  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Back-off pulling image "434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792"
  ```
- At this point, many of the other pods on the `general` node start being quickly evicted
  - These include the `user-scheduler`, `qhub-traefik-ingress`, `qhub-jupyterhub-ssh` and `proxy` pods
  - There appeared to be a dozen or so evicted `user-scheduler` pods
  - Most of these pods have similar event log messages:

    ```
    Type     Reason               Age    From                                                Message
    ----     ------               ----   ----                                                -------
    Warning  Evicted              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node was low on resource: ephemeral-storage. Container user-scheduler was using 33228Ki, which exceeds its request of 0.
    Normal   Killing              8m40s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Stopping container user-scheduler
    Warning  ExceededGracePeriod  8m30s  kubelet, ip-10-10-19-99.us-east-2.compute.internal  Container runtime did not kill the pod within specified grace period.
    ```

  - Some of the `user-scheduler` pods have this event message:

    ```
    Type     Reason     Age  From                                                Message
    ----     ------     ---- ----                                                -------
    Normal   Scheduled  16m  default-scheduler                                   Successfully assigned dev/user-scheduler-57959bddf7-m9b6x to ip-10-10-19-99.us-east-2.compute.internal
    Warning  Evicted    16m  kubelet, ip-10-10-19-99.us-east-2.compute.internal  The node had condition: [DiskPressure].
    ```
- Then the `conda-store` and `hub` pods get evicted and have trouble coming back online. `conda-store` and `hub` pod event messages:

  ```
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  79s (x52 over 11m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 node(s) didn't match node selector, 1 max node group size reached
  Warning  FailedScheduling   29s (x11 over 11m)  default-scheduler   0/3 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taint {node.kubernetes.io/disk-pressure: }, that the pod didn't tolerate, 1 node(s) had volume node affinity conflict.
  ```
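All of the events above point at the same root cause: the image pull fills the node's disk, the node picks up a `DiskPressure` condition, and the kubelet starts evicting pods whose ephemeral-storage usage exceeds their (zero) requests. For anyone debugging this, here is a minimal sketch, assuming the official `kubernetes` Python client and kubeconfig access to the cluster, that reports each node's `DiskPressure` condition and allocatable ephemeral storage:

```python
from kubernetes import client, config

# Assumes kubeconfig access; use config.load_incluster_config() when running in-cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable.get("ephemeral-storage", "unknown")
    disk_pressure = next(
        (c.status for c in node.status.conditions if c.type == "DiskPressure"),
        "Unknown",
    )
    print(f"{node.metadata.name}: DiskPressure={disk_pressure}, "
          f"allocatable ephemeral-storage={allocatable}")
```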
After documenting all of the above, I waited a few minutes to see if the cluster would gracefully correct itself, but ultimately had to manually kill all of the evicted pods and manually kill the `prefect-job` job. And to get back down to one `general` node, I drained the new `general` node and force killed any pods that didn't want to be evicted. This resulted in a stable cluster.
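For reference, the manual cleanup of evicted pods can also be scripted; a minimal sketch, again assuming the `kubernetes` Python client and the `dev` namespace seen in the events above (evicted pods linger in the `Failed` phase with reason `Evicted` until deleted):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

namespace = "dev"  # namespace from the events above
for pod in v1.list_namespaced_pod(namespace).items:
    if pod.status.phase == "Failed" and pod.status.reason == "Evicted":
        print(f"Deleting evicted pod {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, namespace)
```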
After doing a bit of digging around and reading online, our hypothesis is that the attached block storage got overwhelmed by the size of the `prefect-job` image and the node was forced to evict many of the other pods. There are currently three block-store volumes attached to the `general` node:
- 100 GB for the conda-store
- 20 GB (unknown purpose)
- 1 GB (unknown purpose)
The next step for us is to reduce the size of the `prefect-job` image; however, we were also wondering if we could (or should) increase the size of the block stores attached to the `general` node.
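A related option, beyond shrinking the image or resizing the volumes, might be to declare ephemeral-storage requests and limits on the `prefect-job` itself so the scheduler accounts for its disk usage up front. This is a minimal sketch, assuming Prefect 0.x's `KubernetesRun` run config and its default job template (where the flow container is named `flow`); the flow name and the 10Gi/15Gi values are placeholders:

```python
from prefect import Flow
from prefect.run_configs import KubernetesRun

# Job template that overrides the flow container's resources. "flow" is the container
# name used by Prefect's default Kubernetes job template; adjust it if the deployment
# uses a custom template. The agent fills in the image, command, and environment.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "flow",
                        "resources": {
                            "requests": {"ephemeral-storage": "10Gi"},  # placeholder sizes
                            "limits": {"ephemeral-storage": "15Gi"},
                        },
                    }
                ],
            }
        }
    },
}

with Flow("example-flow") as flow:  # placeholder flow; tasks elided
    pass

flow.run_config = KubernetesRun(
    image="434027708253.dkr.ecr.us-east-2.amazonaws.com/prefect-workflows/datum-flows:1624894027929062792",
    job_template=job_template,
)
```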
How can we help?
Help us help you.
- What are you trying to achieve?
  - Register a flow with the Prefect agent and have a `prefect-job` spin up without causing trouble for the cluster
- How can we reproduce the problem?
  - Yes, I'd be happy to show how I got to these logs.
Your environment
Describe the environment in which you are experiencing the bug. Include your conda version (use `conda --version`), k8s and any other relevant details.
Here is an example of implementing something like that: https://github.com/Quansight/qhub/pull/604
@viniciusdc we have redeployed a few times and, today, I ran several prefect flows without any issue. Thank you for following up! Closing this issue.