
Kubernetes Cluster AutoScaler can result in failed Flow Runs

See original GitHub issue

Description

This only happens on K8s clusters that use the Cluster Autoscaler to dynamically provision K8s Nodes based on current workload demands in the cluster. In my case, I’ve observed this on AWS EKS with manual NodeGroups.

The Cluster Autoscaler evaluates the utilization of each Node in the K8s cluster to determine whether it can scale down the number of Nodes. By default a Node must be unneeded for 10 minutes before it is scaled down (configurable via the --scale-down-unneeded-time flag), and whether it is unneeded is determined by a number of factors, the main one being that the sum of the CPU and memory requests of all Pods running on the Node is smaller than 50% of the Node’s allocatable capacity (configurable via --scale-down-utilization-threshold). There’s also a 10-minute back-off period after a Node is first added to the cluster (configurable via the --scale-down-delay-after-add flag).
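
For reference, these scale-down settings are passed as command-line flags on the cluster-autoscaler container itself. Below is a minimal sketch of the relevant fragment of its Deployment, assuming the AWS provider and showing the documented defaults; the image tag and the rest of the manifest are illustrative.

# Abbreviated fragment of the cluster-autoscaler Deployment.
# The flag values shown are the documented defaults; the image tag is illustrative.
containers:
  - name: cluster-autoscaler
    image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.17.3   # illustrative tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-unneeded-time=10m
      - --scale-down-utilization-threshold=0.5
      - --scale-down-delay-after-add=10m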

This functionality of K8s assumes that all Pods created by a controller object (Deployment, ReplicaSet, Job, StatefulSet, etc.) are ephemeral and can be relocated to a different Node with zero impact on the functionality of those Pods. The Prefect Kubernetes Agent triggers Flow Runs via K8s Jobs (effectively declaring them ephemeral), which spawn the Pods that perform the actual execution.
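
To make the mechanics concrete, here is a simplified sketch of the kind of Job the agent submits; because the resulting Pod is created by a controller and carries no eviction-blocking annotation, the Cluster Autoscaler treats it as freely movable. The names and image below are illustrative, not the agent’s actual output.

# Simplified sketch only -- the real Job generated by the Prefect Agent
# uses its own names, labels, and environment variables.
apiVersion: batch/v1
kind: Job
metadata:
  name: prefect-flow-run          # illustrative
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flow
          image: my-registry/long-running-flow:latest   # illustrative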

The problem is intermittent, and only comes into play under the following situation:

  1. A new, long-running (i.e. >21 minutes) Prefect Flow Run is scheduled and picked up by a Prefect Agent running in a K8s cluster with the Cluster Autoscaler
  2. The Prefect Agent creates a K8s Job, but no K8s Node is available to run this Job within 1-5 minutes
  3. The Cluster Autoscaler scales up the Nodes in K8s. The Node it adds has at least 2x the allocatable CPU/memory that the single Prefect Flow Run would consume.
  4. The Prefect Flow Run is assigned to the new Node the Cluster Autoscaler just started
  5. Capacity frees up on the other Nodes in the K8s cluster, such that no other Pods are scheduled on the new Node the Cluster Autoscaler just started (in step 3)
  6. After 20 minutes, the Cluster Autoscaler determines that the Node from step 3 is a candidate to scale down because it meets the criteria (i.e. the sum of CPU/memory requests of all Pods is smaller than 50% of allocatable, the Pods were created by a controller object, etc.)
  7. The Pod (which is executing the Flow Run) is evicted from this Node and rescheduled to a different Node in K8s, and the Cluster Autoscaler’s Node is removed from the cluster

Now the fun part:

  1. Because K8s evicted the Pod, Prefect Cloud will keep trying to track the original Pod (which no longer exists)
  2. When the evicted Pod starts again, it will begin the Flow Run execution from the beginning … and because this is a long-running Flow Run, it’ll take >21 minutes to complete
  3. Eventually Prefect will report “No heartbeat detected from the remote task; marking the run as failed.” for the initial Pod it created, which has since been relocated by the Cluster Autoscaler.
  4. You can observe that the newly started Pod is actually reporting logs to Prefect Cloud, so it’s curious why no heartbeat was detected from that new process. I’m guessing it’s looking for some metadata that identifies a heartbeat from the original Pod IP.
  5. Because no heartbeat was detected from that original Pod, the Lazarus process will attempt to reschedule the Flow Run … but the Flow Run was already restarted by K8s; it’s just running in a different Pod now
  6. This confusion results in the Flow Run being marked as “Failed”

While less likely, it’s also feasible for a Flow Run to be scheduled on a Node that has been up for a while but is a candidate to be scaled down within the next minute. Then, after ~1 minute of executing the Flow Run on this Node, capacity on other Nodes could free up, such that the Cluster Autoscaler decides to relocate all the running Pods, which would result in the same type of failure. In this case, the failure could happen to any Flow Run, regardless of execution time.

Expected Behavior

The root of the problem is that the Cluster Autoscaler moves the Pod created by the Job the Prefect Agent generated for this Flow Run before that Flow Run has had time to complete. It seems the scheduler may not be capable of handling situations where the Flow Run is restarted by processes outside of its control.

A simple way to prevent the Cluster Autoscaler from attempting to evict a Pod from a Node would be to add this annotation to its manifest:

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

If the Cluster Autoscaler sees this annotation on a Pod, it will not consider that Pod’s Node for scaling down. In theory you could add this to a job_spec_file, but it’d be exhausting to do this for every Flow you want to execute in a K8s cluster with the Cluster Autoscaler.
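
For illustration, here is where that annotation would sit in such a job spec; it has to go on the Pod template (not the Job’s own metadata) so that the Pods the Job spawns carry it. The manifest is abbreviated to the relevant fields.

# Abbreviated sketch: annotating the Pod template of the Job so that the
# Cluster Autoscaler will not evict the resulting flow-run Pod.
apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"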

Reproduction

There’s a public Docker image containing a flow that simply sleeps for a few minutes: https://hub.docker.com/repository/docker/szelenka/long-running-flow

In AWS, I have an EKS cluster with a simple NodeGroup of t3.2xlarge instances, where the Prefect Agent spawns Jobs that request 2 vCPU and 2 GB of RAM. Initially, only the Prefect Agent is running, on a single Node in K8s.

Through Prefect Cloud, we start 3 Flow Runs of 5 minutes each. This fills up the capacity of the single K8s Node.

Then we start another Flow Run with a run time of 25 minutes. This causes the Cluster Autoscaler to scale up a new Node in K8s, and the Job is scheduled on that new Node.

Because this run takes longer than 20 minutes to complete, and the other Flow Runs have finished by then, its Pod is evicted and relocated to the other Node where capacity has freed up, triggering the failure described above.
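
For context on the sizing: a t3.2xlarge offers 8 vCPU and 32 GiB of memory, so with the agent and system Pods taking their share, three Flow Runs requesting 2 vCPU each presumably leave less than 2 vCPU of allocatable capacity, which is why the fourth run forces a scale-up. A sketch of the per-run resource requests described above, as they would appear in the flow-run container spec (placement and exact values are illustrative):

# Illustrative fragment of the flow-run container spec used in the
# reproduction; three such Pods plus the agent roughly fill one t3.2xlarge.
resources:
  requests:
    cpu: "2"
    memory: 2Gi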

Environment

{
  "config_overrides": {
    "cloud": {
      "use_local_secrets": true
    },
    "context": {
      "secrets": false
    }
  },
  "env_vars": [],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_version": "0.12.6",
    "python_version": "3.8.1"
  }
}

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 9
  • Comments: 20 (3 by maintainers)

Top GitHub Comments

1 reaction
joshmeek commented, Jul 30, 2020

@szelenka Thanks for a well written issue! I think the best bet here would be to add the eviction policy option setting to the agent so you could say something along the lines of:

prefect agent install/start kubernetes --safe-to-evict=False

And then all created jobs will have the annotation

"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

0 reactions
github-actions[bot] commented, Dec 4, 2022

This issue was closed because it has been stale for 14 days with no activity. If this issue is important or you have more to add feel free to re-open it.


