[BUG] Map tasks running on SPOT instances - pods stuck in `terminating` state forever
Describe the bug
When map tasks run on SPOT instances and the node dies, the pod stays in a “Terminating” state forever (it never actually finishes terminating and erroring out). This causes Flyte to treat the subtask as “running” indefinitely instead of retrying it. Other Flyte task types seem to handle this correctly.
Expected behavior
If a SPOT instance is not available, the subtask should terminate gracefully and be retried up to the specified number of retries.
Additional context to reproduce
- Run a heavy map task with ~100 subtasks (using the K8s array plugin) that requires GPU instances on AWS (a minimal flytekit sketch of such a task follows this list).
- Configure the AWS Auto Scaling group to use on-demand instances initially.
- Once some map subtasks have SUCCEEDED, switch the Auto Scaling group from on-demand to SPOT instances in the console.
- Some pods will get stuck in the `terminating` state.
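A minimal flytekit sketch of this kind of map task, assuming hypothetical task/workflow/parameter names; the resource values simply mirror the stuck pod's spec shown later in this issue (4 CPU, 56Gi memory, 1 GPU), and `retries` is the count the subtasks are expected to honor when a node disappears:

```python
# Minimal sketch (hypothetical names) of a GPU map task with retries,
# mirroring the resources in the stuck pod's spec.
from typing import List

from flytekit import Resources, map_task, task, workflow


@task(
    retries=3,  # expected behavior: subtasks retry up to this count when a node dies
    requests=Resources(cpu="4", mem="56Gi", gpu="1"),
    limits=Resources(cpu="4", mem="56Gi", gpu="1"),
)
def run_map_task(index: int) -> int:
    # Placeholder for the heavy GPU work done by each subtask.
    return index


@workflow
def perf_test_wf(indexes: List[int]) -> List[int]:
    # ~100 inputs produce ~100 subtask pods via the K8s array plugin.
    return map_task(run_map_task)(index=indexes)
```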
Screenshots
Pods list:
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pods -n dev | grep a8mwjq5z94p55fxhk9zl
a8mwjq5z94p55fxhk9zl-n2-0-0 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-1 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-11 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-13 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-19 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-2 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-24 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-27 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-3 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-30 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-31 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-34 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-36 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-37 0/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-4 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-43 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-49 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-5 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-50 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-53 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-54 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-56 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-57 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-6 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-7 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-72 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-76 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-8 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-82 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-86 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-87 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-9 1/1 Terminating 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-91 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-92 0/1 Completed 0 9h
a8mwjq5z94p55fxhk9zl-n2-0-94 0/1 Completed 0 9h
‘Terminating’ pod information:
vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev -o yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
flyte.lyft.com/deployment: flyte-l5
kubernetes.io/psp: eks.privileged
creationTimestamp: "2022-07-19T05:47:08Z"
deletionGracePeriodSeconds: 0
deletionTimestamp: "2022-07-19T05:59:45Z"
finalizers:
- flyte/array
labels:
domain: dev
execution-id: a8mwjq5z94p55fxhk9zl
interruptible: "false"
manager: avora
node-id: n2
owner-email: mtoledo
owner-name: mtoledo
platform: flyte
project: avdelorean
shard-key: "21"
task-name: src-backend-delorean-delorean-map-base-mapper-run-map-task-0
team: compute-infra
workflow-name: src-planning-lib-prediction-metrics-prediction-metrics-processo
name: a8mwjq5z94p55fxhk9zl-n2-0-27
namespace: dev
ownerReferences:
- apiVersion: flyte.lyft.com/v1alpha1
blockOwnerDeletion: true
controller: true
kind: flyteworkflow
name: a8mwjq5z94p55fxhk9zl
uid: 0d0a8d7c-f935-437c-b339-d003c7643827
resourceVersion: "9478565284"
uid: 78bb2022-c7d6-4f47-9832-d12656cbdb2c
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: l5.lyft.com/pool
operator: In
values:
- eks-pdx-pool-gpu
containers:
- args:
- pyflyte-map-execute
- --inputs
- s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/inputs.pb
- --output-prefix
- s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/0
- --raw-output-data-prefix
- s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0
- --checkpoint-path
- s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0/_flytecheckpoints
- --prev-checkpoint
- '""'
- --resolver
- flytekit.core.python_auto_container.default_task_resolver
- --
- task-module
- src.backend.delorean.delorean_map_base
- task-name
- run_map_task
env:
- name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
value: avdelorean:dev:src.planning.lib.prediction.metrics.prediction_metrics_processor_wfe_map_class.PredictionMetricsProcessorMapWorkflowPerfTest
- name: FLYTE_INTERNAL_EXECUTION_ID
value: a8mwjq5z94p55fxhk9zl
- name: FLYTE_INTERNAL_EXECUTION_PROJECT
value: avdelorean
- name: FLYTE_INTERNAL_EXECUTION_DOMAIN
value: dev
- name: FLYTE_ATTEMPT_NUMBER
value: "0"
- name: FLYTE_INTERNAL_TASK_PROJECT
value: avdelorean
- name: FLYTE_INTERNAL_TASK_DOMAIN
value: dev
- name: FLYTE_INTERNAL_TASK_NAME
value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
- name: FLYTE_INTERNAL_TASK_VERSION
value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
- name: FLYTE_INTERNAL_PROJECT
value: avdelorean
- name: FLYTE_INTERNAL_DOMAIN
value: dev
- name: FLYTE_INTERNAL_NAME
value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
- name: FLYTE_INTERNAL_VERSION
value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
- name: KUBERNETES_REQUEST_TIMEOUT
value: "100000"
- name: L5_BASE_DOMAIN
value: l5.woven-planet.tech
- name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
value: "20"
- name: AWS_METADATA_SERVICE_TIMEOUT
value: "5"
- name: FLYTE_STATSD_HOST
value: flyte-telegraf.infrastructure
- name: KUBERNETES_CLUSTER_NAME
value: pdx
- name: FLYTE_K8S_ARRAY_INDEX
value: "27"
- name: BATCH_JOB_ARRAY_INDEX_VAR_NAME
value: FLYTE_K8S_ARRAY_INDEX
- name: L5_DATACENTER
value: pdx
- name: L5_ENVIRONMENT
value: pdx
- name: RUNTIME_POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: RUNTIME_POD_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: RUNTIME_NODE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: L5_NAMESPACE
value: dev
image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
imagePullPolicy: IfNotPresent
name: a8mwjq5z94p55fxhk9zl-n2-0-27
resources:
limits:
cpu: "4"
memory: 56Gi
nvidia.com/gpu: "1"
requests:
cpu: "4"
memory: 56Gi
nvidia.com/gpu: "1"
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: FallbackToLogsOnError
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-c459j
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
nodeName: ip-10-162-107-6.us-west-2.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Never
schedulerName: flyte-scheduler
securityContext:
fsGroup: 65534
serviceAccount: avdelorean-dev
serviceAccountName: avdelorean-dev
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: lyft.com/gpu
operator: Equal
value: dedicated
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
volumes:
- name: kube-api-access-c459j
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2022-07-19T05:51:35Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2022-07-19T05:54:40Z"
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2022-07-19T05:51:37Z"
status: "True"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2022-07-19T05:51:35Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://e07e4f0cec8265cd15c57833782c305e5581c9d463a97af8307ce3c40bd2c324
image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
imageID: docker-pullable://ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test@sha256:172ecf248838b1ec88e520528f0125451043769fb31c26d0bfc55057c98afabf
lastState: {}
name: a8mwjq5z94p55fxhk9zl-n2-0-27
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2022-07-19T05:51:37Z"
hostIP: 10.162.107.6
phase: Running
podIP: 10.162.72.241
podIPs:
- ip: 10.162.72.241
qosClass: Guaranteed
startTime: "2022-07-19T05:51:35Z"
Are you sure this issue hasn’t been raised already?
- Yes
Have you read the Code of Conduct?
- Yes
TL;DR: I think the reproduction steps are explainable, but we may want to add a flag making this work as intended. It would help to have more information to debug this (Pod info on a SPOT instance).

OK, I’ve had some time to explore this in depth, a few things:
- Flyte sets a finalizer to stop k8s from garbage collecting completed Pods before Flyte has the ability to detect them. But per the documentation, SPOT instances will attempt to delete a Pod gracefully (with the `delete` API) and then wait at least 2 minutes before killing it, regardless of finalizers / etc.
- If a Pod is non-terminal (i.e. it still has the `flyte/array` finalizer set), then when the autoscaler attempts to transition it to a SPOT instance it cannot delete the Pod because of the finalizer, so it is stuck in the `terminating` state. This is a corner case that we should probably handle. I think we could add an option, like `permit-external-deletions` or something, that allows non-terminal Pods to be deleted. There are other conceivable situations where this would be useful. (A hedged workaround sketch follows after this list.)
- I notice the Pod has `false` for the `interruptible` label. Is the task being mapped set up as interruptible? I’m not sure this relates here, but setting up an interruptible task is meant to inject tolerations / etc. to allow execution on SPOT instances.

@convexquad great to hear! Thanks for being so thoroughly descriptive and responsive, it really helps resolve issues like this quickly!
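As a stopgap while no `permit-external-deletions`-style option exists, one possible manual workaround (an assumption, not an official Flyte fix) is to clear the `flyte/array` finalizer on pods that already carry a `deletionTimestamp`, so the API server can finish the delete. A sketch using the official `kubernetes` Python client, with the namespace and finalizer name taken from the pod shown above:

```python
# Workaround sketch (assumption, not an official Flyte fix): clear the
# flyte/array finalizer on pods that are already marked for deletion so
# the API server can finish terminating them.
from kubernetes import client, config

NAMESPACE = "dev"          # namespace used in this issue
FINALIZER = "flyte/array"  # finalizer shown in the stuck pod's metadata

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(NAMESPACE).items:
    meta = pod.metadata
    if meta.deletion_timestamp is None or FINALIZER not in (meta.finalizers or []):
        continue
    # A strategic-merge patch with finalizers=null drops all finalizers
    # (flyte/array is the only one on these pods), letting deletion complete.
    v1.patch_namespaced_pod(
        name=meta.name,
        namespace=NAMESPACE,
        body={"metadata": {"finalizers": None}},
    )
    print(f"cleared finalizers on {meta.name}")
```

The trade-off is that once the finalizer is gone, Flyte may never observe the pod’s final state, which is exactly what a `permit-external-deletions` flag would have to account for. On the interruptible question: flytekit tasks can be declared with `interruptible=True` in the `@task` decorator, which is meant to add the tolerations needed to run on SPOT capacity.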