
[BUG] Map tasks running on SPOT instances - pods stuck in `terminating` state forever


Describe the bug

When map tasks run on SPOT instances and the node dies, the pod stays in a “Terminating” state forever (it never actually finishes terminating or errors out). This causes the subtask to be considered “running” forever instead of retrying. Other types of Flyte tasks seem to handle this correctly.

Expected behavior

If a SPOT instance is not available, the sub task should terminate gracefully and retry up to the specified number of retries.

Additional context to reproduce

  1. Run a heavy map task with ~100 sub tasks (using the K8s array plugin) that requires GPU instances on AWS.
  2. Configure the AWS Auto Scaling group to use on-demand instances initially.
  3. Once some map sub tasks have SUCCEEDED, switch the Auto Scaling group from on-demand to SPOT instances in the console.
  4. Some pods will get stuck in the Terminating state.
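As an aside, pods stuck this way can be spotted programmatically: a pod whose `metadata.deletionTimestamp` is set is “Terminating”, so anything that has carried that timestamp for a long time is stuck. A minimal stdlib-only sketch over `kubectl get pods -o json`-shaped dicts (the 10-minute threshold and the sample data are illustrative, not from the issue):

```python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: a deleted-but-still-present pod older than this is "stuck".
STUCK_AFTER = timedelta(minutes=10)

def find_stuck_terminating(pods: list[dict], now: datetime) -> list[str]:
    """Return names of pods whose deletionTimestamp is older than STUCK_AFTER.

    A pod with deletionTimestamp set is in the "Terminating" state; if it has
    lingered past the threshold (e.g. because a finalizer such as flyte/array
    blocks deletion), we flag it.
    """
    stuck = []
    for pod in pods:
        meta = pod.get("metadata", {})
        ts = meta.get("deletionTimestamp")
        if ts is None:
            continue  # pod is not being deleted
        deleted_at = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if now - deleted_at > STUCK_AFTER:
            stuck.append(meta["name"])
    return stuck

# Sample data shaped like `kubectl get pods -o json` items:
pods = [
    {"metadata": {"name": "a8mwjq5z94p55fxhk9zl-n2-0-27",
                  "deletionTimestamp": "2022-07-19T05:59:45Z",
                  "finalizers": ["flyte/array"]}},
    {"metadata": {"name": "a8mwjq5z94p55fxhk9zl-n2-0-0"}},  # completed normally
]
now = datetime(2022, 7, 19, 15, 0, tzinfo=timezone.utc)
print(find_stuck_terminating(pods, now))  # → ['a8mwjq5z94p55fxhk9zl-n2-0-27']
```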

Screenshots

Pod list:

vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pods -n dev | grep a8mwjq5z94p55fxhk9zl
a8mwjq5z94p55fxhk9zl-n2-0-0              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-1              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-11             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-13             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-19             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-2              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-24             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-27             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-3              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-30             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-31             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-34             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-36             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-37             0/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-4              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-43             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-49             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-5              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-50             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-53             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-54             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-56             1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-57             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-6              0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-7              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-72             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-76             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-8              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-82             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-86             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-87             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-9              1/1     Terminating        0          9h
a8mwjq5z94p55fxhk9zl-n2-0-91             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-92             0/1     Completed          0          9h
a8mwjq5z94p55fxhk9zl-n2-0-94             0/1     Completed          0          9h

‘Terminating’ pod information:

vijay.jaishankervijay@MacBook-Pro ~ % kubectl get pod a8mwjq5z94p55fxhk9zl-n2-0-27 -n dev -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    flyte.lyft.com/deployment: flyte-l5
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2022-07-19T05:47:08Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2022-07-19T05:59:45Z"
  finalizers:
  - flyte/array
  labels:
    domain: dev
    execution-id: a8mwjq5z94p55fxhk9zl
    interruptible: "false"
    manager: avora
    node-id: n2
    owner-email: mtoledo
    owner-name: mtoledo
    platform: flyte
    project: avdelorean
    shard-key: "21"
    task-name: src-backend-delorean-delorean-map-base-mapper-run-map-task-0
    team: compute-infra
    workflow-name: src-planning-lib-prediction-metrics-prediction-metrics-processo
  name: a8mwjq5z94p55fxhk9zl-n2-0-27
  namespace: dev
  ownerReferences:
  - apiVersion: flyte.lyft.com/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: flyteworkflow
    name: a8mwjq5z94p55fxhk9zl
    uid: 0d0a8d7c-f935-437c-b339-d003c7643827
  resourceVersion: "9478565284"
  uid: 78bb2022-c7d6-4f47-9832-d12656cbdb2c
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: l5.lyft.com/pool
            operator: In
            values:
            - eks-pdx-pool-gpu
  containers:
  - args:
    - pyflyte-map-execute
    - --inputs
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/inputs.pb
    - --output-prefix
    - s3://lyft-av-prod-pdx-flyte/metadata/propeller/production/avdelorean-dev-a8mwjq5z94p55fxhk9zl/n2/data/0
    - --raw-output-data-prefix
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0
    - --checkpoint-path
    - s3://lyft-av-prod-pdx-flyte/raw_data/3r/a8mwjq5z94p55fxhk9zl-n2-0/27/0/_flytecheckpoints
    - --prev-checkpoint
    - '""'
    - --resolver
    - flytekit.core.python_auto_container.default_task_resolver
    - --
    - task-module
    - src.backend.delorean.delorean_map_base
    - task-name
    - run_map_task
    env:
    - name: FLYTE_INTERNAL_EXECUTION_WORKFLOW
      value: avdelorean:dev:src.planning.lib.prediction.metrics.prediction_metrics_processor_wfe_map_class.PredictionMetricsProcessorMapWorkflowPerfTest
    - name: FLYTE_INTERNAL_EXECUTION_ID
      value: a8mwjq5z94p55fxhk9zl
    - name: FLYTE_INTERNAL_EXECUTION_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_EXECUTION_DOMAIN
      value: dev
    - name: FLYTE_ATTEMPT_NUMBER
      value: "0"
    - name: FLYTE_INTERNAL_TASK_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_TASK_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_TASK_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_TASK_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: FLYTE_INTERNAL_PROJECT
      value: avdelorean
    - name: FLYTE_INTERNAL_DOMAIN
      value: dev
    - name: FLYTE_INTERNAL_NAME
      value: src.backend.delorean.delorean_map_base.mapper_run_map_task_0
    - name: FLYTE_INTERNAL_VERSION
      value: b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    - name: KUBERNETES_REQUEST_TIMEOUT
      value: "100000"
    - name: L5_BASE_DOMAIN
      value: l5.woven-planet.tech
    - name: AWS_METADATA_SERVICE_NUM_ATTEMPTS
      value: "20"
    - name: AWS_METADATA_SERVICE_TIMEOUT
      value: "5"
    - name: FLYTE_STATSD_HOST
      value: flyte-telegraf.infrastructure
    - name: KUBERNETES_CLUSTER_NAME
      value: pdx
    - name: FLYTE_K8S_ARRAY_INDEX
      value: "27"
    - name: BATCH_JOB_ARRAY_INDEX_VAR_NAME
      value: FLYTE_K8S_ARRAY_INDEX
    - name: L5_DATACENTER
      value: pdx
    - name: L5_ENVIRONMENT
      value: pdx
    - name: RUNTIME_POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: RUNTIME_POD_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.podIP
    - name: RUNTIME_NODE_IP
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: status.hostIP
    - name: L5_NAMESPACE
      value: dev
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imagePullPolicy: IfNotPresent
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    resources:
      limits:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
      requests:
        cpu: "4"
        memory: 56Gi
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-c459j
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-162-107-6.us-west-2.compute.internal
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Never
  schedulerName: flyte-scheduler
  securityContext:
    fsGroup: 65534
  serviceAccount: avdelorean-dev
  serviceAccountName: avdelorean-dev
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: lyft.com/gpu
    operator: Equal
    value: dedicated
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
  volumes:
  - name: kube-api-access-c459j
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:54:40Z"
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:37Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2022-07-19T05:51:35Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://e07e4f0cec8265cd15c57833782c305e5581c9d463a97af8307ce3c40bd2c324
    image: ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test:b4497e9ee9c7ab22671e035b9fba3a3ec2a06f7b
    imageID: docker-pullable://ephemeral-docker.pdx.l5.woven-planet.tech/application/workflows/avdelorean/prediction_metrics_processor_cloud_map_wfe_perf_test@sha256:172ecf248838b1ec88e520528f0125451043769fb31c26d0bfc55057c98afabf
    lastState: {}
    name: a8mwjq5z94p55fxhk9zl-n2-0-27
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-19T05:51:37Z"
  hostIP: 10.162.107.6
  phase: Running
  podIP: 10.162.72.241
  podIPs:
  - ip: 10.162.72.241
  qosClass: Guaranteed
  startTime: "2022-07-19T05:51:35Z" 

Are you sure this issue hasn’t been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 14 (7 by maintainers)

Top GitHub Comments

1 reaction
hamersaw commented on Jul 21, 2022

TL;DR: I think the behavior in the reproduction steps is explainable, but we may want to add a flag to make it work as intended. More information (Pod info from a SPOT instance) would help debug this.

OK, I’ve had some time to explore this in depth, a few things:

  • With this PR we updated map tasks to use the same k8s execution code as regular k8s tasks, so retries etc. should be handled identically. It’s not clear to me why this would work for regular tasks and not map tasks.
  • AFAIK there are no settings that let us stop a SPOT instance from deleting a Pod. We set the pod finalizer to stop k8s from garbage-collecting completed Pods before Flyte has had a chance to detect them. But per the documentation, SPOT instances will attempt to delete a Pod gracefully (via the delete API) and then wait at least 2 minutes before killing it, regardless of finalizers.
  • The reproduction steps described are explainable. We deploy a Pod on an on-demand instance (with the finalizer flyte/array set); when the autoscaler then tries to move it to a SPOT instance, it cannot delete the Pod because of the finalizer, so the Pod is stuck in the Terminating state. This is a corner case we should probably handle. I think we could add an option, something like permit-external-deletions, that allows non-terminal Pods to be deleted. There are other conceivable situations where this would be useful.
  • I also noticed the Pod in the description has false for the interruptible label. Is the task being mapped set up as interruptible? I’m not sure it’s related here, but marking a task interruptible is meant to inject the tolerations etc. that allow execution on SPOT instances.
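For anyone hitting this before a fix lands: the mechanism described above means a deleted pod will sit in Terminating until its flyte/array finalizer is cleared, e.g. with `kubectl patch pod <name> -n <ns> --type=json -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'`. The same logic, sketched in Python on a manifest-shaped dict (the pod name comes from the issue; the helper itself is illustrative, not part of Flyte):

```python
def remove_finalizer(pod: dict, finalizer: str) -> dict:
    """Drop one finalizer from a pod manifest (a dict shaped like the YAML above).

    Kubernetes only garbage-collects a pod whose deletionTimestamp is set
    once metadata.finalizers is empty, so clearing flyte/array is what
    finally lets a stuck pod go away.
    """
    meta = pod.setdefault("metadata", {})
    meta["finalizers"] = [f for f in meta.get("finalizers", []) if f != finalizer]
    return pod

pod = {"metadata": {"name": "a8mwjq5z94p55fxhk9zl-n2-0-27",
                    "deletionTimestamp": "2022-07-19T05:59:45Z",
                    "finalizers": ["flyte/array"]}}
remove_finalizer(pod, "flyte/array")
print(pod["metadata"]["finalizers"])  # → []
```

Note that force-removing a finalizer bypasses whatever cleanup it guards, so it is a last-resort unstick, not a substitute for the proposed permit-external-deletions option.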
0 reactions
hamersaw commented on Jul 28, 2022

@convexquad great to hear! Thanks for being so thorough and responsive, it really helps resolve issues like this quickly!
