
[dagster-k8s] Retried k8s job is reported as failure



A successful job is reported as a failure by dagster-k8s when it is retried because backoff_limit > 0:

dagster_k8s.client.DagsterK8sError: Encountered failed job pods for job dagster-job-4fa7db7aa93211d03f3eca0e2acff339 with status: {'active': 1,
'completion_time': None,
'conditions': None,
'failed': 1,
'start_time': datetime.datetime(2022, 1, 18, 15, 14, 30, tzinfo=tzlocal()),
'succeeded': None}, in namespace faculty-dagster
  File "/usr/local/lib/python3.7/site-packages/dagster_celery_k8s/", line 457, in _execute_step_k8s_job
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/", line 41, in wait_for_job_success
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/", line 278, in wait_for_job_success
    job_name=job_name, status=status, namespace=namespace

A failed pod does not imply a failed job. I believe this is a bug.


If you create a solid that fails on its first attempt and set backoff_limit to 1, you will find that the job succeeds but Dagster reports it as a failure.
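The correct check can be sketched as follows. This is an illustrative model, not dagster-k8s's actual wait_for_job_success code: the dict shape mirrors the V1JobStatus fields printed in the traceback above, and the hypothetical job_outcome helper only treats failed pods as fatal once the Job itself carries a Failed condition (i.e. the backoff limit is exhausted).

```python
# Sketch (hypothetical helper, not Dagster's implementation): decide the
# outcome of a Kubernetes Job from a V1JobStatus-like dict. With
# backoffLimit > 0, a nonzero `failed` count only means some pod attempts
# failed; the Job has failed only when a condition of type "Failed" is set,
# and has succeeded when `succeeded` is set or a "Complete" condition appears.

def job_outcome(status):
    """Return 'succeeded', 'failed', or 'running' for a Job status dict."""
    for cond in status.get("conditions") or []:
        if cond.get("status") == "True":
            if cond.get("type") == "Complete":
                return "succeeded"
            if cond.get("type") == "Failed":
                return "failed"
    if status.get("succeeded"):
        return "succeeded"
    # Pods may have failed, but while the Job is still retrying
    # (active > 0) it must not be reported as a failure.
    if status.get("active"):
        return "running"
    return "failed" if status.get("failed") else "running"

# The status from the traceback above: one failed attempt, one active retry.
print(job_outcome({"active": 1, "failed": 1,
                   "succeeded": None, "conditions": None}))  # -> running
```

The key design point is that per-pod counters are advisory; only the Job-level conditions (or a populated `succeeded` count) are authoritative, which is also how kubectl decides a Job's fate.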

I hit this issue on 0.12.5, but I believe it is still present in 0.13.14.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

gjeusel commented, Mar 21, 2022

I also believe backoff_limit should be used to retry in case of APIServer failures/timeouts, specifically when auto-scaling is triggered and the pod's status is marked as OutOfcpu, for example.

alex-treebeard commented, Mar 8, 2022

@johannkm Having tested this again, I can confirm that Dagster's RetryPolicy will only retry when business logic fails, whereas backoff_limit lets us retry on APIServer failures/timeouts, so I would like to upstream this fix if possible.
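The distinction above can be modeled with a small sketch. This is a conceptual model only: PodEvicted, run_pod, and run_with_retries are hypothetical stand-ins, not Dagster or Kubernetes APIs. A RetryPolicy-style layer runs inside the pod and retries only application exceptions, while a backoff_limit-style layer replaces the whole pod, so only the latter can recover from infra-level failures such as an OutOfcpu eviction.

```python
# Conceptual model (not Dagster internals): two independent retry layers.
# The inner layer retries application exceptions raised by the step body;
# the outer layer retries the whole pod when the infrastructure kills it.

class PodEvicted(Exception):
    """Infra-level failure: the step body never got a chance to finish."""

def run_pod(step, policy_retries):
    """Inner, RetryPolicy-like layer: retries business-logic failures only."""
    attempts = 0
    while True:
        try:
            return step()
        except PodEvicted:
            raise  # infra failures bypass the business-logic policy
        except Exception:
            attempts += 1
            if attempts > policy_retries:
                raise

def run_with_retries(step, policy_retries, backoff_limit):
    """Outer, backoff_limit-like layer: re-runs the pod on infra failures."""
    pod_attempts = 0
    while True:
        try:
            return run_pod(step, policy_retries)
        except PodEvicted:
            pod_attempts += 1
            if pod_attempts > backoff_limit:
                raise

# A step whose first pod is evicted (e.g. OutOfcpu) and whose retry succeeds.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] == 1:
        raise PodEvicted("OutOfcpu")
    return "ok"

print(run_with_retries(flaky_step, policy_retries=3, backoff_limit=1))  # -> ok
```

With backoff_limit=0 the same eviction would propagate regardless of how generous the business-logic retry count is, which matches the observation that the two mechanisms cover disjoint failure modes.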


Top Results From Across the Web

Handling retriable and non-retriable pod failures with Pod ...
The definition of Pod failure policy may help you to: better utilize the computational resources by avoiding unnecessary Pod retries; avoid Job ...
Source code for dagster_k8s.executor - Dagster Docs
_core.execution.retries import RetryMode, get_retries_config from dagster. ... Configuration set on the Kubernetes Jobs and Pods created by the ...
Kubernetes Jobs | Use Cases, Scheduling, and Failure
Learn more about Kubernetes best practices and Job use cases. This article will even teach you how to create Kubernetes Jobs and how to ...
How to determine if a job is failed - kubernetes - Stack Overflow
backoffLimit (default value 6), which says: "Specifies the number of retries before marking this job failed." Now in JobStatus there are two ...
Gitlab k8s runner changed the status from ...
Summary: In the earliest versions of GitLab, we had retry logic based on job failure status (runner_system_failure) if the job executor (k8s ...
