
[dagster-k8s] Retried k8s job is reported as failure



A successful job is reported as a failure by dagster-k8s when it is retried because backoff_limit > 0:

dagster_k8s.client.DagsterK8sError: Encountered failed job pods for job dagster-job-4fa7db7aa93211d03f3eca0e2acff339 with status: {'active': 1,
'completion_time': None,
'conditions': None,
'failed': 1,
'start_time': datetime.datetime(2022, 1, 18, 15, 14, 30, tzinfo=tzlocal()),
'succeeded': None}, in namespace faculty-dagster
  File "/usr/local/lib/python3.7/site-packages/dagster_celery_k8s/", line 457, in _execute_step_k8s_job
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/", line 41, in wait_for_job_success
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/", line 278, in wait_for_job_success
    job_name=job_name, status=status, namespace=namespace

A failed pod does not imply a failed job. I believe this is a bug.


If you create a solid that fails on its first attempt and set backoff_limit to 1, you will find that the job succeeds but Dagster reports it as a failure.
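The correct check can be sketched as follows. This is an illustrative model, not dagster-k8s's actual wait_for_job_success code: the dict shape mirrors the V1JobStatus fields printed in the traceback above, and the hypothetical job_outcome helper only treats failed pods as fatal once the Job itself carries a Failed condition (i.e. the backoff limit is exhausted).

```python
# Sketch (hypothetical helper, not Dagster's implementation): decide the
# outcome of a Kubernetes Job from a V1JobStatus-like dict. With
# backoffLimit > 0, a nonzero `failed` count only means some pod attempts
# failed; the Job has failed only when a condition of type "Failed" is set,
# and has succeeded when `succeeded` is set or a "Complete" condition appears.

def job_outcome(status):
    """Return 'succeeded', 'failed', or 'running' for a Job status dict."""
    for cond in status.get("conditions") or []:
        if cond.get("status") == "True":
            if cond.get("type") == "Complete":
                return "succeeded"
            if cond.get("type") == "Failed":
                return "failed"
    if status.get("succeeded"):
        return "succeeded"
    # Pods may have failed, but while the Job is still retrying
    # (active > 0) it must not be reported as a failure.
    if status.get("active"):
        return "running"
    return "failed" if status.get("failed") else "running"

# The status from the traceback above: one failed attempt, one active retry.
print(job_outcome({"active": 1, "failed": 1,
                   "succeeded": None, "conditions": None}))  # -> running
```

The key design point is that per-pod counters are advisory; only the Job-level conditions (or a populated `succeeded` count) are authoritative, which is also how kubectl decides a Job's fate.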

I hit this issue on 0.12.5, but I believe it is still present in 0.13.14.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

gjeusel commented, Mar 21, 2022

I also believe backoff_limit should be used to retry in case of APIServer failures/timeouts, specifically when auto-scaling is triggered and the pod's status is marked as OutOfcpu, for example.

alex-treebeard commented, Mar 8, 2022

@johannkm Having tested this again, I can confirm that Dagster's RetryPolicy will only retry when business logic fails, whereas backoff_limit lets us retry on APIServer failures/timeouts, so I would like to upstream this fix if possible.
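The distinction above can be modeled with a small sketch. This is a conceptual model only: PodEvicted, run_pod, and run_with_retries are hypothetical stand-ins, not Dagster or Kubernetes APIs. A RetryPolicy-style layer runs inside the pod and retries only application exceptions, while a backoff_limit-style layer replaces the whole pod, so only the latter can recover from infra-level failures such as an OutOfcpu eviction.

```python
# Conceptual model (not Dagster internals): two independent retry layers.
# The inner layer retries application exceptions raised by the step body;
# the outer layer retries the whole pod when the infrastructure kills it.

class PodEvicted(Exception):
    """Infra-level failure: the step body never got a chance to finish."""

def run_pod(step, policy_retries):
    """Inner, RetryPolicy-like layer: retries business-logic failures only."""
    attempts = 0
    while True:
        try:
            return step()
        except PodEvicted:
            raise  # infra failures bypass the business-logic policy
        except Exception:
            attempts += 1
            if attempts > policy_retries:
                raise

def run_with_retries(step, policy_retries, backoff_limit):
    """Outer, backoff_limit-like layer: re-runs the pod on infra failures."""
    pod_attempts = 0
    while True:
        try:
            return run_pod(step, policy_retries)
        except PodEvicted:
            pod_attempts += 1
            if pod_attempts > backoff_limit:
                raise

# A step whose first pod is evicted (e.g. OutOfcpu) and whose retry succeeds.
calls = {"n": 0}
def flaky_step():
    calls["n"] += 1
    if calls["n"] == 1:
        raise PodEvicted("OutOfcpu")
    return "ok"

print(run_with_retries(flaky_step, policy_retries=3, backoff_limit=1))  # -> ok
```

With backoff_limit=0 the same eviction would propagate regardless of how generous the business-logic retry count is, which matches the observation that the two mechanisms cover disjoint failure modes.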


Top Results From Across the Web

Handling retriable and non-retriable pod failures with Pod ...
The definition of Pod failure policy may help you to: better utilize the computational resources by avoiding unnecessary Pod retries; avoid Job ...
Source code for dagster_k8s.executor - Dagster Docs
_core.execution.retries import RetryMode, get_retries_config from dagster. ... Configuration set on the Kubernetes Jobs and Pods created by the ...
Kubernetes Jobs | Use Cases, Scheduling, and Failure
Learn more about Kubernetes best practices and Job use cases. This article will even teach you how to create Kubernetes Jobs and how to ...
How to determine if a job is failed - kubernetes - Stack Overflow
backoffLimit (default value 6), which says: "Specifies the number of retries before marking this job failed." Now in JobStatus there are two ...
Gitlab k8s runner changed the status from ...
Summary: In the earliest versions of GitLab, we had retry logic based on job failure status (runner_system_failure) if the job executor (k8s ...
