[dagster-k8s] Retried k8s job is reported as failure
Summary
A successful job is reported as a failure by dagster-k8s when it is retried because backoff_limit > 0.
dagster_k8s.client.DagsterK8sError: Encountered failed job pods for job dagster-job-4fa7db7aa93211d03f3eca0e2acff339 with status: {'active': 1,
'completion_time': None,
'conditions': None,
'failed': 1,
'start_time': datetime.datetime(2022, 1, 18, 15, 14, 30, tzinfo=tzlocal()),
'succeeded': None}, in namespace faculty-dagster
File "/usr/local/lib/python3.7/site-packages/dagster_celery_k8s/executor.py", line 457, in _execute_step_k8s_job
wait_timeout=job_wait_timeout,
File "/usr/local/lib/python3.7/site-packages/dagster_k8s/utils.py", line 41, in wait_for_job_success
num_pods_to_wait_for,
File "/usr/local/lib/python3.7/site-packages/dagster_k8s/client.py", line 278, in wait_for_job_success
job_name=job_name, status=status, namespace=namespace
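A failed pod does not imply a failed job: with backoff_limit > 0, Kubernetes starts a replacement pod, which is why the status above shows 'active': 1 alongside 'failed': 1. I believe this is a bug.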
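To illustrate the distinction, here is a minimal sketch of a check that looks at the Job's conditions rather than its failed pod count. It is not the dagster-k8s implementation; the function name job_state is hypothetical, and the status is modelled as a plain dict with the same keys as in the error message above (the Kubernetes Python client returns objects with attributes instead).

```python
from typing import Any, Dict, List


def job_state(status: Dict[str, Any]) -> str:
    """Classify a Job as 'failed', 'succeeded', or 'running'.

    Assumption: `status` is a dict mirroring the Job status fields shown in
    the error message, with `conditions` as a list of dicts (or None).
    """
    conditions: List[Dict[str, str]] = status.get("conditions") or []
    for cond in conditions:
        # Only conditions whose status is "True" are currently in effect.
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Failed":
            return "failed"
        if cond.get("type") == "Complete":
            return "succeeded"
    # No terminal condition yet: a failed pod count alone is not a failure.
    return "running"


# Status from the error message: one failed pod, one retry still active.
retrying = {"active": 1, "failed": 1, "succeeded": None, "conditions": None}
print(job_state(retrying))
```

With this approach the status from the traceback is treated as still running, and the job is only reported as failed once Kubernetes itself sets a Failed condition, for example after backoff_limit is exhausted.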
Reproduction
If you create a solid that fails on its first attempt and set backoff_limit to 1, the job succeeds but Dagster reports it as a failure.
I have this issue in 0.12.5, but believe that it is still present in 0.13.14.
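One way to get a "fails on first attempt" body is a marker file, sketched below in plain Python. The helper name fail_on_first_attempt and the marker path are hypothetical; wrapping it in a solid is not shown. It also assumes the marker lives on storage that survives pod replacement (for example a shared volume), since a retried pod starts with a fresh filesystem.

```python
import os
import tempfile


def fail_on_first_attempt(marker_path: str) -> str:
    """Raise on the first call for a given marker path, succeed afterwards."""
    if not os.path.exists(marker_path):
        # Record the attempt before failing so the retry can see it.
        with open(marker_path, "w") as f:
            f.write("attempted")
        raise RuntimeError("simulated first-attempt failure")
    return "ok"


if __name__ == "__main__":
    # Local demonstration using a temporary directory in place of a shared volume.
    marker = os.path.join(tempfile.mkdtemp(), "attempted")
    try:
        fail_on_first_attempt(marker)
    except RuntimeError as exc:
        print(f"attempt 1: {exc}")
    print(f"attempt 2: {fail_on_first_attempt(marker)}")
```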
Issue Analytics
- State:
- Created 2 years ago
- Reactions:4
- Comments:6 (5 by maintainers)
I also believe backoff_limit should be used to retry in case of APIServer failures or timeouts, specifically when auto-scaling is triggered and the status of the pod is marked as OutOfcpu, for example.
@johannkm Having tested this again, I can confirm that the Dagster RetryPolicy only retries when business logic fails, whereas backoff_limit allows us to retry APIServer failures and timeouts, so I would like to upstream this fix if possible.