[feature] allow jobs to fail
Feature Area
/area backend
What feature would you like to see?
Allow failure of jobs: if an operation fails, do not fail the entire pipeline. Allow the pipeline to continue to the next stage, which may then fail if its prerequisites are missing.
What is the use case or pain point?
In machine learning pipelines, it is fairly common to run multiple models, or different configurations of the same model, possibly on a subset of the training data. After these are trained, they are usually compared using some metric, the best model is chosen, and that model is then trained on the entire training data to produce the final trained model.
If someone uses kfp.dsl.ParallelFor to run the different models, a failure in one of them causes the entire pipeline to fail, and the successful training of the other models may be lost. But if the next stage, the one that compares models by the metric, supports comparison of only the available (i.e. successful) models, the pipeline failure costs the time spent training those models, as one has to restart. If we support the requested feature, the failed operations would display a warning (maybe ⚠️) and the pipeline would go on to the final training step. Then, depending on whether that step supports comparison of a subset of all models, it would proceed as if the failed models were not there; if not, it would fail there.
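For illustration, a minimal sketch of the pattern described above, assuming the KFP v2 SDK and lightweight Python components (the train_model component, its body, and the pipeline/parameter names are assumptions for the example, not part of the original report):

```python
from kfp import dsl


@dsl.component
def train_model(model_name: str) -> float:
    """Hypothetical component: trains one model variant and returns its metric."""
    # Real training code would go here; any exception raised in this body
    # marks the task as failed.
    return 0.0


@dsl.pipeline(name="multi-model-training")
def training_pipeline(model_names: list = ["model-a", "model-b", "model-c"]):
    # Fan out: train each candidate model in parallel.
    with dsl.ParallelFor(items=model_names) as model_name:
        train_model(model_name=model_name)
    # The metric-comparison / final-training step would follow here. Today a
    # single failing train_model task fails the whole pipeline, so that step
    # never runs even when the remaining branches trained successfully.
```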
Very similar functionality is available in a few CI tools. For example, GitLab CI has allow_failure, Travis CI has allow_failures, etc.
Is there a workaround currently?
It is possible to do very broad top-level exception handling to suppress failures. However, in this way the fact that something failed is hidden in the logs and not displayed in the pipeline dashboard. In scheduled pipelines, where no one really goes through the logs of all “successful” runs, these failures will go unnoticed.
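A sketch of that workaround, assuming a lightweight Python component whose body and "ok"/"failed" status convention are made up for the example:

```python
from kfp import dsl


@dsl.component
def train_model_tolerant(model_name: str) -> str:
    """Hypothetical component that swallows its own failures."""
    try:
        # Real training code would go here.
        return "ok"
    except Exception as exc:  # broad top-level exception handling
        # The task (and therefore the pipeline) still shows as successful;
        # the failure is only visible to someone who reads this log line.
        print(f"Training {model_name} failed: {exc}")
        return "failed"
```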
Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.
Top GitHub Comments
Exit handler ops cannot explicitly depend on any previous operations, so they cannot be parameterized by outputs of previous operations or be guaranteed to run after previous steps.
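To illustrate that limitation, a minimal sketch using dsl.ExitHandler, again assuming the KFP v2 SDK (the component names and bodies are placeholders):

```python
from kfp import dsl


@dsl.component
def train_model(model_name: str) -> float:
    """Hypothetical training component."""
    return 0.0


@dsl.component
def notify(message: str):
    """Hypothetical exit op that runs whether the group succeeds or fails."""
    print(message)


@dsl.pipeline(name="exit-handler-limitation")
def pipeline():
    exit_task = notify(message="Pipeline group finished.")
    with dsl.ExitHandler(exit_task):
        train_task = train_model(model_name="model-a")
    # notify() cannot consume train_task.output: the exit handler is not an
    # explicit downstream dependency of the tasks inside the group, so it
    # cannot be parameterized by their outputs.
```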
My use case is having integration tests that are themselves Kubeflow pipelines, and I would like to be able to verify that a task fails without the integration test failing. Configuring that in the DSL would be a lot cleaner than handling it in application logic or directly in CI/CD.
@yarnabrina, @chensun I’ve created a pull request implementing this behaviour - I would really appreciate your feedback on it: https://github.com/kubeflow/pipelines/pull/7373.