[Feature] Modify Job provider to support any kind of Kubernetes CRDs
See original GitHub issue

/kind feature
After migrating to the new Trial Template design (https://github.com/kubeflow/katib/pull/1202), we want to extend the current Job provider to support any kind of Kubernetes CRD that follows the Trial job pattern (e.g. Argo templates: https://github.com/kubeflow/katib/issues/1081). Currently, the Job provider supports only batch Jobs and Kubeflow Jobs.
We can extend the Trial Template API with custom settings to define:
- How to get the succeeded state of the Job.
- How to properly mutate the metrics collector on the training pod.
- How to specify the `sidecar.istio.io/inject: false` annotation (this can be done by the user in advance).
We may need to define more settings; this needs to be investigated. Let's discuss all required changes in this issue.
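To make the settings above concrete, here is a rough Go sketch of what such Trial Template fields might look like. The type and field names are illustrative only and not necessarily the API that was later adopted upstream:

```go
// Package v1beta1 here is only a placeholder; the type and field names
// below are illustrative, not the final Katib API.
package v1beta1

// CustomTrialSettings sketches the extra Trial Template settings listed
// above for running an arbitrary Kubernetes CRD as a Trial job.
type CustomTrialSettings struct {
	// PrimaryContainerName marks the training container inside the Trial
	// pod, so the metrics collector is mutated onto the right container.
	PrimaryContainerName string `json:"primaryContainerName,omitempty"`

	// PrimaryPodLabels selects the pod(s) that run the training workload
	// when the custom resource creates several pods (e.g. an Argo DAG).
	PrimaryPodLabels map[string]string `json:"primaryPodLabels,omitempty"`

	// SuccessCondition and FailureCondition describe how to read the
	// succeeded/failed state from the custom resource's status.
	SuccessCondition string `json:"successCondition,omitempty"`
	FailureCondition string `json:"failureCondition,omitempty"`
}
```

A success condition expressed against the resource's status would cover the first point, while the primary pod/container settings would cover metrics-collector mutation without hard-coding knowledge of each CRD.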
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I would like to contribute a bit to point 2 you mentioned @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 `Provider` interface and discovered some issues which are also relevant to the new proposed API (great work btw, looks very exciting).

When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag), each pod is started with two containers: `wait` and `main`. The `wait` container is backed by the `argoexec` image and can be considered a sidecar to the `main` container, which is the "training" container. In the current implementation we can mark a single container in a pod as the "training" container using the `IsTrainingContainer` func on the `Provider` interface. This leads to an issue in `pns.go`, where the PID of the `wait`/`argoexec` container is watched for completion. However, since this container is never marked to be watched for completion in `inject_webhook` (`getMarkCompletedCommand`), the watch loop crashes when trying to read the `/var/log/katib/<PID>.pid` file.

I currently added a simple addition to the filter statement in `pns.go` (`pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec"`) as a temporary solution. However, it is clear that we should introduce a mechanism to account for this; I have thought of a few potential solutions.
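To make that workaround concrete, here is a minimal sketch of such a filter, assuming a `github.com/mitchellh/go-ps` style process API (`Pid`, `PPid`, `Executable`) as suggested by the expression above; the function name and surrounding structure are illustrative and not the actual `pns.go` code:

```go
package collector

import (
	"os"

	ps "github.com/mitchellh/go-ps"
)

// trainingPIDs sketches the filter described above: it collects the
// top-level process IDs the metrics collector should watch, skipping
// PID 1, the collector itself, processes that are not top-level
// (PPid != 0) and, as the temporary workaround, the Argo "argoexec"
// wait sidecar.
func trainingPIDs() (map[int]bool, error) {
	thisPID := os.Getpid()
	procs, err := ps.Processes()
	if err != nil {
		return nil, err
	}
	pids := make(map[int]bool)
	for _, proc := range procs {
		pid := proc.Pid()
		if pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec" {
			continue
		}
		pids[pid] = true
	}
	return pids, nil
}
```

The key point is that the Argo `wait` sidecar is excluded before the collector starts watching PIDs; a configurable list of excluded executables would generalize this beyond Argo.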
Furthermore, since we now want to be able to watch for multiple PIDs to run to completion in `pns.go`, we want to set `waitAll` to true. However, there is a bug in the loop which watches for completion of all PIDs. I can make a PR for a fix, since I already had to fix it for my own Argo implementation as well. The bug can be seen by comparing `pns.py` and `pns.go`: in the Go version, once a single PID has completed, the error condition at the bottom is reached instead of also watching for completion of the other containers. The solution would be to add an else statement, plus a statement at the beginning of the loop to filter out PIDs which are already completed.
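As a rough illustration of that fix (not the original snippets), here is a sketch of what a loop that waits for all PIDs could look like, under the same `/var/log/katib/<PID>.pid` marker convention mentioned above; the structure and names (`waitAllPIDs`, `markerDir`) are illustrative, not Katib's actual implementation:

```go
package collector

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// waitAllPIDs is an illustrative sketch of a loop that waits for every
// watched process to finish. It polls /proc/<pid> and, once a process has
// exited, checks its completion marker under markerDir (following the
// /var/log/katib/<PID>.pid convention mentioned above).
func waitAllPIDs(pids []int, markerDir string, poll time.Duration) error {
	completed := make(map[int]bool, len(pids))
	for len(completed) < len(pids) {
		for _, pid := range pids {
			// Filter out PIDs which are already completed, so a finished
			// process is not re-checked on later iterations.
			if completed[pid] {
				continue
			}
			// While /proc/<pid> exists the process is still running.
			if _, err := os.Stat(fmt.Sprintf("/proc/%d", pid)); err == nil {
				continue
			}
			// The process has exited; read its completion marker.
			data, err := os.ReadFile(fmt.Sprintf("%s/%d.pid", markerDir, pid))
			if err != nil {
				return fmt.Errorf("reading marker for pid %d: %v", pid, err)
			}
			if strings.TrimSpace(string(data)) != "completed" {
				return fmt.Errorf("process %d exited without completing", pid)
			} else {
				// The "else statement": record the completion and keep
				// watching the remaining PIDs instead of stopping here.
				completed[pid] = true
			}
		}
		time.Sleep(poll)
	}
	return nil
}
```

The `completed` map plays the role of the filter mentioned above: already-finished PIDs are skipped at the top of the loop, and a newly completed PID is recorded in the else branch instead of terminating the loop.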
If you have any questions or if I am being unclear, please let me know.
This feature is implemented upstream.