
[Feature] Modify Job provider to support any kind of Kubernetes CRDs


/kind feature

After migrating to the new Trial Template design (https://github.com/kubeflow/katib/pull/1202), we want to extend the current Job provider to support any kind of Kubernetes CRD that follows the Trial job pattern (e.g. the Argo template: https://github.com/kubeflow/katib/issues/1081).

Currently, the Job provider supports only batch Jobs and Kubeflow Jobs.

We can extend the Trial Template API with custom settings that define:

  1. How to get the succeeded state of the Job.
  2. How to properly mutate the metrics collector onto the training pod.
  3. How to specify the sidecar.istio.io/inject: false annotation (this can be done by the user in advance).

Maybe we need to define something more; this needs to be investigated.
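
To make the discussion concrete, here is a rough sketch of how such settings could look as Go API types on the Trial Template; all field names below (SuccessCondition, PrimaryContainerName, PrimaryPodLabels) are illustrative placeholders for this discussion, not a proposal of final names:

// Illustrative sketch only; field names are placeholders for discussion.
package v1beta1

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// TrialTemplate holds the Trial job manifest plus the custom settings
// listed above.
type TrialTemplate struct {
    // The raw Trial job manifest; can be any Kubernetes CRD.
    TrialSpec *unstructured.Unstructured `json:"trialSpec,omitempty"`

    // 1. How to detect that the Trial job succeeded, e.g. an expression
    //    evaluated against the custom resource's status.
    SuccessCondition string `json:"successCondition,omitempty"`

    // 2. The container the metrics collector must be mutated onto.
    PrimaryContainerName string `json:"primaryContainerName,omitempty"`

    // 3. Labels identifying the Trial's primary pod(s), where annotations
    //    such as sidecar.istio.io/inject: "false" must be set.
    PrimaryPodLabels map[string]string `json:"primaryPodLabels,omitempty"`
}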

Let’s discuss all required changes in this issue.

/cc @gaocegege @johnugeorge @czheng94 @nielsmeima

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
nielsmeima commented, Jun 10, 2020

I would like to contribute a bit to point 2 you mentioned, @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues that are also relevant to the newly proposed API (great work btw, looks very exciting).

When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag), each pod is started with two containers: wait and main. The wait container is backed by the argoexec image and can be considered a sidecar to the main container, which is the “training” container. In the current implementation we can mark a single container in a pod as the “training” container using the IsTrainingContainer func on the Provider interface. This leads to an issue in pns.go, where the PID of the wait/argoexec container is watched for completion. However, since this container never gets the mark-completed command injected by inject_webhook (getMarkCompletedCommand), the watch loop crashes when it tries to read the /var/log/katib/<PID>.pid file.
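
For context, an Argo provider along those lines might look roughly like the sketch below; the exact v1alpha3 Provider interface signature is paraphrased here, so treat it as an illustration rather than the real katib code:

package argo

import (
    corev1 "k8s.io/api/core/v1"
)

// ArgoWorkflow is a sketch of a provider for the Argo Workflow CRD.
type ArgoWorkflow struct{}

// IsTrainingContainer marks only Argo's "main" container as the training
// container; the "wait" (argoexec) container is an Argo-internal sidecar
// and therefore never gets the mark-completed command injected.
func (p ArgoWorkflow) IsTrainingContainer(index int, c corev1.Container) bool {
    return c.Name == "main"
}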

As a temporary solution, I added a simple clause to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec"). However, it is clear that we should introduce a proper mechanism to account for this.
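
In code, the temporary workaround looks roughly like the following sketch; the surrounding PID-gathering loop in pns.go is paraphrased (assuming the github.com/mitchellh/go-ps process listing that the Pid/PPid/Executable calls come from), and only the proc.Executable() == "argoexec" clause is the actual addition:

package pns

import (
    "os"

    ps "github.com/mitchellh/go-ps"
)

// watchablePIDs gathers the PIDs the metrics collector should watch,
// skipping init, itself, non-top-level processes and, as a temporary
// workaround, the Argo "wait" container's argoexec process.
func watchablePIDs() ([]int, error) {
    thisPID := os.Getpid()
    procs, err := ps.Processes()
    if err != nil {
        return nil, err
    }
    var pids []int
    for _, proc := range procs {
        pid := proc.Pid()
        if pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec" {
            continue
        }
        pids = append(pids, pid)
    }
    return pids, nil
}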

I have thought of the following potential solutions:

  1. Simply wrap all processes, as we currently do with containers marked as training containers. This approach would be compatible with other tools that are integrating too: they might control their own sidecars. We simply watch until every sidecar completes before deleting the CRD in question.
  2. Filter out all containers except the training container (similar to my temporary solution above, but more general). This approach is not preferred because the status of other (potentially important) sidecars is ignored.
  3. Introduce a field in the new Trial spec API that allows specifying a list of containers which must be watched for completion (sketched below).
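
For option 3, a minimal sketch of such a field, with a purely hypothetical name, could be:

package v1beta1

// Hypothetical field for discussion only.
type TrialSpec struct {
    // CompletionContainers lists the containers whose completion must be
    // watched before the Trial is considered finished.
    CompletionContainers []string `json:"completionContainers,omitempty"`
}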

Furthermore, since we now want to watch multiple PIDs run to completion in pns.go, we need to set waitAll to true. However, there is a bug in the loop that watches for completion of all PIDs. I can open a PR with a fix, since I already had to fix it for my own Argo implementation. The bug can be seen by comparing pns.py and pns.go: in the Go version, once a single PID has completed, the error condition at the bottom is reached instead of continuing to watch the other containers for completion. The solution would be to add an else statement:

// All watched PIDs have finished.
if len(finishedPids) == len(pids) {
    return nil
} else {
    // Keep waiting for the remaining PIDs.
    break
}

and a check at the beginning of the loop to skip PIDs that have already completed:

// Skip PIDs that have already been recorded as finished.
skip := false
for _, finishedPid := range finishedPids {
    if finishedPid == pid {
        skip = true
        break
    }
}

if skip {
    continue
}
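
Putting both fragments together, the corrected waitAll branch could look roughly like the sketch below. How a finished PID is detected, and the /var/log/katib/<PID>.pid marker handling where the error condition lives, are simplified and paraphrased; the point here is only the control flow:

package pns

import (
    "fmt"
    "os"
    "time"
)

// waitAllPIDs is a simplified sketch of the corrected watch loop: it keeps
// polling until every watched PID has finished, instead of erroring out as
// soon as the first one does.
func waitAllPIDs(pids []int) error {
    finishedPids := []int{}
    for {
        for _, pid := range pids {
            // Skip PIDs that have already been recorded as finished.
            skip := false
            for _, finishedPid := range finishedPids {
                if finishedPid == pid {
                    skip = true
                    break
                }
            }
            if skip {
                continue
            }

            // Treat a PID as finished once its /proc entry disappears.
            // (The real pns.go additionally reads the
            // /var/log/katib/<PID>.pid marker file at this point.)
            if _, err := os.Stat(fmt.Sprintf("/proc/%d", pid)); err == nil {
                continue
            } else if !os.IsNotExist(err) {
                return err
            }
            finishedPids = append(finishedPids, pid)

            if len(finishedPids) == len(pids) {
                // Every watched process has completed.
                return nil
            } else {
                // Keep waiting for the remaining PIDs instead of
                // falling through to the error path.
                break
            }
        }
        time.Sleep(time.Second)
    }
}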

If you have any questions or if I am being unclear, please let me know.

0 reactions
andreyvelich commented, Oct 16, 2020

This feature is now implemented upstream.
