
[Feature] Modify Job provider to support any kind of Kubernetes CRDs


/kind feature

After migrating to the new Trial Template design (https://github.com/kubeflow/katib/pull/1202), we want to extend the current Job provider to support any kind of Kubernetes CRD that follows the Trial job pattern (e.g. the Argo template: https://github.com/kubeflow/katib/issues/1081).

Currently, the Job provider supports only batch Jobs and Kubeflow Jobs.

We can extend the Trial Template API with custom settings that define:

  1. How to get the succeeded state of the Job.
  2. How to properly mutate the metrics collector onto the training pod.
  3. How to specify the sidecar.istio.io/inject: false annotation (this can be done by the user in advance).

Maybe we need to define something more; this needs to be investigated.
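
To make the discussion concrete, here is a rough sketch of how such settings could look as Go API types on the Trial Template; all field names below (SuccessCondition, PrimaryContainerName, PrimaryPodLabels) are illustrative placeholders for this discussion, not a proposal of final names:

// Illustrative sketch only; field names are placeholders for discussion.
package v1beta1

import (
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// TrialTemplate holds the Trial job manifest plus the custom settings
// listed above.
type TrialTemplate struct {
    // The raw Trial job manifest; can be any Kubernetes CRD.
    TrialSpec *unstructured.Unstructured `json:"trialSpec,omitempty"`

    // 1. How to detect that the Trial job succeeded, e.g. an expression
    //    evaluated against the custom resource's status.
    SuccessCondition string `json:"successCondition,omitempty"`

    // 2. The container the metrics collector must be mutated onto.
    PrimaryContainerName string `json:"primaryContainerName,omitempty"`

    // 3. Labels identifying the Trial's primary pod(s), where annotations
    //    such as sidecar.istio.io/inject: "false" must be set.
    PrimaryPodLabels map[string]string `json:"primaryPodLabels,omitempty"`
}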

Let’s discuss all required changes in this issue.

/cc @gaocegege @johnugeorge @czheng94 @nielsmeima

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

1 reaction
nielsmeima commented, Jun 10, 2020

I would like to contribute a bit to point 2 you mentioned, @andreyvelich. I implemented support for the Argo Workflow CRD using the v1alpha3 Provider interface and discovered some issues that are also relevant to the newly proposed API (great work btw, looks very exciting).

When running an Argo Workflow with a DAG template (https://argoproj.github.io/docs/argo/examples/readme.html#dag), each pod is started with two containers: wait and main. The wait container is backed by the argoexec image and can be considered a sidecar to the main container, which is the “training” container. In the current implementation we can mark a single container in a pod as the “training” container using the IsTrainingContainer func on the Provider interface. This leads to an issue in pns.go, where the PID of the wait/argoexec container is watched for completion. However, since this container never gets the mark-completed command injected by inject_webhook (getMarkCompletedCommand), the watch loop crashes when it tries to read the /var/log/katib/<PID>.pid file.
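
For context, an Argo provider along those lines might look roughly like the sketch below; the exact v1alpha3 Provider interface signature is paraphrased here, so treat it as an illustration rather than the real katib code:

package argo

import (
    corev1 "k8s.io/api/core/v1"
)

// ArgoWorkflow is a sketch of a provider for the Argo Workflow CRD.
type ArgoWorkflow struct{}

// IsTrainingContainer marks only Argo's "main" container as the training
// container; the "wait" (argoexec) container is an Argo-internal sidecar
// and therefore never gets the mark-completed command injected.
func (p ArgoWorkflow) IsTrainingContainer(index int, c corev1.Container) bool {
    return c.Name == "main"
}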

As a temporary solution, I added a simple clause to the filter statement in pns.go (pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec"). However, it is clear that we should introduce a proper mechanism to account for this.
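
In code, the temporary workaround looks roughly like the following sketch; the surrounding PID-gathering loop in pns.go is paraphrased (assuming the github.com/mitchellh/go-ps process listing that the Pid/PPid/Executable calls come from), and only the proc.Executable() == "argoexec" clause is the actual addition:

package pns

import (
    "os"

    ps "github.com/mitchellh/go-ps"
)

// watchablePIDs gathers the PIDs the metrics collector should watch,
// skipping init, itself, non-top-level processes and, as a temporary
// workaround, the Argo "wait" container's argoexec process.
func watchablePIDs() ([]int, error) {
    thisPID := os.Getpid()
    procs, err := ps.Processes()
    if err != nil {
        return nil, err
    }
    var pids []int
    for _, proc := range procs {
        pid := proc.Pid()
        if pid == 1 || pid == thisPID || proc.PPid() != 0 || proc.Executable() == "argoexec" {
            continue
        }
        pids = append(pids, pid)
    }
    return pids, nil
}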

I have thought of the following potential solutions:

  1. Simply wrap all processes, as we currently do with containers marked as training containers. This approach would be compatible with other tools that are integrating too: they might control their own sidecars. We simply watch until every sidecar completes before deleting the CRD in question.
  2. Filter out all containers except the training container (similar to my temporary solution above, but more general). This approach is not preferred because the status of other (potentially important) sidecars is ignored.
  3. Introduce a field in the new Trial spec API that allows specifying a list of containers which must be watched for completion (sketched below).
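
For option 3, a minimal sketch of such a field, with a purely hypothetical name, could be:

package v1beta1

// Hypothetical field for discussion only.
type TrialSpec struct {
    // CompletionContainers lists the containers whose completion must be
    // watched before the Trial is considered finished.
    CompletionContainers []string `json:"completionContainers,omitempty"`
}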

Furthermore, since we now want to watch multiple PIDs run to completion in pns.go, we need to set waitAll to true. However, there is a bug in the loop that watches for completion of all PIDs. I can open a PR with a fix, since I already had to fix it for my own Argo implementation. The bug can be seen by comparing pns.py and pns.go: in the Go version, once a single PID has completed, the error condition at the bottom is reached instead of continuing to watch the other containers for completion. The solution would be to add an else statement:

// All watched PIDs have finished.
if len(finishedPids) == len(pids) {
    return nil
} else {
    // Keep waiting for the remaining PIDs.
    break
}

and a check at the beginning of the loop to skip PIDs that have already completed:

// Skip PIDs that have already been recorded as finished.
skip := false
for _, finishedPid := range finishedPids {
    if finishedPid == pid {
        skip = true
        break
    }
}

if skip {
    continue
}
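
Putting both fragments together, the corrected waitAll branch could look roughly like the sketch below. How a finished PID is detected, and the /var/log/katib/<PID>.pid marker handling where the error condition lives, are simplified and paraphrased; the point here is only the control flow:

package pns

import (
    "fmt"
    "os"
    "time"
)

// waitAllPIDs is a simplified sketch of the corrected watch loop: it keeps
// polling until every watched PID has finished, instead of erroring out as
// soon as the first one does.
func waitAllPIDs(pids []int) error {
    finishedPids := []int{}
    for {
        for _, pid := range pids {
            // Skip PIDs that have already been recorded as finished.
            skip := false
            for _, finishedPid := range finishedPids {
                if finishedPid == pid {
                    skip = true
                    break
                }
            }
            if skip {
                continue
            }

            // Treat a PID as finished once its /proc entry disappears.
            // (The real pns.go additionally reads the
            // /var/log/katib/<PID>.pid marker file at this point.)
            if _, err := os.Stat(fmt.Sprintf("/proc/%d", pid)); err == nil {
                continue
            } else if !os.IsNotExist(err) {
                return err
            }
            finishedPids = append(finishedPids, pid)

            if len(finishedPids) == len(pids) {
                // Every watched process has completed.
                return nil
            } else {
                // Keep waiting for the remaining PIDs instead of
                // falling through to the error path.
                break
            }
        }
        time.Sleep(time.Second)
    }
}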

If you have any questions or if I am being unclear, please let me know.

0 reactions
andreyvelich commented, Oct 16, 2020

This feature is now implemented upstream.
