Katib experiments run indefinitely without completing a single trial

/kind bug

Hi, I’m setting up a Katib job through the Kale deployment panel after creating a Kale pipeline. The pipeline builds successfully, but the Katib experiments run forever and never complete a single trial.

I expected the Katib jobs to run to completion, but they never do.

Any suggestions on how to resolve this?

Environment:

  • Kubeflow version (kfctl version):
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 39 (15 by maintainers)

Top GitHub Comments

Dampolo03 commented, Apr 1, 2021 (1 reaction)

@andreyvelich I’ve been able to figure out what to do for my version of Katib: use goTemplate in my trialTemplate, with apiVersion: batch/v1 and kind: Job, for CatBoost and other sklearn models (unlike the recent version of Katib). I will close this issue now but may reopen it if another problem occurs with my Katib version.
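
For readers on a similar setup, the fix described above might look roughly like the following. This is a minimal sketch, assuming the v1alpha3 Experiment API that matches the katib-controller v0.8.0 image shown in this thread. The experiment name, container image, training script path, metric name, and hyperparameter are hypothetical placeholders, not values taken from the issue.

```yaml
# Sketch of a Katib v1alpha3 Experiment whose trialTemplate uses goTemplate
# to render a plain Kubernetes Job (apiVersion: batch/v1, kind: Job).
apiVersion: kubeflow.org/v1alpha3
kind: Experiment
metadata:
  name: catboost-example            # hypothetical name
  namespace: kubeflow
spec:
  parallelTrialCount: 2
  maxTrialCount: 6
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy   # hypothetical metric name
  algorithm:
    algorithmName: random
  parameters:
    - name: --learning-rate         # hypothetical hyperparameter flag
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.3"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: example.registry/catboost-train:latest  # hypothetical image
                command:
                - "python3"
                - "/opt/train.py"                              # hypothetical script
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
```

With the default StdOut metrics collector, the training script is expected to print the objective metric as a `name=value` line (for example `accuracy=0.91`). If that line never appears, trials may never report a result.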

Dampolo03 commented, Mar 29, 2021 (0 reactions)

@andreyvelich No, I stopped working with the Kale deployment panel after I reported the problem. I’ve been using YAML scripts in the Katib UI on my Kubeflow cluster instead. The command gave this:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"labels":{"app":"katib-controller","app.kubernetes.io/component":"katib","app.kubernetes.io/instance":"katib-controller-0.8.0","app.kubernetes.io/managed-by":"kfctl","app.kubernetes.io/name":"katib-controller","app.kubernetes.io/part-of":"kubeflow","app.kubernetes.io/version":"0.8.0"},"name":"katib-controller","namespace":"kubeflow"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"katib-controller","app.kubernetes.io/component":"katib","app.kubernetes.io/instance":"katib-controller-0.8.0","app.kubernetes.io/managed-by":"kfctl","app.kubernetes.io/name":"katib-controller","app.kubernetes.io/part-of":"kubeflow","app.kubernetes.io/version":"0.8.0"}},"template":{"metadata":{"annotations":{"prometheus.io/scrape":"true","sidecar.istio.io/inject":"false"},"labels":{"app":"katib-controller","app.kubernetes.io/component":"katib","app.kubernetes.io/instance":"katib-controller-0.8.0","app.kubernetes.io/managed-by":"kfctl","app.kubernetes.io/name":"katib-controller","app.kubernetes.io/part-of":"kubeflow","app.kubernetes.io/version":"0.8.0"}},"spec":{"containers":[{"args":["--webhook-port=8443"],"command":["./katib-controller"],"env":[{"name":"KATIB_CORE_NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"image":"gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0","imagePullPolicy":"IfNotPresent","name":"katib-controller","ports":[{"containerPort":8443,"name":"webhook","protocol":"TCP"},{"containerPort":8080,"name":"metrics","protocol":"TCP"}],"volumeMounts":[{"mountPath":"/tmp/cert","name":"cert","readOnly":true}]}],"serviceAccountName":"katib-controller","volumes":[{"name":"cert","secret":{"defaultMode":420,"secretName":"katib-controller"}}]}}}}
  creationTimestamp: "2021-03-17T08:56:35Z"
  generation: 1
  labels:
    app: katib-controller
    app.kubernetes.io/component: katib
    app.kubernetes.io/instance: katib-controller-0.8.0
    app.kubernetes.io/managed-by: kfctl
    app.kubernetes.io/name: katib-controller
    app.kubernetes.io/part-of: kubeflow
    app.kubernetes.io/version: 0.8.0
  managedFields:
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:labels:
          .: {}
          f:app: {}
          f:app.kubernetes.io/component: {}
          f:app.kubernetes.io/instance: {}
          f:app.kubernetes.io/managed-by: {}
          f:app.kubernetes.io/name: {}
          f:app.kubernetes.io/part-of: {}
          f:app.kubernetes.io/version: {}
      f:spec:
        f:progressDeadlineSeconds: {}
        f:replicas: {}
        f:revisionHistoryLimit: {}
        f:selector:
          f:matchLabels:
            .: {}
            f:app: {}
            f:app.kubernetes.io/component: {}
            f:app.kubernetes.io/instance: {}
            f:app.kubernetes.io/managed-by: {}
            f:app.kubernetes.io/name: {}
            f:app.kubernetes.io/part-of: {}
            f:app.kubernetes.io/version: {}
        f:strategy:
          f:rollingUpdate:
            .: {}
            f:maxSurge: {}
            f:maxUnavailable: {}
          f:type: {}
        f:template:
          f:metadata:
            f:annotations:
              .: {}
              f:prometheus.io/scrape: {}
              f:sidecar.istio.io/inject: {}
            f:labels:
              .: {}
              f:app: {}
              f:app.kubernetes.io/component: {}
              f:app.kubernetes.io/instance: {}
              f:app.kubernetes.io/managed-by: {}
              f:app.kubernetes.io/name: {}
              f:app.kubernetes.io/part-of: {}
              f:app.kubernetes.io/version: {}
          f:spec:
            f:containers:
              k:{"name":"katib-controller"}:
                .: {}
                f:args: {}
                f:command: {}
                f:env:
                  .: {}
                  k:{"name":"KATIB_CORE_NAMESPACE"}:
                    .: {}
                    f:name: {}
                    f:valueFrom:
                      .: {}
                      f:fieldRef:
                        .: {}
                        f:apiVersion: {}
                        f:fieldPath: {}
                f:image: {}
                f:imagePullPolicy: {}
                f:name: {}
                f:ports:
                  .: {}
                  k:{"containerPort":8080,"protocol":"TCP"}:
                    .: {}
                    f:containerPort: {}
                    f:name: {}
                    f:protocol: {}
                  k:{"containerPort":8443,"protocol":"TCP"}:
                    .: {}
                    f:containerPort: {}
                    f:name: {}
                    f:protocol: {}
                f:resources: {}
                f:terminationMessagePath: {}
                f:terminationMessagePolicy: {}
                f:volumeMounts:
                  .: {}
                  k:{"mountPath":"/tmp/cert"}:
                    .: {}
                    f:mountPath: {}
                    f:name: {}
                    f:readOnly: {}
            f:dnsPolicy: {}
            f:restartPolicy: {}
            f:schedulerName: {}
            f:securityContext: {}
            f:serviceAccount: {}
            f:serviceAccountName: {}
            f:terminationGracePeriodSeconds: {}
            f:volumes:
              .: {}
              k:{"name":"cert"}:
                .: {}
                f:name: {}
                f:secret:
                  .: {}
                  f:defaultMode: {}
                  f:secretName: {}
    manager: kfctl
    operation: Update
    time: "2021-03-17T08:56:35Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .: {}
          k:{"uid":"cbdfbf87-0cc8-481a-ba2d-b012e89ba9f1"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
    manager: manager
    operation: Update
    time: "2021-03-17T08:56:36Z"
  - apiVersion: apps/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:deployment.kubernetes.io/revision: {}
      f:status:
        f:availableReplicas: {}
        f:conditions:
          .: {}
          k:{"type":"Available"}:
            .: {}
            f:lastTransitionTime: {}
            f:lastUpdateTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Progressing"}:
            .: {}
            f:lastTransitionTime: {}
            f:lastUpdateTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:observedGeneration: {}
        f:readyReplicas: {}
        f:replicas: {}
        f:updatedReplicas: {}
    manager: kube-controller-manager
    operation: Update
    time: "2021-03-17T08:56:47Z"
  name: katib-controller
  namespace: kubeflow
  ownerReferences:
  - apiVersion: app.k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: false
    kind: Application
    name: katib-controller
    uid: cbdfbf87-0cc8-481a-ba2d-b012e89ba9f1
  resourceVersion: "5844"
  selfLink: /apis/apps/v1/namespaces/kubeflow/deployments/katib-controller
  uid: 5b12d4dd-a0b7-4f64-9785-fdfa969d49de
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: katib-controller
      app.kubernetes.io/component: katib
      app.kubernetes.io/instance: katib-controller-0.8.0
      app.kubernetes.io/managed-by: kfctl
      app.kubernetes.io/name: katib-controller
      app.kubernetes.io/part-of: kubeflow
      app.kubernetes.io/version: 0.8.0
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: katib-controller
        app.kubernetes.io/component: katib
        app.kubernetes.io/instance: katib-controller-0.8.0
        app.kubernetes.io/managed-by: kfctl
        app.kubernetes.io/name: katib-controller
        app.kubernetes.io/part-of: kubeflow
        app.kubernetes.io/version: 0.8.0
    spec:
      containers:
      - args:
        - --webhook-port=8443
        command:
        - ./katib-controller
        env:
        - name: KATIB_CORE_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        image: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0
        imagePullPolicy: IfNotPresent
        name: katib-controller
        ports:
        - containerPort: 8443
          name: webhook
          protocol: TCP
        - containerPort: 8080
          name: metrics
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp/cert
          name: cert
          readOnly: true
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: katib-controller
      serviceAccountName: katib-controller
      terminationGracePeriodSeconds: 30
      volumes:
      - name: cert
        secret:
          defaultMode: 420
          secretName: katib-controller
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-03-17T08:56:35Z"
    lastUpdateTime: "2021-03-17T08:56:40Z"
    message: ReplicaSet "katib-controller-5c976769d8" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-03-17T08:56:47Z"
    lastUpdateTime: "2021-03-17T08:56:47Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1