Katib tfjob-example fails

/kind bug

I was trying to run the tfjob-example from the Katib documentation, but it failed.

What steps did you take and what happened:

$ kubectl -n kubeflow get studyjob
NAME             CONDITION   AGE
random-example   Completed   9h
$ kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfjob-example.yaml
studyjob.kubeflow.org/tfjob-example created
$ kubectl -n kubeflow get studyjob
NAME             CONDITION   AGE
random-example   Completed   9h
tfjob-example    Failed      3s

What did you expect to happen: It should not fail.

Anything else you would like to add:

$ kubectl -n kubeflow describe studyjob tfjob-example
Name:         tfjob-example
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-06-18T15:48:56Z
  Finalizers:
    clean-studyjob-data
  Generation:        4
  Resource Version:  186198
  Self Link:         /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/tfjob-example
  UID:               94d2cd1b-91e0-11e9-ad93-ac1f6b2d2bc6
Spec:
  Metrics Collector Spec:
    Go Template:
      Raw Template:  apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: {{.WorkerID}}
            image: gcr.io/kubeflow-ci/katib/tfevent-metrics-collector:v0.1.2-alpha-77-g9324cad
            args:
            - "python"
            - "main.py"
            - "-m"
            - "vizier-core"
            - "-s"
            - "{{.StudyID}}"
            - "-w"
            - "{{.WorkerID}}"
            - "-d"
            - "/train/{{.WorkerID}}"
            volumeMounts:
                - mountPath: "/train"
                  name: "train"
          volumes:
            - name: "train"
              persistentVolumeClaim:
                  claimName: "tfevent-volume"
          restartPolicy: Never
          serviceAccountName: metrics-collector
  Metricsnames:
    accuracy_1
  Objectivevaluename:  accuracy_1
  Optimizationgoal:    0.99
  Optimizationtype:    maximize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      Max:          0.05
      Min:          0.01
    Name:           --learning_rate
    Parametertype:  double
    Feasible:
      Max:          200
      Min:          100
    Name:           --batch_size
    Parametertype:  int
  Requestcount:     4
  Study Name:       tfjob-example
  Suggestion Spec:
    Request Number:        3
    Suggestion Algorithm:  random
    Suggestion Parameters:
      Name:   SuggestionCount
      Value:  0
  Worker Spec:
    Go Template:
      Raw Template:  apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: {{.WorkerID}}
  namespace: kubeflow
spec:
 tfReplicaSpecs:
  Worker:
    replicas: 1
    restartPolicy: Never
    template:
      spec:
        containers:
          - name: tensorflow
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            command:
              - "python"
              - "/var/tf_mnist/mnist_with_summaries.py"
              - "--log_dir=/train/{{.WorkerID}}"
              {{- with .HyperParameters}}
              {{- range .}}
              - "{{.Name}}={{.Value}}"
              {{- end}}
              {{- end}}
            volumeMounts:
              - mountPath: "/train"
                name: "train"
        volumes:
          - name: "train"
            persistentVolumeClaim:
              claimName: "tfevent-volume"
Status:
  Condition:                Failed
  Last Reconcile Time:      2019-06-18T15:48:57Z
  Start Time:               2019-06-18T15:48:56Z
  Studyid:                  f938b35c820480fb
  Suggestion Parameter Id:  b268eead24b9e881
Events:                     <none>

Environment:

  • Kubeflow version: 0.5.1
  • Kubernetes version (from kubectl version): 1.14.3
  • OS (from /etc/os-release): Ubuntu 18.04.2 LTS
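
The worker template in the describe output above requests apiVersion "kubeflow.org/v1" for the TFJob. One way to check whether a cluster serves a given group/version is to filter the output of kubectl api-versions. The following is a minimal sketch, not part of the original report; serves_api_version is a hypothetical helper name, and the usage lines assume kubectl is configured for the cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when the exact group/version in $1
# appears on stdin, one entry per line, which is the format that
# `kubectl api-versions` prints. -x matches whole lines and -F disables
# regex handling, so kubeflow.org/v1 does not match kubeflow.org/v1alpha1.
serves_api_version() {
  grep -qxF -- "$1"
}

# Usage (assumes kubectl access to the cluster):
#   kubectl api-versions | serves_api_version kubeflow.org/v1 \
#     && echo "served" || echo "not served"
```

If the version is not served, the TFJob in the worker template probably cannot be created, which would be consistent with the StudyJob failing right after creation.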

Top GitHub Comments

johnugeorge commented, Jun 18, 2019

See https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/tfjob-example.yaml#L31. Can you change the apiVersion there to kubeflow.org/v1beta1 and try the same example again? You have to delete the existing example and reapply it.
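
The suggestion above can be sketched as follows. This is an illustration under assumptions, not the maintainer's exact procedure: patch_tfjob_api_version is a hypothetical helper, the local file names are made up, and the manifest URL is the one used in the report, which may no longer exist on master.

```shell
#!/bin/sh
# Hypothetical helper: print the manifest in $1 with the TFJob apiVersion
# changed from kubeflow.org/v1 to kubeflow.org/v1beta1. The pattern requires
# the line to end right after "v1" (plus an optional closing quote), so the
# StudyJob's own kubeflow.org/v1alpha1 line is left untouched.
patch_tfjob_api_version() {
  sed -E 's#(apiVersion:[[:space:]]*"?kubeflow\.org/v1)("?[[:space:]]*)$#\1beta1\2#' "$1"
}

# Usage (assumes kubectl access to the cluster; file names are hypothetical):
#   curl -sSLo tfjob-example.yaml https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfjob-example.yaml
#   patch_tfjob_api_version tfjob-example.yaml > tfjob-example-v1beta1.yaml
#   kubectl -n kubeflow delete studyjob tfjob-example
#   kubectl create -f tfjob-example-v1beta1.yaml
```

Writing the patched manifest to a new file rather than editing in place keeps the original for comparison and avoids the differing sed -i behaviour across platforms.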

asispatra commented, Jun 18, 2019

@johnugeorge Thanks for the quick reply; that made it work.

Closing this issue.
