Katib tfjob-example fails
/kind bug
I was trying to run the tfjob-example example from the Katib documentation, but it failed.
What steps did you take and what happened:
$ kubectl -n kubeflow get studyjob
NAME CONDITION AGE
random-example Completed 9h
$ kubectl create -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfjob-example.yaml
studyjob.kubeflow.org/tfjob-example created
$ kubectl -n kubeflow get studyjob
NAME CONDITION AGE
random-example Completed 9h
tfjob-example Failed 3s
What did you expect to happen: It should not fail.
Anything else you would like to add:
$ kubectl -n kubeflow describe studyjob tfjob-example
Name: tfjob-example
Namespace: kubeflow
Labels: controller-tools.k8s.io=1.0
Annotations: <none>
API Version: kubeflow.org/v1alpha1
Kind: StudyJob
Metadata:
Creation Timestamp: 2019-06-18T15:48:56Z
Finalizers:
clean-studyjob-data
Generation: 4
Resource Version: 186198
Self Link: /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/tfjob-example
UID: 94d2cd1b-91e0-11e9-ad93-ac1f6b2d2bc6
Spec:
Metrics Collector Spec:
Go Template:
Raw Template: apiVersion: batch/v1beta1
kind: CronJob
metadata:
name: {{.WorkerID}}
namespace: kubeflow
spec:
schedule: "*/1 * * * *"
successfulJobsHistoryLimit: 0
failedJobsHistoryLimit: 1
jobTemplate:
spec:
template:
spec:
containers:
- name: {{.WorkerID}}
image: gcr.io/kubeflow-ci/katib/tfevent-metrics-collector:v0.1.2-alpha-77-g9324cad
args:
- "python"
- "main.py"
- "-m"
- "vizier-core"
- "-s"
- "{{.StudyID}}"
- "-w"
- "{{.WorkerID}}"
- "-d"
- "/train/{{.WorkerID}}"
volumeMounts:
- mountPath: "/train"
name: "train"
volumes:
- name: "train"
persistentVolumeClaim:
claimName: "tfevent-volume"
restartPolicy: Never
serviceAccountName: metrics-collector
Metricsnames:
accuracy_1
Objectivevaluename: accuracy_1
Optimizationgoal: 0.99
Optimizationtype: maximize
Owner: crd
Parameterconfigs:
Feasible:
Max: 0.05
Min: 0.01
Name: --learning_rate
Parametertype: double
Feasible:
Max: 200
Min: 100
Name: --batch_size
Parametertype: int
Requestcount: 4
Study Name: tfjob-example
Suggestion Spec:
Request Number: 3
Suggestion Algorithm: random
Suggestion Parameters:
Name: SuggestionCount
Value: 0
Worker Spec:
Go Template:
Raw Template: apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
name: {{.WorkerID}}
namespace: kubeflow
spec:
tfReplicaSpecs:
Worker:
replicas: 1
restartPolicy: Never
template:
spec:
containers:
- name: tensorflow
image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
command:
- "python"
- "/var/tf_mnist/mnist_with_summaries.py"
- "--log_dir=/train/{{.WorkerID}}"
{{- with .HyperParameters}}
{{- range .}}
- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
volumeMounts:
- mountPath: "/train"
name: "train"
volumes:
- name: "train"
persistentVolumeClaim:
claimName: "tfevent-volume"
Status:
Condition: Failed
Last Reconcile Time: 2019-06-18T15:48:57Z
Start Time: 2019-06-18T15:48:56Z
Studyid: f938b35c820480fb
Suggestion Parameter Id: b268eead24b9e881
Events: <none>
Environment:
- Kubeflow version: 0.5.1
- Kubernetes version (use kubectl version): 1.14.3
- OS (e.g. from /etc/os-release): Ubuntu 18.04.2 LTS
Top GitHub Comments
https://github.com/kubeflow/katib/blob/master/examples/v1alpha1/tfjob-example.yaml#L31 Can you change it to kubeflow.org/v1beta1 and try the same example again? You have to delete the example and reapply it.

@johnugeorge Thanks for the quick reply; that made it work.
Closing this issue.
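For anyone hitting the same failure, the suggested fix can be sketched roughly as follows. This is a minimal sketch, not the maintainers' exact procedure: the helper name patch_tfjob_api_version is hypothetical, and it assumes the worker template line reads apiVersion: "kubeflow.org/v1" exactly as shown in the describe output above. The cluster commands are left commented out because they need access to the cluster.

```shell
# Hypothetical helper (not part of Katib): rewrite the TFJob apiVersion in the
# worker template from kubeflow.org/v1 to kubeflow.org/v1beta1.
# The closing quote in the pattern keeps the StudyJob's own
# "kubeflow.org/v1alpha1" line untouched.
# Reads the file given as an argument, or stdin when no argument is given.
patch_tfjob_api_version() {
  sed 's|apiVersion: "kubeflow.org/v1"|apiVersion: "kubeflow.org/v1beta1"|' "$@"
}

# Sketch of the full sequence (requires cluster access, so left commented out):
#
# Optionally list the kubeflow.org API versions the cluster serves:
#   kubectl api-versions | grep kubeflow.org
#
# Fetch a local copy of the example:
#   curl -sLO https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1alpha1/tfjob-example.yaml
#
# Delete the failed StudyJob, then recreate it from the patched manifest:
#   kubectl -n kubeflow delete studyjob tfjob-example
#   patch_tfjob_api_version tfjob-example.yaml | kubectl create -f -
```

Afterwards, kubectl -n kubeflow get studyjob should show whether tfjob-example still reports Failed.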