Objective metric name not found in v1alpha2/tfjob-example.yaml
/kind bug
What steps did you take and what happened:
I followed https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/tfjob-example.yaml exactly. I only changed the trialTemplate to some code I had already tested successfully in a separate YAML file, and I also added a persistentVolumeClaim.
The model code we use is not the one built into the image; it comes from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py
Checking the Katib controller logs, I found:
{"level":"info","ts":1564455112.5005455,"logger":"experiment-status-util","caller":"util/status_util.go:72","msg":"Objective metric name not found","trial":"katib-mnist-with-summaries-from-hdfs-w7csjwhp"}
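For anyone hitting the same message across many trials, here is a minimal sketch of how the affected trial names could be pulled out of the controller log. It assumes the log is one JSON object per line with the `msg` and `trial` fields seen in the entry above; the function name `trials_missing_objective` is made up for illustration and is not part of Katib.

```python
import json


def trials_missing_objective(log_lines):
    """Return the trial names whose log entry reports a missing objective metric.

    Assumes one JSON object per line, shaped like the controller entry quoted
    in this issue. Lines that are not valid JSON objects are skipped.
    """
    trials = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") == "Objective metric name not found":
            trials.append(entry.get("trial"))
    return trials


# The log entry from this issue, used as sample input.
sample = [
    '{"level":"info","ts":1564455112.5005455,'
    '"logger":"experiment-status-util","caller":"util/status_util.go:72",'
    '"msg":"Objective metric name not found",'
    '"trial":"katib-mnist-with-summaries-from-hdfs-w7csjwhp"}'
]
affected = trials_missing_objective(sample)
```

The log lines themselves would come from something like `kubectl logs` on the controller pod; the exact pod name depends on the deployment and is not shown here.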
What did you expect to happen:
The objective metric name (https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/tfjob-example.yaml#L13) can be found so the Katib experiment can be completed.
Anything else you would like to add:
apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: katib-mnist-with-summaries-from-hdfs
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    retain: true
    goTemplate:
      rawTemplate: |-
        apiVersion: kubeflow.org/v1
        kind: TFJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          tfReplicaSpecs:
            Ps:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - args: &args
                        - "python3 /code/mnist_with_summaries.py
                          --data_dir XXX
                          --log_dir /shared-data/{{.Trial}}
                          {{- with .HyperParameters}}
                          {{- range .}}
                          {{.Name}}={{.Value}}
                          {{- end}}
                          {{- end}}"
                      command: &command
                        - /bin/sh
                        - -c
                      image: &image XXX
                      name: &name tensorflow
                      volumeMounts: &volumeMounts
                        - name: mnist-with-summaries-code
                          mountPath: /code
                        - name: shared-data
                          mountPath: /var/tmp
                  volumes: &volumes
                    - name: mnist-with-summaries-code
                      configMap:
                        name: mnist-with-summaries-code
                    - name: shared-data
                      emptyDir: {}
                      persistentVolumeClaim:
                        claimName: katib-mysql
              tfReplicaType: PS
            Chief:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  initContainers: *initContainers
                  containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                  volumes: *volumes
              tfReplicaType: MASTER
            Worker:
              replicas: 2
              restartPolicy: OnFailure
              template:
                spec:
                  initContainers: *initContainers
                  containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                  volumes: *volumes
              tfReplicaType: WORKER
            Evaluator:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  initContainers: *initContainers
                  containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                  volumes: *volumes
              tfReplicaType: WORKER
  metricsCollectorSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1beta1
        kind: CronJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          schedule: "*/1 * * * *"
          successfulJobsHistoryLimit: 0
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              template:
                spec:
                  containers:
                    - name: {{.Trial}}
                      image: XXX
                      args:
                        - "python"
                        - "main.py"
                        - "-t"
                        - "{{.Trial}}"
                        - "-d"
                        - "/shared-data/{{.Trial}}"
                        - "-m"
                        - "accuracy_1"
                      volumeMounts:
                        - name: shared-data
                          mountPath: /var/tmp
                  volumes:
                    - name: shared-data
                      persistentVolumeClaim:
                        claimName: katib-mysql
                  restartPolicy: Never
                  serviceAccountName: metrics-collector
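One thing that may be worth checking in the templates above: the trial writes its summaries to /shared-data/{{.Trial}} and the metrics collector reads from the same path, yet both templates mount the shared-data volume at /var/tmp. Whether this is what causes the missing metric is not confirmed in this issue. As a rough sketch only, the consistency check could look like the following; `is_under_mount` is a hypothetical helper, not a Katib or Kubernetes API, and the trial name `trial-1` is a placeholder.

```python
import posixpath


def is_under_mount(path, mount_paths):
    """Return True if `path` equals or lies below one of `mount_paths`.

    Paths are treated as POSIX container paths and normalized first, so
    "/var/tmp/" and "/var/tmp" compare as equal.
    """
    path = posixpath.normpath(path)
    for mount in mount_paths:
        mount = posixpath.normpath(mount)
        # Require a path-separator boundary so "/var/tmpfoo" does not
        # count as being inside "/var/tmp".
        prefix = mount.rstrip("/") + "/"
        if path == mount or path.startswith(prefix):
            return True
    return False


# Mount paths taken from the volumeMounts in the trial template above.
mounts = ["/code", "/var/tmp"]

# The directory passed as --log_dir (trial name is a placeholder).
log_dir_on_volume = is_under_mount("/shared-data/trial-1", mounts)

# A directory under the actual mount point, for comparison.
tmp_dir_on_volume = is_under_mount("/var/tmp/trial-1", mounts)
```

If the log directory is not under a mounted volume, each container would write to or read from its own local filesystem, so the collector pod would not see the files the trial pods produced.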
Environment:
- Kubeflow version: 0.5
- Minikube version: N/A, own cluster
- Kubernetes version: (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
- OS (e.g. from /etc/os-release): rhel6
Issue Analytics
- Created 4 years ago
- Comments:12 (6 by maintainers)
Top GitHub Comments
@cyzhangchenya Yeah, actually we are designing the next generation metrics collector. And we definitely output some errors somewhere.
@gaocegege: Closing this issue.