Objective metric name not found in v1alpha2/tfjob-example.yaml


/kind bug

What steps did you take and what happened:

I followed https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/tfjob-example.yaml exactly.

I only changed the trialTemplate to run code that I had already tested successfully in a separate YAML file.

I also added a persistentVolumeClaim.

The model code we use is not the one built into the image; it comes from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py

Checking the Katib controller logs, I found: {"level":"info","ts":1564455112.5005455,"logger":"experiment-status-util","caller":"util/status_util.go:72","msg":"Objective metric name not found","trial":"katib-mnist-with-summaries-from-hdfs-w7csjwhp"}
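
To list every affected trial from the controller logs instead of reading them by eye, a small script along these lines may help. It is only a minimal sketch, assuming the logs were saved as one JSON object per line, as in the entry above. The function names are hypothetical and not part of Katib.

```python
import json

# Typographic quotes sometimes replace ASCII quotes when logs are pasted into web pages.
_SMART_QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"'})


def parse_controller_log(line: str) -> dict:
    """Parse one JSON-formatted controller log line into a dict."""
    return json.loads(line.translate(_SMART_QUOTES))


def trials_missing_objective(lines):
    """Return the trial names for which 'Objective metric name not found' was logged."""
    trials = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        entry = parse_controller_log(line)
        if entry.get("msg") == "Objective metric name not found":
            trials.append(entry.get("trial"))
    return trials


sample = (
    '{"level":"info","ts":1564455112.5005455,'
    '"logger":"experiment-status-util","caller":"util/status_util.go:72",'
    '"msg":"Objective metric name not found",'
    '"trial":"katib-mnist-with-summaries-from-hdfs-w7csjwhp"}'
)
print(trials_missing_objective([sample]))
# ['katib-mnist-with-summaries-from-hdfs-w7csjwhp']
```

If the same trial shows up repeatedly, that suggests the collector keeps running but never reports the objective metric for it.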

What did you expect to happen:

The objective metric name (https://github.com/kubeflow/katib/blob/master/examples/v1alpha2/tfjob-example.yaml#L13) should be found so that the Katib experiment can complete.

Anything else you would like to add:

apiVersion: "kubeflow.org/v1alpha2"
kind: Experiment
metadata:
  namespace: kubeflow
  name: katib-mnist-with-summaries-from-hdfs
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  parameters:
    - name: --learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: --batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    retain: true
    goTemplate:
        rawTemplate: |-
          apiVersion: kubeflow.org/v1
          kind: TFJob
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            tfReplicaSpecs:
              Ps:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    containers:
                    - args: &args
                      - "python3 /code/mnist_with_summaries.py
                        --data_dir XXX
                        --log_dir /shared-data/{{.Trial}}
                        {{- with .HyperParameters}}
                        {{- range .}}
                        {{.Name}}={{.Value}}
                        {{- end}}
                        {{- end}}"
                      command: &command
                      - /bin/sh
                      - -c
                      image: &image XXX
                      name: &name tensorflow
                      volumeMounts: &volumeMounts
                      - name: mnist-with-summaries-code
                        mountPath: /code
                      - name: shared-data
                        mountPath: /var/tmp
                    volumes: &volumes
                      - name: mnist-with-summaries-code
                        configMap:
                          name: mnist-with-summaries-code
                      - name: shared-data
                        emptyDir: {}
                        persistentVolumeClaim:
                          claimName: katib-mysql
                tfReplicaType: PS
              Chief:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    initContainers: *initContainers
                    containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                    volumes: *volumes
                tfReplicaType: MASTER
              Worker:
                replicas: 2
                restartPolicy: OnFailure
                template:
                  spec:
                    initContainers: *initContainers
                    containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                    volumes: *volumes
                tfReplicaType: WORKER
              Evaluator:
                replicas: 1
                restartPolicy: OnFailure
                template:
                  spec:
                    initContainers: *initContainers
                    containers:
                    - args: *args
                      command: *command
                      env: *env
                      image: *image
                      name: *name
                      volumeMounts: *volumeMounts
                    volumes: *volumes
                tfReplicaType: WORKER
  metricsCollectorSpec:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1beta1
        kind: CronJob
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          schedule: "*/1 * * * *"
          successfulJobsHistoryLimit: 0
          failedJobsHistoryLimit: 1
          jobTemplate:
            spec:
              template:
                spec:
                  containers:
                  - name: {{.Trial}}
                    image: XXX
                    args:
                    - "python"
                    - "main.py"
                    - "-t"
                    - "{{.Trial}}"
                    - "-d"
                    - "/shared-data/{{.Trial}}"
                    - "-m"
                    - "accuracy_1"
                    volumeMounts:
                        - name: shared-data
                          mountPath: /var/tmp                    
                  volumes:
                    - name: shared-data
                      persistentVolumeClaim:
                          claimName: katib-mysql
                  restartPolicy: Never
                  serviceAccountName: metrics-collector
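
Reading the manifest as pasted, a few things stand out that may be worth ruling out, although none of them is confirmed as the cause in this thread. The training job writes to /shared-data/{{.Trial}} and the metrics collector reads from the same path, while the shared-data volume is mounted at /var/tmp in both. The aliases *initContainers and *env have no matching anchors (they may have been trimmed before posting). The shared-data volume in the TFJob also declares both emptyDir and persistentVolumeClaim, whereas Kubernetes accepts only one volume source per volume. The Python sketch below shows how the first two could be checked mechanically. The helper names are hypothetical, and the alias check is a regex heuristic rather than a real YAML parser.

```python
import re
from pathlib import PurePosixPath


def undefined_aliases(manifest: str) -> set:
    """Return YAML alias names (*name) used as values without a matching anchor (&name)."""
    anchors = set(re.findall(r":\s*&(\w+)", manifest))
    aliases = set(re.findall(r":\s*\*(\w+)", manifest))
    return aliases - anchors


def is_under_any_mount(path: str, mount_paths) -> bool:
    """True if `path` equals or lies below one of the container's mount paths."""
    p = PurePosixPath(path)
    return any(
        p == PurePosixPath(m) or PurePosixPath(m) in p.parents
        for m in mount_paths
    )


# Excerpt of the anchors and aliases used in the trial template above.
excerpt = """
- args: &args
  command: &command
  initContainers: *initContainers
  args: *args
  env: *env
  schedule: "*/1 * * * *"
"""
print(sorted(undefined_aliases(excerpt)))
# ['env', 'initContainers']

# Mount paths of the training container, and the directory passed via --log_dir.
print(is_under_any_mount("/shared-data/some-trial", ["/code", "/var/tmp"]))
# False
```

A result of False means the log directory would sit on the container's own filesystem rather than on the shared volume, so a separate metrics collector pod would not see the event files written there.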

Environment:

  • Kubeflow version: 0.5
  • Minikube version: N/A, own cluster
  • Kubernetes version (from kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.0", GitCommit:"641856db18352033a0d96dbc99153fa3b27298e5", GitTreeState:"clean", BuildDate:"2019-03-25T15:45:25Z", GoVersion:"go1.12.1", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release): rhel6

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (6 by maintainers)

Top GitHub Comments

gaocegege commented, Jul 30, 2019 (1 reaction)

@cyzhangchenya Yes, we are actually designing the next-generation metrics collector, and we should definitely output some errors somewhere.

k8s-ci-robot commented, Oct 10, 2019 (0 reactions)

@gaocegege: Closing this issue.

In response to this:

@cyzhangchenya I suggest using v1alpha3. It no longer involves a DB in the main workflow.

/close

