Katib example in docs is not working

/kind bug

What steps did you take and what happened: I have a running on-prem Kubernetes cluster (two nodes) and installed Kubeflow using the kfctl_k8s_istio config. Following Getting Started with Katib, I created the TensorFlow example and went through all 3 steps. This is my tfjob-example.yaml file:

apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  namespace: kubeflow
  name: tfjob-example
spec:
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: accuracy_1
  algorithm:
    algorithmName: random
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /train
        kind: Directory
    collector:
      kind: TensorFlowEvent
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.05"
    - name: batch_size
      parameterType: int
      feasibleSpace:
        min: "100"
        max: "200"
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
      - name: learningRate
        description: Learning rate for the training model
        reference: learning_rate
      - name: batchSize
        description: Batch Size
        reference: batch_size
    trialSpec:
      apiVersion: "kubeflow.org/v1"
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                  - name: tensorflow
                    image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                    imagePullPolicy: Always
                    command:
                      - "python"
                      - "/var/tf_mnist/mnist_with_summaries.py"
                      - "--log_dir=/train/metrics"
                      - "--learning_rate=${trialParameters.learningRate}"
                      - "--batch_size=${trialParameters.batchSize}"

What did you expect to happen: I expected to see the graphs and results of the experiments in Katib, but all experiments remained in the Running status, although the logs of the experiment containers show that they completed.

Anything else you would like to add: It seems the observation_logs table is empty:

$ kubectl -n kubeflow exec -it katib-mysql-5df4dddc57-jzdqs -- bash

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} -e 'show tables;'
mysql: [Warning] Using a password on the command line interface can be insecure.
+------------------+
| Tables_in_katib  |
+------------------+
| observation_logs |
+------------------+

root@katib-mysql-5df4dddc57-jzdqs:/# mysql -D ${MYSQL_DATABASE} -u root -p${MYSQL_ROOT_PASSWORD} 
mysql> select * from observation_logs;
Empty set (0.00 sec)

But I don't know why it happened or how to trace it. Everything else seems to be all right. Some other logs and debugging that I tried:

$ kubectl get pods --all-namespaces | grep tfj
kubeflow               tfjob-example-9sxb2jtg-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-9sxb2jtg-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-jtf9d96w-worker-1                              0/1     Completed   0          58m
kubeflow               tfjob-example-random-585dfc8499-r9g4x                        1/1     Running     0          58m
kubeflow               tfjob-example-twd8tsdk-worker-0                              0/1     Completed   0          58m
kubeflow               tfjob-example-twd8tsdk-worker-1                              0/1     Completed   0          58m
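
The READY column in the listing above is `ready/total` containers, so every worker pod here has exactly one container. If Katib's metrics collector is supposed to run as an extra sidecar container in each trial pod (an assumption, not something this report confirms), a total of 1 would suggest the sidecar was never added. A minimal sketch for reading that count out of the listing; `containers_per_pod` is a hypothetical helper name, and the sample rows are copied from the output above:

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --all-namespaces` rows
# (without the header line) on stdin and prints "<pod> <total containers>",
# taking the total from the READY column (formatted as ready/total).
containers_per_pod() {
  awk '{ split($3, ready, "/"); print $2, ready[2] }'
}

# Sample rows copied from the listing above.
sample='kubeflow tfjob-example-9sxb2jtg-worker-0 0/1 Completed 0 58m
kubeflow tfjob-example-random-585dfc8499-r9g4x 1/1 Running 0 58m'

printf '%s\n' "$sample" | containers_per_pod
```

Against a live cluster the same function could be fed with `kubectl get pods --all-namespaces --no-headers | grep tfj | containers_per_pod`.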

$ kubectl -n kubeflow get experiments
NAME            TYPE      STATUS   AGE
tfjob-example   Running   True     60m

$ kubectl -n kubeflow get trials
NAME                     TYPE      STATUS   AGE
tfjob-example-9sxb2jtg   Running   True     60m
tfjob-example-jtf9d96w   Running   True     60m
tfjob-example-twd8tsdk   Running   True     60m

$ kubectl -n kubeflow logs tfjob-example-9sxb2jtg-worker-0 --all-containers --tail=10
Accuracy at step 910: 0.9444
Accuracy at step 920: 0.9405
Accuracy at step 930: 0.9443
Accuracy at step 940: 0.9459
Accuracy at step 950: 0.9462
Accuracy at step 960: 0.9373
Accuracy at step 970: 0.9404
Accuracy at step 980: 0.945
Accuracy at step 990: 0.9485
Adding run metadata for 999

$ kubectl -n kubeflow logs -f katib-db-manager-59445ff6cb-wkcdp --all-containers
I0125 14:10:19.491012       1 init.go:11] Initializing v1beta1 DB schema
I0125 14:10:19.776431       1 main.go:92] Start Katib manager: 0.0.0.0:6789

$ kubectl -n kubeflow logs katib-controller-545bdfdb46-k6mlr --all-containers --tail=10
{"level":"info","ts":1612085695.0365138,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-twd8tsdk"}
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:32642: remote error: tls: bad certificate
{"level":"info","ts":1612085695.1833804,"logger":"trial-controller","msg":"Creating Job","Trial":"kubeflow/tfjob-example-9sxb2jtg","kind":"TFJob","name":"tfjob-example-9sxb2jtg"}
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:12279: remote error: tls: bad certificate
{"level":"info","ts":1612086144.2860768,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.2967129,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
{"level":"info","ts":1612086144.3100634,"logger":"suggestion-controller","msg":"Sync assignments","Suggestion":"kubeflow/tfjob-example","Suggestion Requests":3,"Suggestion Count":3}
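
Four of the ten controller log lines above are `TLS handshake error ... bad certificate` messages. Whether they are related to the stuck trials is not established here; as a rough sketch, counting them over a longer log window could show whether they keep recurring. `count_tls_errors` is a hypothetical helper name, and the sample lines are copied from the log above:

```shell
#!/bin/sh
# Hypothetical helper: counts log lines on stdin that report a TLS
# handshake error. `grep -c` exits non-zero when the count is 0, so
# `|| true` keeps the function usable under `set -e`.
count_tls_errors() {
  grep -c 'TLS handshake error' || true
}

# Sample lines copied from the katib-controller log above.
sample='2021/01/31 09:34:55 http: TLS handshake error from 10.244.0.0:35509: remote error: tls: bad certificate
{"level":"info","ts":1612085695.2675023,"logger":"trial-controller","msg":"Trial status changed to Running","Trial":"kubeflow/tfjob-example-9sxb2jtg"}
2021/01/31 09:34:56 http: TLS handshake error from 10.244.0.0:7037: remote error: tls: bad certificate'

printf '%s\n' "$sample" | count_tls_errors
```

On a cluster, the input could come from `kubectl -n kubeflow logs katib-controller-545bdfdb46-k6mlr --all-containers | count_tls_errors`, using the pod name from the command above.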

$ kubectl -n kubeflow get experiment tfjob-example -o yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1beta1","kind":"Experiment","metadata":{"annotations":{},"name":"tfjob-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"metricsCollectorSpec":{"collector":{"kind":"TensorFlowEvent"},"source":{"fileSystemPath":{"kind":"Directory","path":"/train"}}},"objective":{"goal":0.99,"objectiveMetricName":"accuracy_1","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.05","min":"0.01"},"name":"learning_rate","parameterType":"double"},{"feasibleSpace":{"max":"200","min":"100"},"name":"batch_size","parameterType":"int"}],"trialTemplate":{"primaryContainerName":"tensorflow","trialParameters":[{"description":"Learning rate for the training model","name":"learningRate","reference":"learning_rate"},{"description":"Batch Size","name":"batchSize","reference":"batch_size"}],"trialSpec":{"apiVersion":"kubeflow.org/v1","kind":"TFJob","spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py","--log_dir=/train/metrics","--learning_rate=${trialParameters.learningRate}","--batch_size=${trialParameters.batchSize}"],"image":"gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0","imagePullPolicy":"Always","name":"tensorflow"}]}}}}}}}}}
  creationTimestamp: "2021-01-31T09:34:38Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  managedFields:
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        .: {}
        f:algorithm:
          .: {}
          f:algorithmName: {}
        f:maxFailedTrialCount: {}
        f:maxTrialCount: {}
        f:metricsCollectorSpec:
          .: {}
          f:collector:
            .: {}
            f:kind: {}
          f:source:
            .: {}
            f:fileSystemPath:
              .: {}
              f:kind: {}
              f:path: {}
        f:objective:
          .: {}
          f:goal: {}
          f:objectiveMetricName: {}
          f:type: {}
        f:parallelTrialCount: {}
        f:parameters: {}
        f:trialTemplate:
          .: {}
          f:primaryContainerName: {}
          f:trialParameters: {}
          f:trialSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:spec:
              .: {}
              f:tfReplicaSpecs:
                .: {}
                f:Worker:
                  .: {}
                  f:replicas: {}
                  f:restartPolicy: {}
                  f:template:
                    .: {}
                    f:metadata:
                      .: {}
                      f:annotations:
                        .: {}
                        f:sidecar.istio.io/inject: {}
                    f:spec:
                      .: {}
                      f:containers: {}
    manager: kubectl-client-side-apply
    operation: Update
    time: "2021-01-31T09:34:38Z"
  - apiVersion: kubeflow.org/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers: {}
      f:status:
        .: {}
        f:conditions: {}
        f:currentOptimalTrial:
          .: {}
          f:bestTrialName: {}
          f:observation:
            .: {}
            f:metrics: {}
          f:parameterAssignments: {}
        f:runningTrialList: {}
        f:startTime: {}
        f:trials: {}
        f:trialsRunning: {}
    manager: katib-controller
    operation: Update
    time: "2021-01-31T09:34:55Z"
  name: tfjob-example
  namespace: kubeflow
  resourceVersion: "5381129"
  uid: e6aedc20-d3ed-4829-ba49-c2a957427249
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: TensorFlowEvent
    source:
      fileSystemPath:
        kind: Directory
        path: /train
  objective:
    goal: 0.99
    objectiveMetricName: accuracy_1
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.05"
      min: "0.01"
    name: learning_rate
    parameterType: double
  - feasibleSpace:
      max: "200"
      min: "100"
    name: batch_size
    parameterType: int
  trialTemplate:
    primaryContainerName: tensorflow
    trialParameters:
    - description: Learning rate for the training model
      name: learningRate
      reference: learning_rate
    - description: Batch Size
      name: batchSize
      reference: batch_size
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: TFJob
      spec:
        tfReplicaSpecs:
          Worker:
            replicas: 2
            restartPolicy: OnFailure
            template:
              metadata:
                annotations:
                  sidecar.istio.io/inject: "false"
              spec:
                containers:
                - command:
                  - python
                  - /var/tf_mnist/mnist_with_summaries.py
                  - --log_dir=/train/metrics
                  - --learning_rate=${trialParameters.learningRate}
                  - --batch_size=${trialParameters.batchSize}
                  image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
                  imagePullPolicy: Always
                  name: tensorflow
status:
  conditions:
  - lastTransitionTime: "2021-01-31T09:34:38Z"
    lastUpdateTime: "2021-01-31T09:34:38Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-01-31T09:34:54Z"
    lastUpdateTime: "2021-01-31T09:34:54Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    bestTrialName: ""
    observation:
      metrics: null
    parameterAssignments: null
  runningTrialList:
  - tfjob-example-9sxb2jtg
  - tfjob-example-jtf9d96w
  - tfjob-example-twd8tsdk
  startTime: "2021-01-31T09:34:38Z"
  trials: 3
  trialsRunning: 3

Environment:

Kubeflow version: kfctl v1.2.0-0-gbc038f9
Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:09:25Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}