Trial job is succeeded but metrics are not reported, reconcile requeued

/kind bug

What steps did you take and what happened: I tried to run the random experiment example through the Katib UI. (I also tried creating the experiment with Python, and the same error occurs.)

While creating the experiment in the UI, I changed only the trial template (YAML), to this:

apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
    katib-metricscollector-injection: enabled
    katib-metrics-collector-injection: enabled
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        katib-metricscollector-injection: enabled
        katib-metrics-collector-injection: enabled
    spec:
      containers:
        - name: training-container
          image: docker.io/kubeflowkatib/mxnet-mnist:latest
          command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=${trialParameters.learningRate}"
            - "--num-layers=${trialParameters.numberLayers}"
            - "--optimizer=${trialParameters.optimizer}"
      restartPolicy: Never

After a couple of minutes, the pods created by the job terminated with the status Completed and printed my objective metric as follows:

2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802

But the experiment, suggestions, and trials remain in the Running status, and no new trials are created.

When I check the katib-controller logs, I see the following messages:
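As a quick sanity check on the log lines above, the following minimal sketch tests whether they contain the objective and additional metrics in a parsable `name=value` form. The regular expression is an assumption that only approximates what a StdOut-style collector looks for; it is not copied from Katib's source, and `extract_metrics` is a hypothetical helper.

```python
import re

# Assumption: an illustrative "name=value" pattern, not Katib's actual filter.
METRIC_PATTERN = re.compile(
    r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[Ee][+-]?\d+)?)"
)

def extract_metrics(log_lines, wanted):
    """Return {metric_name: last_value_seen} for the wanted metric names."""
    found = {}
    for line in log_lines:
        for name, value in METRIC_PATTERN.findall(line):
            if name in wanted:
                found[name] = float(value)
    return found

# The three log lines quoted in the report.
LOG = [
    "2022-01-25T20:26:59Z INFO     Epoch[9] Train-accuracy=0.993770",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Time cost=5.344",
    "2022-01-25T20:26:59Z INFO     Epoch[9] Validation-accuracy=0.978802",
]

metrics = extract_metrics(LOG, {"Validation-accuracy", "Train-accuracy"})
print(metrics)
```

If both metric names show up here, the training output itself is probably not the problem, and the place to look is whether a metrics collector is running alongside the training container and can report what it reads.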

{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}
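When the controller log is long, it can help to pull out just the trials that are stuck in this state. The sketch below assumes the log has been saved as text (for example from `kubectl logs`) and that each entry is a JSON object with `msg` and `Trial` keys, as in the lines above; `stuck_trials` is a hypothetical helper name.

```python
import json

STUCK_MSG = "Trial job is succeeded but metrics are not reported, reconcile requeued"

def stuck_trials(log_text):
    """Return the sorted, de-duplicated trial names that logged STUCK_MSG."""
    trials = set()
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if entry.get("msg") == STUCK_MSG and "Trial" in entry:
            trials.add(entry["Trial"])
    return sorted(trials)

# Two entries shaped like the controller output quoted in the report.
SAMPLE = "\n".join([
    '{"level":"info","ts":1643142603.5533006,"logger":"trial-controller",'
    '"msg":"Trial job is succeeded but metrics are not reported, reconcile requeued",'
    '"Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}',
    '{"level":"info","ts":1643142603.633143,"logger":"trial-controller",'
    '"msg":"Trial job is succeeded but metrics are not reported, reconcile requeued",'
    '"Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}',
])

print(stuck_trials(SAMPLE))
```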

Additional Information:

kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com

Results in:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2022-01-25T20:25:22Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-experiment
  namespace: kubeflow-user-example-com
  resourceVersion: "126860285"
  uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
      step: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "64"
      min: "1"
      step: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: learningRate
      reference: lr
    - name: numberLayers
      reference: num-layers
    - name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        annotations:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
        labels:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
      spec:
        template:
          metadata:
            annotations:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
            labels:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:latest
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:22Z"
    lastUpdateTime: "2022-01-25T20:25:22Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - random-experiment-smw6p6rg
  - random-experiment-c9qr67ww
  - random-experiment-vzkjcznm
  startTime: "2022-01-25T20:25:22Z"
  trials: 3
  trialsRunning: 3

and

kubectl get trial random-experiment-c9qr67ww -n kubeflow-user-example-com -o yaml

Results in:

apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2022-01-25T20:25:44Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random-experiment
  name: random-experiment-c9qr67ww
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-experiment
    uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
  resourceVersion: "126860266"
  uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.018768621111940782"
  - name: num-layers
    value: "7"
  - name: optimizer
    value: sgd
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      annotations:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      labels:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      name: random-experiment-c9qr67ww
      namespace: kubeflow-user-example-com
    spec:
      template:
        metadata:
          annotations:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
          labels:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.018768621111940782
            - --num-layers=7
            - --optimizer=sgd
            image: docker.io/kubeflowkatib/mxnet-mnist:latest
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: "2022-01-25T20:25:44Z"

What did you expect to happen: Ideally, once the metrics are captured and the goal/maxTrial is reached, the trial status should change to succeeded.

What am I missing?

Thanks

Issue Analytics

State:
Created 2 years ago
Comments: 14 (11 by maintainers)

Top GitHub Comments

sjmccorm1993 commented, Sep 13, 2022

@joaquingarciaatos After facing this same issue, I eventually realized the problem was that while the webhook service listens on port 443, the webhook pod listens on port 8443 and network traffic to 8443 was blocked in our infrastructure setup. Opening port 8443 on our system nodes solved the problem.
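A blocked port like this can be checked with a plain TCP connection attempt from a machine or pod on the same network path as the API server. The following is a minimal sketch; `can_connect` is a hypothetical helper, and the host you pass in (for instance the webhook pod's IP as shown by `kubectl get pods -o wide`) depends on your own cluster.

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_connect(pod_ip, 8443)` with the webhook pod's address and getting `False` while the pod is running would be consistent with the firewall problem described above. A `True` result only shows that the port is reachable, not that the TLS handshake or the webhook itself works.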

joaquingarciaatos commented, Sep 12, 2022

Dear @DnPlas:

In order to further debug (in case anyone runs into similar issues), I suggest:

Check the above comments from @johnugeorge and @tenzen-y

Check the validation and webhook certificates

Change the failurePolicy to Fail; it helps catch errors in the early stages and prevents things from failing silently

Could you please clarify exactly what you mean by that? In which pod should I change the failurePolicy to Fail? I have not found a pod where I can do that.

Verify that the metrics container can communicate with the katib-db

Could you explain in a bit more detail how I can verify that? Thank you in advance.

@charlescurt your issue looks similar to what I was experiencing. In my case, I had to ensure the caBundle was correctly populated in the validation and mutating webhooks. Check the certificates at /tmp/cert in the webhook server and make sure you can make POST requests to it using the certs in your webhooks.
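On the failurePolicy question: in Kubernetes, `failurePolicy` and `clientConfig.caBundle` are fields of each entry under `webhooks` in a MutatingWebhookConfiguration or ValidatingWebhookConfiguration object, not fields of a pod. The sketch below inspects such an object after it has been loaded as a dictionary (for example from `kubectl get mutatingwebhookconfiguration <name> -o json`). `audit_webhooks`, the webhook names, and the placeholder certificate are hypothetical, and the check only confirms that the bundle decodes to something shaped like a PEM certificate, not that it is the right CA.

```python
import base64

def audit_webhooks(config):
    """Summarise failurePolicy and caBundle state for each webhook entry."""
    report = []
    for hook in config.get("webhooks", []):
        ca_b64 = hook.get("clientConfig", {}).get("caBundle") or ""
        try:
            pem = base64.b64decode(ca_b64, validate=True).decode("ascii", "replace")
        except ValueError:
            pem = ""  # not valid base64, so treat it as missing
        report.append({
            "name": hook.get("name"),
            "failurePolicy": hook.get("failurePolicy"),
            "caBundleLooksValid": "BEGIN CERTIFICATE" in pem,
        })
    return report

# Hypothetical sample: one webhook with a bundle, one with an empty bundle.
fake_pem = "-----BEGIN CERTIFICATE-----\nMIIB\n-----END CERTIFICATE-----\n"
sample = {
    "webhooks": [
        {
            "name": "mutator.example.test",
            "failurePolicy": "Ignore",
            "clientConfig": {
                "caBundle": base64.b64encode(fake_pem.encode()).decode()
            },
        },
        {
            "name": "validator.example.test",
            "failurePolicy": "Fail",
            "clientConfig": {"caBundle": ""},
        },
    ]
}

for row in audit_webhooks(sample):
    print(row)
```

An entry with `failurePolicy` set to `Ignore` and an empty or invalid bundle is the combination most likely to fail silently, because the API server skips the webhook call instead of rejecting the request.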
