Trial job is succeeded but metrics are not reported, reconcile requeued
/kind bug
What steps did you take and what happened: I tried to run the random experiment example through the Katib UI (I also tried creating the experiment with Python, but the same error occurs).
When creating the experiment in the UI, I changed only the trial template (YAML), to this:
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
    katib-metricscollector-injection: enabled
    katib-metrics-collector-injection: enabled
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
        katib-metricscollector-injection: enabled
        katib-metrics-collector-injection: enabled
    spec:
      containers:
        - name: training-container
          image: docker.io/kubeflowkatib/mxnet-mnist:latest
          command:
            - "python3"
            - "/opt/mxnet-mnist/mnist.py"
            - "--batch-size=64"
            - "--lr=${trialParameters.learningRate}"
            - "--num-layers=${trialParameters.numberLayers}"
            - "--optimizer=${trialParameters.optimizer}"
      restartPolicy: Never
After a couple of minutes, the pods created by the job terminated with status Completed, and their logs contained my objective metric, like this:
2022-01-25T20:26:59Z INFO Epoch[9] Train-accuracy=0.993770
2022-01-25T20:26:59Z INFO Epoch[9] Time cost=5.344
2022-01-25T20:26:59Z INFO Epoch[9] Validation-accuracy=0.978802
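To confirm by hand that these log lines contain the metric names the experiment asks for, the `name=value` pairs can be pulled out with a regular expression. This is a minimal sketch in Python; the pattern and the helper `extract_metrics` are illustrative assumptions and not necessarily the filter that Katib's StdOut collector applies.

```python
import re

# Illustrative pattern (an assumption, not necessarily Katib's own filter):
# a name made of word characters or hyphens, "=", then a decimal number.
METRIC_PATTERN = re.compile(r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?)")


def extract_metrics(lines, wanted):
    """Return the last value seen for each metric name in `wanted`."""
    found = {}
    for line in lines:
        for name, value in METRIC_PATTERN.findall(line):
            if name in wanted:
                found[name] = float(value)
    return found


# The three log lines quoted in the report.
log_lines = [
    "2022-01-25T20:26:59Z INFO Epoch[9] Train-accuracy=0.993770",
    "2022-01-25T20:26:59Z INFO Epoch[9] Time cost=5.344",
    "2022-01-25T20:26:59Z INFO Epoch[9] Validation-accuracy=0.978802",
]
metrics = extract_metrics(log_lines, {"Validation-accuracy", "Train-accuracy"})
print(metrics)
```

If both names show up here, the training container is printing the metrics, and the problem is more likely on the collection side than in the training script.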
But the experiment, suggestion, and trials all remain in the Running status, and no new trials are created.
When I check the katib-controller logs, I see the following messages:
{"level":"info","ts":1643142603.5533006,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}
{"level":"info","ts":1643142603.633143,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}
{"level":"info","ts":1643142603.655875,"logger":"trial-controller","msg":"Trial job is succeeded but metrics are not reported, reconcile requeued","Trial":"kubeflow-user-example-com/random-experiment-smw6p6rg"}
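When the controller log is long, it can help to list which trials are hitting this message. The following is a rough Python sketch that assumes the log is available as JSON lines in the format shown above; the helper name `stuck_trials` is hypothetical.

```python
import json

# The message text exactly as it appears in the controller log above.
MESSAGE = "Trial job is succeeded but metrics are not reported, reconcile requeued"


def stuck_trials(log_lines):
    """Return the sorted, de-duplicated trials that logged MESSAGE."""
    trials = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") == MESSAGE and "Trial" in entry:
            trials.add(entry["Trial"])
    return sorted(trials)


# Two of the log lines from the report, plus one non-JSON line.
sample = [
    '{"level":"info","ts":1643142603.5533006,"logger":"trial-controller",'
    '"msg":"Trial job is succeeded but metrics are not reported, reconcile requeued",'
    '"Trial":"kubeflow-user-example-com/random-experiment-vzkjcznm"}',
    '{"level":"info","ts":1643142603.633143,"logger":"trial-controller",'
    '"msg":"Trial job is succeeded but metrics are not reported, reconcile requeued",'
    '"Trial":"kubeflow-user-example-com/random-experiment-c9qr67ww"}',
    "not a json line",
]
result = stuck_trials(sample)
print(result)
```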
Additional Information:
kubectl get experiment random-experiment -o yaml -n kubeflow-user-example-com
Results in:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  creationTimestamp: "2022-01-25T20:25:22Z"
  finalizers:
  - update-prometheus-metrics
  generation: 1
  name: random-experiment
  namespace: kubeflow-user-example-com
  resourceVersion: "126860285"
  uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
spec:
  algorithm:
    algorithmName: random
  maxFailedTrialCount: 3
  maxTrialCount: 12
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parallelTrialCount: 3
  parameters:
  - feasibleSpace:
      max: "0.03"
      min: "0.01"
      step: "0.01"
    name: lr
    parameterType: double
  - feasibleSpace:
      max: "64"
      min: "1"
      step: "1"
    name: num-layers
    parameterType: int
  - feasibleSpace:
      list:
      - sgd
      - adam
      - ftrl
    name: optimizer
    parameterType: categorical
  resumePolicy: LongRunning
  trialTemplate:
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    trialParameters:
    - name: learningRate
      reference: lr
    - name: numberLayers
      reference: num-layers
    - name: optimizer
      reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      metadata:
        annotations:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
        labels:
          katib-metrics-collector-injection: enabled
          katib-metricscollector-injection: enabled
          sidecar.istio.io/inject: "false"
      spec:
        template:
          metadata:
            annotations:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
            labels:
              katib-metrics-collector-injection: enabled
              katib-metricscollector-injection: enabled
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - command:
              - python3
              - /opt/mxnet-mnist/mnist.py
              - --batch-size=64
              - --lr=${trialParameters.learningRate}
              - --num-layers=${trialParameters.numberLayers}
              - --optimizer=${trialParameters.optimizer}
              image: docker.io/kubeflowkatib/mxnet-mnist:latest
              name: training-container
            restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:22Z"
    lastUpdateTime: "2022-01-25T20:25:22Z"
    message: Experiment is created
    reason: ExperimentCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Experiment is running
    reason: ExperimentRunning
    status: "True"
    type: Running
  currentOptimalTrial:
    observation: {}
  runningTrialList:
  - random-experiment-smw6p6rg
  - random-experiment-c9qr67ww
  - random-experiment-vzkjcznm
  startTime: "2022-01-25T20:25:22Z"
  trials: 3
  trialsRunning: 3
and
kubectl get trial random-experiment-c9qr67ww -n kubeflow-user-example-com -o yaml
Results in:
apiVersion: kubeflow.org/v1beta1
kind: Trial
metadata:
  creationTimestamp: "2022-01-25T20:25:44Z"
  finalizers:
  - clean-metrics-in-db
  generation: 1
  labels:
    katib.kubeflow.org/experiment: random-experiment
  name: random-experiment-c9qr67ww
  namespace: kubeflow-user-example-com
  ownerReferences:
  - apiVersion: kubeflow.org/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: Experiment
    name: random-experiment
    uid: 91283c82-46e4-4b8b-9a3a-5cb730ad41d6
  resourceVersion: "126860266"
  uid: 24a7d825-2737-4d6f-8ba8-5e22d776443f
spec:
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  metricsCollector:
    collector:
      kind: StdOut
  objective:
    additionalMetricNames:
    - Train-accuracy
    goal: 0.05
    objectiveMetricName: Validation-accuracy
    type: maximize
  parameterAssignments:
  - name: lr
    value: "0.018768621111940782"
  - name: num-layers
    value: "7"
  - name: optimizer
    value: sgd
  primaryContainerName: training-container
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      annotations:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      labels:
        katib-metrics-collector-injection: enabled
        katib-metricscollector-injection: enabled
        sidecar.istio.io/inject: "false"
      name: random-experiment-c9qr67ww
      namespace: kubeflow-user-example-com
    spec:
      template:
        metadata:
          annotations:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
          labels:
            katib-metrics-collector-injection: enabled
            katib-metricscollector-injection: enabled
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python3
            - /opt/mxnet-mnist/mnist.py
            - --batch-size=64
            - --lr=0.018768621111940782
            - --num-layers=7
            - --optimizer=sgd
            image: docker.io/kubeflowkatib/mxnet-mnist:latest
            name: training-container
          restartPolicy: Never
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
status:
  conditions:
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is created
    reason: TrialCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2022-01-25T20:25:44Z"
    lastUpdateTime: "2022-01-25T20:25:44Z"
    message: Trial is running
    reason: TrialRunning
    status: "True"
    type: Running
  startTime: "2022-01-25T20:25:44Z"
What did you expect to happen: Ideally, once the metrics are captured and the goal/maxTrial is reached, the trial status should change to succeeded.
What am I missing?
Thanks
Issue Analytics
- Created 2 years ago
- Comments: 14 (11 by maintainers)
Top GitHub Comments
@joaquingarciaatos After facing this same issue, I eventually realized the cause: the webhook service listens on port 443, but the webhook pod listens on port 8443, and network traffic to 8443 was blocked in our infrastructure setup. Opening port 8443 on our system nodes solved the problem.
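A quick way to test whether a port such as 8443 is reachable is a plain TCP connection attempt. The Python sketch below is only an approximation of that check: the helper `port_reachable` is hypothetical, and the demo runs against a throwaway local listener rather than a real webhook pod.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Self-contained demo: open a local listener on a free port, probe it,
# then close it and probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_listening = port_reachable("127.0.0.1", demo_port)
listener.close()
reachable_after_close = port_reachable("127.0.0.1", demo_port)

print(reachable_while_listening, reachable_after_close)
```

For the case described here, the call would be something like `port_reachable(<webhook pod IP>, 8443)`, run from a machine subject to the same network rules as the Kubernetes API server, since the API server is the component that calls the webhook. The pod IP is a placeholder you would need to look up yourself.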
Dear @DnPlas:
Could you please clarify what exactly you are referring to? In which pod should I change the failurePolicy to Fail? I have not found a pod where I can set it.
Could you specify a bit more how I can verify that? Thank you in advance.