Trial pods are completed but never marked successful or reused; metrics are not shown
/kind bug
What steps did you take and what happened:
I tried to run the Hyperparameter Tuning v1beta1 examples from the official Katib GitHub repository: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/hp-tuning. The only thing I changed was the repository name (from kubeflow to joaquin-garcia), and I tried with sidecar injection both enabled and disabled (our cluster uses Istio), as detailed in Step 3 of https://www.kubeflow.org/docs/components/katib/hyperparameter/ .
The problem is that each pod executes one trial (one combination of parameters), and the trial is marked as completed but never as successful (neither in the terminal nor in the UI), so the goal of the tool is not reached. I have checked that the algorithm runs in each pod, since the epochs and metrics are printed in the terminal, but nothing is shown in the UI.
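One thing that can be ruled out when metrics appear in the terminal but not in the UI is a mismatch between the printed lines and what the metrics collector parses. The sketch below assumes the experiment uses the StdOut metrics collector with its default filter, which, as far as I understand the Katib documentation, accepts `name=value` pairs; the regular expression is reproduced from memory of those docs and should be checked against the installed version. The function name `extract_metrics` and the sample log lines are hypothetical.

```python
import re

# Assumed default filter of Katib's StdOut metrics collector. An experiment can
# override it under metricsCollectorSpec.source.filter.metricsFormat.
DEFAULT_METRICS_FORMAT = re.compile(
    r"([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"
)


def extract_metrics(lines, metric_names):
    """Return (name, value) pairs for log lines the assumed filter would accept."""
    found = []
    for line in lines:
        for match in DEFAULT_METRICS_FORMAT.finditer(line):
            name, value = match.group(1), match.group(2)
            if name not in metric_names:
                continue
            try:
                found.append((name, float(value)))
            except ValueError:
                # The pattern can match an empty or partial number; skip those.
                continue
    return found


if __name__ == "__main__":
    # Hypothetical training output: only the first line uses name=value.
    sample = [
        "Epoch[0] Validation-accuracy=0.9512",
        "Epoch 1, loss: 0.25",
    ]
    print(extract_metrics(sample, {"Validation-accuracy", "loss"}))
```

If the metric named in the experiment's objective never shows up in the result for real pod logs, the collector would have nothing to report even though the training output looks healthy. If it does show up, the cause is more likely elsewhere, for example in the sidecar injection or the connection to the DB manager.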
What did you expect to happen: I expected each pod to be rerun with a different combination of values for each of the parameters under study / tuning.
Anything else you would like to add:
- The katib-ui pod shows “Trial random-<pod_number> has no pipeline run.”
- The UI always shows these values:
Environment:
- Katib version (check the Katib controller image version): 0.13.0
- Kubernetes version (kubectl version): Client v1.25.0 | Server v1.21.13
- OS (uname -a): Linux microsoft-standard-WSL2 x86_64 x86_64 x86_64 GNU/Linux
Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍
Issue Analytics
- State:
- Created a year ago
- Reactions:2
- Comments:5 (3 by maintainers)
Top GitHub Comments
Sorry for the late reply. Is this a fresh installation? Could there be stale webhook configurations?
/cc @tenzen-y
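A rough way to look for stale webhook configurations is to dump them with `kubectl get mutatingwebhookconfigurations -o json` (and likewise for `validatingwebhookconfigurations`) and list the entries that point at Katib. The sketch below only filters that JSON; the helper name `katib_webhooks` and the sample data are hypothetical, and the field layout assumes the `admissionregistration.k8s.io/v1` list format.

```python
def katib_webhooks(config_list):
    """List (configuration, webhook, namespace/service) triples mentioning Katib."""
    rows = []
    for item in config_list.get("items", []):
        config_name = item.get("metadata", {}).get("name", "")
        for hook in item.get("webhooks", []):
            # clientConfig.service is absent when the webhook uses a URL instead.
            service = hook.get("clientConfig", {}).get("service") or {}
            target = f"{service.get('namespace', '?')}/{service.get('name', '?')}"
            haystack = " ".join([config_name, hook.get("name", ""), target])
            if "katib" in haystack.lower():
                rows.append((config_name, hook.get("name", ""), target))
    return rows


if __name__ == "__main__":
    # Illustrative data only; real input would come from the kubectl JSON dump.
    sample = {
        "items": [
            {
                "metadata": {"name": "example-katib-config"},
                "webhooks": [
                    {
                        "name": "example.katib.webhook",
                        "clientConfig": {
                            "service": {"namespace": "kubeflow", "name": "katib-controller"}
                        },
                    }
                ],
            }
        ]
    }
    for row in katib_webhooks(sample):
        print(*row, sep="\t")
```

An entry whose service namespace or name no longer matches the running katib-controller deployment would be a candidate for a leftover from a previous installation.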
Dear @johnugeorge, thank you very much for your reply. I have checked the two points of your comment:
So even if the first pings failed, I understand everything is fine with katib-db-manager, right?
metrics server. From its logs, it seems that the TCP connection cannot be created, so maybe that is why I cannot see the metrics and the trials do not advance with new parameter values. This is the log I get:
Btw, when checking the logs of the katib-controller pod, I got the following error:
I think my error originates in the way the certificates are generated, but I am also not sure how to solve it.