
Trial Pods complete but are never marked successful or reused, and metrics are not shown

See original GitHub issue

/kind bug

What steps did you take and what happened:

I have tried to run the Hyperparameter Tuning v1beta1 examples from the official Katib GitHub repository: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/hp-tuning. The only thing I changed was the repository name (from kubeflow to joaquin-garcia), and I tried both enabling and disabling sidecar injection (our cluster uses istio), as detailed in Step 3 of https://www.kubeflow.org/docs/components/katib/hyperparameter/.
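For reference, the sidecar toggle the docs describe is the standard istio annotation on the trial pod template. A minimal sketch for checking it is below; the experiment name "random" and the "kubeflow" namespace are assumptions, not values taken from this report:

# Sketch: verify the istio injection annotation on the trial pod template.
# Experiment name "random" and namespace "kubeflow" are assumptions.
kubectl -n kubeflow get experiment random \
  -o jsonpath='{.spec.trialTemplate.trialSpec.spec.template.metadata.annotations}'
# For a batch/v1 Job trial template, the expected annotation is:
#   sidecar.istio.io/inject: "false"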

The problem is that each pod executes one Trial (one combination of parameters), and the trial is marked as Completed but never as Succeeded (neither in the terminal nor in the UI), so the goal of the tool is not reached. I have checked that the algorithm does run in each pod, as the epochs and metrics appear in the terminal output, but nothing is shown in the UI.
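One way to see why a Trial ends as Completed rather than Succeeded is to read its status conditions; the reason (for example, a metrics-collection failure) is usually spelled out there. A minimal sketch, assuming the "kubeflow" namespace:

# Sketch: inspect Experiment and Trial status; the namespace is an assumption.
kubectl -n kubeflow get experiments
kubectl -n kubeflow get trials
# The Conditions section of the output usually names the failure reason:
kubectl -n kubeflow describe trial <trial-name>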

What did you expect to happen: I expected each pod to be rerun with a different combination of values for the parameters being tuned.

Anything else you would like to add:

  • The katib-ui pod logs show “Trial random-<pod_number> has no pipeline run.”
  • The UI always shows these same values: UI_Trials (how the collected metrics can be checked directly is sketched just after this list).
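Since the metrics never reach the UI, it is also worth checking whether Katib's metrics collector sidecar ever reported anything from a trial pod. A minimal sketch, assuming the default StdOut collector; the sidecar container name below is Katib's conventional default and should be verified against your pods:

# Sketch: list the containers in a trial pod, then read the collector's logs.
# The container name is Katib's default for the StdOut collector (an
# assumption here); the pod name is a placeholder.
kubectl -n kubeflow get pod <trial-pod> -o jsonpath='{.spec.containers[*].name}'
kubectl -n kubeflow logs <trial-pod> -c metrics-logger-and-collector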

Environment:

  • Katib version (check the Katib controller image version): 0.13.0
  • Kubernetes version (kubectl version): Client v1.25.0 | Server v1.21.13
  • OS (uname -a): Linux microsoft-standard-WSL2 x86_64 x86_64 x86_64 GNU/Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

Issue Analytics

  • State: open
  • Created: a year ago
  • Reactions: 2
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
johnugeorge commented, Sep 19, 2022

Sorry for the late reply. Is it a fresh installation? Are there stale webhook configurations?

/cc @tenzen-y
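For anyone following along, stale webhook configurations can be checked directly; a sketch under the assumption that the install uses Katib's default object name katib.kubeflow.org:

# Sketch: look for leftover Katib webhook configurations from an older install.
# "katib.kubeflow.org" is Katib's default name and an assumption here.
kubectl get mutatingwebhookconfiguration katib.kubeflow.org -o yaml
kubectl get validatingwebhookconfiguration katib.kubeflow.org -o yaml
# Compare each caBundle against the controller's current serving cert; on a
# reinstall, delete stale objects before reapplying the Katib manifests.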

1 reaction
joaquingarciaatos commented, Sep 12, 2022

Can you check #1795 (comment)?

Dear @johnugeorge, thank you very much for your reply. I have checked the two points of your comment:

  1. The status of the katib-db-manager pod is “Running”, and I get the following logs:
I0909 12:47:11.994286       1 db.go:32] Using MySQL
E0909 12:47:18.008245       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:23.000273       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:28.024249       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:33.016374       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:38.008362       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
I0909 12:47:42.010038       1 init.go:27] Initializing v1beta1 DB schema
I0909 12:47:42.028309       1 main.go:113] Start Katib manager: 0.0.0.0:6789

So even though the first pings failed, I understand everything is fine with katib-db-manager, right? (A quick way to double-check this is sketched after point 2 below.)

  2. I am not sure if you are referring to the metrics-server pod. Checking its logs, it seems the TCP connection cannot be established, so maybe that is why I cannot see the metrics and the trials do not advance with new parameter values. This is the log I get:
I0708 11:32:57.277833       1 serving.go:341] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0708 11:32:57.809105       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0708 11:32:57.809141       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0708 11:32:57.809177       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809211       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809341       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809349       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809392       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0708 11:32:57.809557       1 secure_serving.go:197] Serving securely on :443
I0708 11:32:57.809703       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0708 11:32:57.909691       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.909778       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.909810       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
E0719 12:49:12.795252       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 12:49:27.794191       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 12:58:57.797287       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 12:59:12.789134       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 13:08:12.803248       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 13:08:27.799399       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refused" node="vtss067"
E0719 13:08:56.288725       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:11.288884       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:26.288453       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:41.289342       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:56.287960       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: i/o timeout" node="vtss067"
E0719 13:10:11.289028       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
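To double-check point 1 above, the DB side can be probed directly; a minimal sketch, assuming the default component names and the "kubeflow" namespace:

# Sketch: confirm katib-mysql is up and katib-db-manager settled after the
# initial failed pings; names and namespace are Katib defaults (assumed).
kubectl -n kubeflow get pods | grep -E 'katib-(mysql|db-manager)'
kubectl -n kubeflow logs deploy/katib-db-manager --tail=20
# A healthy db-manager stops logging "Ping to Katib db failed" once MySQL is up.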

By the way, when checking the logs of the katib-controller pod, I got the following error:

2022/09/09 13:16:23 http: TLS handshake error from 10.42.0.0:59926: remote error: tls: bad certificate

I think my error has its origin in the way the certificates are generated, but I am not sure how to solve it either.
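If the certificates are indeed the culprit, one commonly suggested remedy is to force the webhook certificate to be regenerated and restart the controller. A sketch under the assumption that the install uses Katib's default secret and deployment names; verify them before deleting anything:

# Sketch: force Katib to regenerate the webhook serving certificate.
# Secret and deployment names are Katib defaults and assumptions here.
kubectl -n kubeflow delete secret katib-webhook-cert
kubectl -n kubeflow rollout restart deployment katib-controller
# Afterwards, confirm the webhook configurations' caBundle matches the new cert.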

Read more comments on GitHub >
