
katib job failed to reach katib-db-manager to report metrics

See original GitHub issue

/kind bug

What steps did you take and what happened: Hi there,

I was running this example on Kubeflow 1.3 rc0 from Pipelines. However, the Katib job did not succeed: one pod showed an error, even though all trials reported a Succeeded status. Its log shows:

kubectl logs -f median-stop-665dv8gm-d4p9l -n kubeflow-user -c metrics-logger-and-collector

...

F0419 03:30:52.909001      30 main.go:397] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.108.188.143:6789: i/o timeout"

and that IP (10.108.188.143) belongs to katib-db-manager:

kubeflow           katib-db-manager                                            ClusterIP      10.108.188.143   <none>                                                 6789/TCP
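A first thing to check (a sketch; the names and namespaces below are taken from this issue, so adjust them if yours differ) is whether katib-db-manager is actually healthy and has endpoints behind that ClusterIP:

# Is the DB manager pod running and ready?
kubectl get pods -n kubeflow | grep katib-db-manager

# Does the Service have pod endpoints behind the ClusterIP?
kubectl get endpoints katib-db-manager -n kubeflow

# Any errors on the DB manager side (for example, trouble reaching its database)?
kubectl logs -n kubeflow deployment/katib-db-manager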


What did you expect to happen: The median-stop job should simply complete successfully.

Anything else you would like to add:

I also noticed that there is a service in the user namespace that serves port 6789:

median-stop-random                                        ClusterIP      10.101.57.222    <none>                                                 6789/TCP,6788/TCP

I'm not sure which service the Katib job is supposed to talk to; once I know that, I can dig deeper.
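One way to check which address the collector actually dials (a sketch; the pod and container names are the ones from the log command above) is to dump the sidecar's arguments from the Trial pod spec and look for the DB Manager address among them:

# Print the args injected into the metrics-logger-and-collector sidecar.
kubectl get pod median-stop-665dv8gm-d4p9l -n kubeflow-user \
  -o jsonpath='{.spec.containers[?(@.name=="metrics-logger-and-collector")].args}'

The log above already dials 10.108.188.143, which matches the katib-db-manager ClusterIP in the kubeflow namespace, so this should confirm that it is that service, not median-stop-random, that the sidecar is trying to reach.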

Environment:

  • Katib: v0.11.0
  • Kubernetes version: 1.20.1
  • Kubeflow bundle: v1.3+rc0

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
andreyvelich commented, Apr 19, 2021

@hsinhoyeh From the logs above it seems that the Trial’s pod can’t access the Katib DB Manager. Was your Katib DB Manager pod always running during the Experiment run?
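One way to answer that question (a sketch, with the resource names as in this issue; the probe pod below is hypothetical) is to look for restarts on the DB manager and to try the gRPC port directly from the user namespace:

# Any restarts or crash loops on the DB manager during the Experiment?
kubectl get pods -n kubeflow | grep katib-db-manager

# TCP probe of the DB manager port from the kubeflow-user namespace.
# "Connected to ..." in the -v output means the route works; a connect
# timeout here points at a NetworkPolicy or service-mesh restriction
# rather than at the DB manager itself.
kubectl run db-probe --rm -it --restart=Never -n kubeflow-user \
  --image=curlimages/curl --command -- \
  curl -v -m 5 telnet://katib-db-manager.kubeflow.svc.cluster.local:6789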

0 reactions
stale[bot] commented, Aug 21, 2021

This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.

Read more comments on GitHub >

Top Results From Across the Web

Metrics not reporting to Katib server - experiment timing out
I want to use Katib to tune the hyperparameters using Python (not by applying a YAML file). The problem is that I can't...

Hyperparameter Tuning - D2iQ Docs
Katib automates the process of hyperparameter tuning by running a pre-configured number of training jobs (known as trials) in parallel. Each trial evaluates...

Katib Configuration Overview - Kubeflow
This guide describes the Katib config — the Kubernetes ConfigMap that contains information about the current metrics collectors, ...

How Katib tunes hyperparameters automatically in a ... - Medium
Now Katib can automatically collect the metrics via a metrics collector sidecar container.

kubeflowkatib/file-metrics-collector:v1beta1-5353cb5
Digest: sha256:580f073c0b45c01af4c018f46b15a38e45954f714ee901d4715cc8d84111fcf6. OS/ARCH: linux/amd64.
