katib job failed to reach katib-db-manager to report metrics

/kind bug

What steps did you take and what happened:

I was running this example on Kubeflow 1.3 rc0 from Pipelines. However, the Katib job did not succeed: one pod showed an error, even though every trial reported a Succeeded status. The pod's log shows:

kubectl logs -f median-stop-665dv8gm-d4p9l -n kubeflow-user -c metrics-logger-and-collector

...

F0419 03:30:52.909001      30 main.go:397] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.108.188.143:6789: i/o timeout"

The IP address in the error (10.108.188.143) belongs to katib-db-manager:

kubeflow           katib-db-manager                                            ClusterIP      10.108.188.143   <none>                                                 6789/TCP
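
As a rough illustration of the matching done by hand above, the dial target can be pulled out of the log line and compared against the ClusterIPs from the service listings. This is only a sketch: the helper names `dial_target` and `owner_of` are hypothetical, and the `kubeflow-user` namespace assigned to `median-stop-random` is assumed from the `-n kubeflow-user` flag in the `kubectl logs` command.

```python
import re

# Log line copied from the metrics-logger-and-collector container output.
LOG_LINE = (
    'F0419 03:30:52.909001      30 main.go:397] Failed to Report logs: '
    'rpc error: code = Unavailable desc = connection error: desc = '
    '"transport: Error while dialing dial tcp 10.108.188.143:6789: i/o timeout"'
)

# ClusterIPs copied from the service listings in this issue.
# The namespace of median-stop-random is an assumption (see the lead-in).
SERVICES = {
    "kubeflow/katib-db-manager": "10.108.188.143",
    "kubeflow-user/median-stop-random": "10.101.57.222",
}

DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dial_target(line):
    """Return (ip, port) from a 'dial tcp IP:PORT' fragment, or None."""
    match = DIAL_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def owner_of(ip, services):
    """Return the service name whose ClusterIP equals ip, or None."""
    for name, cluster_ip in services.items():
        if cluster_ip == ip:
            return name
    return None
```

Calling `owner_of(dial_target(LOG_LINE)[0], SERVICES)` on the data above points at the `katib-db-manager` entry rather than the per-experiment service.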

What did you expect to happen: The median-stop job should complete successfully.

Anything else you would like to add:

I also noticed a service in the user namespace that serves port 6789 as well:

median-stop-random                                        ClusterIP      10.101.57.222    <none>                                                 6789/TCP,6788/TCP

I'm not sure which Katib service the job is supposed to talk to. Once I know, I can dig further.

Environment:

Katib: v0.11.0
Kubernetes version: 1.20.1
Kubeflow bundle: v1.3+rc0

Top GitHub Comments

andreyvelich commented, Apr 19, 2021

@hsinhoyeh From the logs above, it seems that the Trial's pod can't access the Katib DB Manager. Was your Katib DB Manager pod running throughout the Experiment run?
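
One way to narrow this down might be a plain TCP reachability check, run from a pod in the same namespace as the Trial. The sketch below uses only the Python standard library. The function name `can_reach` is hypothetical, and it tests only whether a TCP connection can be opened, not whether the gRPC service behind it is healthy.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False
```

For example, `can_reach("10.108.188.143", 6789)` probes the ClusterIP from the log above. `can_reach("katib-db-manager.kubeflow", 6789)` probes the same service by name, assuming the cluster's standard `<service>.<namespace>` DNS form resolves. A `False` result while the DB Manager pod is running would point toward a networking or policy problem between the namespaces rather than a crashed pod.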

stale[bot] commented, Aug 21, 2021

This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.
