Katib job failed to reach katib-db-manager to report metrics
/kind bug
What steps did you take and what happened:
Hi there,
I was running this example on Kubeflow 1.3 rc0 from Pipelines. However, the Katib job did not succeed: one pod shows an error, even though all trials report a succeeded status. The pod's log shows:
kubectl logs -f median-stop-665dv8gm-d4p9l -n kubeflow-user -c metrics-logger-and-collector
...
F0419 03:30:52.909001 30 main.go:397] Failed to Report logs: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.108.188.143:6789: i/o timeout"
The IP in that error (10.108.188.143) belongs to the katib-db-manager service:
kubeflow katib-db-manager ClusterIP 10.108.188.143 <none> 6789/TCP
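To check whether the timeout is a plain network reachability problem rather than something specific to the metrics collector, a TCP probe can be run from a pod in the same namespace as the trial. The following is only a minimal sketch: the helper name `can_connect` is hypothetical, and it assumes Python is available in the pod used for the test.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False
```

Calling `can_connect("10.108.188.143", 6789)` from the user namespace probes the same address and port that appear in the log line above. A `False` result would suggest the problem sits at the network level (for example, a policy or a db-manager pod that is not running) rather than in the trial itself.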
What did you expect to happen:
The median-stop job should just work.
Anything else you would like to add:
I also noticed that there is a service in the user namespace that also serves port 6789:
median-stop-random ClusterIP 10.101.57.222 <none> 6789/TCP,6788/TCP
I am not sure which service the Katib job is supposed to talk to. Once I know that, I can dig further.
Environment:
- Katib: v0.11.0
- Kubernetes version: 1.20.1
- Kubeflow bundle: v1.3+rc0
Issue Analytics
- State:
- Created 2 years ago
- Comments: 5 (2 by maintainers)
Top GitHub Comments
@hsinhoyeh From the logs above, it seems that the Trial’s pod can’t access the Katib DB Manager. Was your Katib DB Manager pod running throughout the Experiment run?
This issue has been automatically closed because it has not had recent activity. Please comment “/reopen” to reopen it.