Standalone Katib deployment: creating a new experiment fails due to the MutatingWebhook timing out
/kind bug
Hi, I am trying to set up a standalone deployment of Katib v1beta1 on GKE. It is very possible I am doing something totally wrong here or missing something obvious, which is why I'm coming to you for help.
TL;DR: I have tried deploying Katib via the deploy.sh script and also via Terraform. No matter how I deploy, everything seems fine: the katib-controller, katib-db-manager, katib-mysql and katib-ui pods are all up and running, and the logs look clean. Then I try to submit one of the sample experiments and I get this timeout error:
$ kubectl apply -f examples/v1beta1/grid-example.yaml
Error from server (InternalError): error when creating "examples/v1beta1/grid-example.yaml":
Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org":
Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: context deadline exceeded
What I’ve tried:
I followed all the debugging steps in #1160 (the closest to this issue, AFAIK) and didn't really get anywhere. The webhook itself seems to be set up (admittedly, I can't say whether it's correct or not):
$ kubectl describe MutatingWebhookConfiguration katib-mutating-webhook-config
Name: katib-mutating-webhook-config
Namespace:
Labels: <none>
Annotations: <none>
API Version: admissionregistration.k8s.io/v1beta1
Kind: MutatingWebhookConfiguration
Metadata:
  Creation Timestamp: 2020-07-09T00:24:58Z
  Generation: 1
  Resource Version: 17496
  Self Link: /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/katib-mutating-webhook-config
  UID: 9eaca06d-c17a-11ea-ac7b-42010a000066
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle: <hidden>
    Service:
      Name: katib-controller
      Namespace: kubeflow
      Path: /mutate-experiments
  Failure Policy: Fail
  Name: mutating.experiment.katib.kubeflow.org
  Namespace Selector:
    Match Expressions:
      Key: control-plane
      Operator: DoesNotExist
  Rules:
    API Groups:
      kubeflow.org
    API Versions:
      v1beta1
    Operations:
      CREATE
      UPDATE
    Resources:
      experiments
    Scope: *
  Side Effects: Unknown
  Timeout Seconds: 30
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle: <hidden>
    Service:
      Name: katib-controller
      Namespace: kubeflow
      Path: /mutate-pods
  Failure Policy: Ignore
  Name: mutating.pod.katib.kubeflow.org
  Namespace Selector:
    Match Labels:
      Katib - Metricscollector - Injection: enabled
  Rules:
    API Groups:
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
    Scope: *
  Side Effects: Unknown
  Timeout Seconds: 30
Events: <none>
I've tried multiple Kubernetes versions, from 1.14.10-gke.45 to 1.16.9-gke.6.
Some things I have not tried:
- installing via the GCP kubeflow script (the whole point is to get a standalone katib deployment)
- installing v1alpha3 (not against it, but ideally we want the newest version)
Any and all help is appreciated! 😄
Top GitHub Comments
Hey, thanks for your help! We are using a private GKE cluster with a VPC. Creating a firewall rule that allows TCP traffic on port 8443, with the control plane (master) CIDR as the source range, fixed the problem 👍
Might be a good idea to document this somewhere. Could save somebody else a lot of time 🤷‍♂️
@kylepad If you are using a private GKE cluster (you can check at https://console.cloud.google.com/kubernetes by clicking on your cluster), it might be the same problem as in https://github.com/kubernetes/kubernetes/issues/79739.
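For anyone hitting the same timeout, the firewall rule described in the comments above could look roughly like the sketch below. It is not taken from the issue itself. The rule name, network name, CIDR and node tag are made-up placeholders, and the helper function `build_webhook_firewall_cmd` is hypothetical. The sketch only prints the `gcloud compute firewall-rules create` command so it can be reviewed before anything is changed. The background, as far as I understand it, is that a private GKE cluster only lets the control plane reach the nodes on a small set of ports by default, so a webhook backend listening on 8443 is unreachable until a rule like this exists.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build (but do not run) the gcloud command for a firewall rule that lets the
# GKE control plane reach port 8443 on the cluster nodes.
# All four arguments are placeholders supplied by the caller.
build_webhook_firewall_cmd() {
  local rule_name="$1"    # hypothetical rule name
  local network="$2"      # VPC network the cluster runs in
  local master_cidr="$3"  # control plane (master) CIDR of the private cluster
  local node_tag="$4"     # network tag carried by the cluster's nodes
  printf 'gcloud compute firewall-rules create %s' "$rule_name"
  printf ' --network=%s --direction=INGRESS --action=ALLOW' "$network"
  printf ' --rules=tcp:8443 --source-ranges=%s --target-tags=%s\n' "$master_cidr" "$node_tag"
}

# The control plane CIDR can be looked up with (CLUSTER and ZONE are placeholders):
#   gcloud container clusters describe CLUSTER --zone ZONE \
#     --format='value(privateClusterConfig.masterIpv4CidrBlock)'

# Example values are invented; substitute your own and review the printed command.
build_webhook_firewall_cmd allow-master-to-katib-webhook my-vpc 172.16.0.0/28 gke-my-cluster-node
```

Once the rule is in place, re-running `kubectl apply -f examples/v1beta1/grid-example.yaml` should show whether the webhook call still times out.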