Standalone Katib deployment: creating a new experiment fails due to the MutatingWebhook timing out

/kind bug

Hi, I am trying to set up a standalone deployment of Katib v1beta1 on GKE. It is very possible I am doing something totally wrong here or missing something obvious, which is why I'm coming to you for help.

TL;DR: I have tried deploying Katib via the deploy.sh script and also via Terraform. Either way, everything seems fine: the katib-controller, katib-db-manager, katib-mysql and katib-ui pods are all up and running, and the logs look clean. But when I submit one of the sample experiments, I get this timeout error:

$ kubectl apply -f examples/v1beta1/grid-example.yaml 
    Error from server (InternalError): error when creating "examples/v1beta1/grid-example.yaml": 
    Internal error occurred: failed calling webhook "mutating.experiment.katib.kubeflow.org": 
    Post https://katib-controller.kubeflow.svc:443/mutate-experiments?timeout=30s: context deadline exceeded

What I’ve tried:

I followed all the debugging steps in #1160 (the closest issue to this one, AFAIK) and didn't really get anywhere. The webhook itself seems to be set up (admittedly I can't say whether it's correct):

$ kubectl describe MutatingWebhookConfiguration katib-mutating-webhook-config
Name:         katib-mutating-webhook-config
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  admissionregistration.k8s.io/v1beta1
Kind:         MutatingWebhookConfiguration
Metadata:
  Creation Timestamp:  2020-07-09T00:24:58Z
  Generation:          1
  Resource Version:    17496
  Self Link:           /apis/admissionregistration.k8s.io/v1beta1/mutatingwebhookconfigurations/katib-mutating-webhook-config
  UID:                 9eaca06d-c17a-11ea-ac7b-42010a000066
Webhooks:
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle:  <hidden>
    Service:
      Name:        katib-controller
      Namespace:   kubeflow
      Path:        /mutate-experiments
  Failure Policy:  Fail
  Name:            mutating.experiment.katib.kubeflow.org
  Namespace Selector:
    Match Expressions:
      Key:       control-plane
      Operator:  DoesNotExist
  Rules:
    API Groups:
      kubeflow.org
    API Versions:
      v1beta1
    Operations:
      CREATE
      UPDATE
    Resources:
      experiments
    Scope:          *
  Side Effects:     Unknown
  Timeout Seconds:  30
  Admission Review Versions:
    v1beta1
  Client Config:
    Ca Bundle: <hidden>
    Service:
      Name:        katib-controller
      Namespace:   kubeflow
      Path:        /mutate-pods
  Failure Policy:  Ignore
  Name:            mutating.pod.katib.kubeflow.org
  Namespace Selector:
    Match Labels:
      Katib - Metricscollector - Injection:  enabled
  Rules:
    API Groups:
      
    API Versions:
      v1
    Operations:
      CREATE
    Resources:
      pods
    Scope:          *
  Side Effects:     Unknown
  Timeout Seconds:  30
Events:             <none>

I've tried multiple Kubernetes versions, from 1.14.10-gke.45 to 1.16.9-gke.6.

Some things I have not tried:

  • installing via the GCP Kubeflow script (the whole point is to get a standalone Katib deployment)
  • installing v1alpha3 (not against it, but ideally we want the newest version)

Any and all help is appreciated! 😄

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

4 reactions
kylepad commented, Jul 13, 2020

Hey, thanks for your help! We are using a private GKE cluster with a VPC. Creating a firewall rule that allows TCP traffic on port 8443, with the control plane (master) CIDR as the source range, fixed the problem 👍

$ kubectl apply -f examples/v1beta1/grid-example.yaml 
experiment.kubeflow.org/grid-example created

Might be a good idea to document this somewhere. Could save somebody else a lot of time 🤷‍♂️
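
For readers hitting the same timeout, the firewall rule described in this comment might look roughly like the sketch below. The rule name, network, node tag and CIDR are placeholders (assumptions, not values from this issue), so substitute the ones for your own cluster. The script only prints the command by default; set APPLY=1 to actually run it.

```shell
#!/bin/sh
# Sketch only: every value below is a hypothetical placeholder,
# none of them comes from the original issue.
RULE_NAME="${RULE_NAME:-allow-master-to-katib-webhook}"
NETWORK="${NETWORK:-my-vpc}"                 # VPC network the cluster lives in
MASTER_CIDR="${MASTER_CIDR:-172.16.0.0/28}"  # control plane (master) CIDR block
NODE_TAG="${NODE_TAG:-my-gke-node-tag}"      # network tag applied to the nodes

# Print the command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

# Allow the control plane to reach the nodes on TCP 8443,
# the port named in the comment above.
create_rule() {
  run gcloud compute firewall-rules create "$RULE_NAME" \
    --network="$NETWORK" \
    --direction=INGRESS \
    --allow=tcp:8443 \
    --source-ranges="$MASTER_CIDR" \
    --target-tags="$NODE_TAG"
}

create_rule
```

Restricting the source range to the control plane CIDR and the target to the node tag keeps the rule as narrow as the fix requires, rather than opening port 8443 to the whole VPC.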

0 reactions
andreyvelich commented, Jul 9, 2020

@kylepad If you are using a private GKE cluster (you can check at https://console.cloud.google.com/kubernetes by clicking on your cluster), it might be the same problem as https://github.com/kubernetes/kubernetes/issues/79739.
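
As an alternative to the console, the same check could be sketched from the command line. The cluster name and zone below are hypothetical placeholders; the Service name and namespace are the ones shown in the webhook configuration in the issue. The script prints the commands by default and runs them only when APPLY=1 is set.

```shell
#!/bin/sh
# Sketch only: CLUSTER and ZONE are hypothetical placeholders.
CLUSTER="${CLUSTER:-my-cluster}"
ZONE="${ZONE:-us-central1-a}"

# Print the command; execute it only when APPLY=1 is set.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

# Show whether private nodes are enabled and, if so, the control plane CIDR.
check_private_cluster() {
  run gcloud container clusters describe "$CLUSTER" --zone "$ZONE" \
    --format="value(privateClusterConfig.enablePrivateNodes,privateClusterConfig.masterIpv4CidrBlock)"
}

# Show which container port the webhook Service forwards to.
check_webhook_port() {
  run kubectl get svc katib-controller -n kubeflow \
    -o jsonpath='{.spec.ports[*].targetPort}'
}

check_private_cluster
check_webhook_port
```

If the first command reports private nodes and the second reports a port that the control plane is not allowed to reach, a missing firewall rule is a plausible cause of the webhook timeout.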
