Uber Issue: KFServing admission hook causing widespread issues because it's a global admission hook

/kind bug

We are getting many reports of problems caused by the KFServing admission hook being unavailable, which prevents pods from being created. The error message looks like the following:

4m58s       Warning   FailedCreate                   replicaset/activator-5484756f7b          Error creating: Internal error occurred: failed calling webhook "inferenceservice.kfserving-webhook-server.pod-mutator": Post https://kfserving-webhook-server-service.kubeflow.svc:443/mutate-pods?timeout=30s: service "kfserving-webhook-server-service" not found

Here’s my understanding:

  • Currently, admission hooks cannot be scoped by label, so a pod admission hook is applied to all pods.

  • The KFServing admission hook is therefore called for every pod; the hook itself checks whether the pod belongs to a KFServing resource and, if it does, mutates it.

  • However, if the KFServing webhook deployment is unavailable, pod creation can be blocked.

  • For a variety of reasons we are reaching a deadlock state where:

    • The webhook is defined but the deployment for the hook is not, so calls to the admission hook fail.
    • Pod creation now fails because the webhook service cannot be reached.
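
The deadlock described above can be illustrated with a toy model. This is a minimal sketch and not Kubernetes code: the `ToyCluster` class and its fields are hypothetical, and it assumes the webhook is registered with a failure policy that rejects requests when the webhook cannot be reached, which is what the error message above suggests.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model of pod admission; hypothetical, not real Kubernetes code."""

    webhook_registered: bool = False  # the webhook configuration exists
    running_pods: set = field(default_factory=set)

    def webhook_backend_up(self) -> bool:
        # The webhook can only answer if its own server pod is running.
        return "kfserving-webhook-server" in self.running_pods

    def create_pod(self, name: str) -> bool:
        # A global pod webhook is consulted for every pod, including the
        # pod that is supposed to serve the webhook itself.
        if self.webhook_registered and not self.webhook_backend_up():
            return False  # assumed: unreachable webhook rejects the request
        self.running_pods.add(name)
        return True


cluster = ToyCluster(webhook_registered=True)
print(cluster.create_pod("activator"))                 # unrelated pod is blocked
print(cluster.create_pod("kfserving-webhook-server"))  # so is the webhook's own pod
```

In this model both calls are rejected: once the webhook is registered without a running backend, nothing can be created, including the pod that would bring the backend up.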

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 43 (17 by maintainers)

Top GitHub Comments

5 reactions
yuzisun commented, Jul 28, 2020

@maganaluis We need to use an object selector on the mutating webhook configuration so that only KFServing-labelled pods go through the KFServing pod mutator. The problem is that the object selector is only supported in Kubernetes 1.15+, while Kubeflow’s minimum requirement is still Kubernetes 1.14. If you are on Kubernetes 1.15+, you can use the following command to solve the issue.

kubectl patch mutatingwebhookconfiguration inferenceservice.serving.kubeflow.org --patch '{"webhooks":[{"name": "inferenceservice.kfserving-webhook-server.pod-mutator","objectSelector":{"matchExpressions":[{"key":"serving.kubeflow.org/inferenceservice", "operator": "Exists"}]}}]}'
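
To see why this patch narrows the webhook to KFServing pods, here is a minimal sketch of how a `matchExpressions` selector is evaluated against an object's labels. The `matches` function is a hypothetical re-implementation for illustration, not the API server's code, and the example pod labels are made up.

```python
def matches(expressions, labels):
    """Return True if `labels` satisfy every matchExpressions entry."""
    for expr in expressions:
        key, op = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if op == "Exists":
            ok = key in labels
        elif op == "DoesNotExist":
            ok = key not in labels
        elif op == "In":
            ok = labels.get(key) in values
        elif op == "NotIn":
            # A missing key also satisfies NotIn.
            ok = labels.get(key) not in values
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


# The objectSelector from the patch above.
selector = [{"key": "serving.kubeflow.org/inferenceservice", "operator": "Exists"}]

# Hypothetical label sets for two pods.
kfserving_pod = {"serving.kubeflow.org/inferenceservice": "my-model"}
activator_pod = {"app": "activator"}

print(matches(selector, kfserving_pod))  # the webhook is called for this pod
print(matches(selector, activator_pod))  # this pod skips the webhook
```

With the selector in place, pods without the `serving.kubeflow.org/inferenceservice` label never reach the webhook, so an unavailable webhook server no longer blocks them.
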

5 reactions
jlewi commented, Nov 24, 2019

Possible fixes

  1. Add the label control-plane to the kubeflow namespace

    kubectl label namespace kubeflow control-plane=true

  2. Change the namespaceSelector to be opt-in, so that it matches only namespaces with specific labels

    • This won't work on its own: the controller creates the webhook, so the changes will be overwritten when the controller restarts

Ref: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-namespaceselector

Possible Workarounds

  • Add the label control-plane to the kubeflow namespace
  • Update the inferenceservice webhook to make the namespace selector opt-in.

A possible recipe

  1. Get the inferenceservice webhook configuration

    kubectl -n kubeflow get MutatingWebhookConfiguration inferenceservice.serving.kubeflow.org -o yaml > /tmp/inferenceservice.yaml
    
  2. Change the namespaceSelector

    namespaceSelector:
      matchLabels:
        serving.kubeflow.org: "true"
    
  3. Apply it

    kubectl apply -f /tmp/inferenceservice.yaml
    
  4. Label each namespace in which you want to use KFServing

    kubectl label namespace ${NAMESPACE} serving.kubeflow.org=true
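
The selector edit in the recipe can also be scripted instead of done by hand. The following is a minimal sketch that assumes the configuration was exported as JSON (`-o json`) rather than YAML. The `make_opt_in` helper and the trimmed `sample` document are hypothetical, and a real configuration has many more fields.

```python
import copy
import json


def make_opt_in(config, label_key="serving.kubeflow.org", label_value="true"):
    """Return a copy whose webhooks only match namespaces labelled key=value."""
    patched = copy.deepcopy(config)
    for hook in patched.get("webhooks", []):
        hook["namespaceSelector"] = {"matchLabels": {label_key: label_value}}
    return patched


# Trimmed, hypothetical stand-in for the exported webhook configuration.
sample = {
    "kind": "MutatingWebhookConfiguration",
    "metadata": {"name": "inferenceservice.serving.kubeflow.org"},
    "webhooks": [
        {"name": "inferenceservice.kfserving-webhook-server.pod-mutator"},
    ],
}

print(json.dumps(make_opt_in(sample), indent=2))
```

As noted under the possible fixes, the controller creates the webhook, so a change made this way may be overwritten when the controller restarts.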
    