kserve-controller-manager keeps failing

/kind bug

What steps did you take and what happened: kserve-controller-manager keeps failing, and modifying an inference service freezes.

{"level":"info","ts":1656891745.6919084,"logger":"entrypoint","msg":"Setting up client for manager"}
{"level":"info","ts":1656891745.6920536,"logger":"entrypoint","msg":"Setting up manager"}
I0703 23:42:26.743080       1 request.go:665] Waited for 1.03096179s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/networking.internal.knative.dev/v1alpha1?timeout=32s
{"level":"info","ts":1656891746.7958298,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1656891746.7959857,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1656891746.7959967,"logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
{"level":"info","ts":1656891746.7962148,"logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
{"level":"info","ts":1656891747.9052598,"logger":"entrypoint","msg":"Setting up core scheme"}
{"level":"info","ts":1656891747.9054022,"logger":"setup","msg":"Setting up v1beta1 controller"}
{"level":"info","ts":1656891747.9921937,"logger":"setup","msg":"Setting up v1beta1 TrainedModel controller"}
{"level":"info","ts":1656891747.9922853,"logger":"setup","msg":"Setting up InferenceGraph controller"}
{"level":"info","ts":1656891747.9923978,"logger":"entrypoint","msg":"setting up webhook server"}
{"level":"info","ts":1656891747.9924114,"logger":"entrypoint","msg":"registering webhooks to the webhook server"}
{"level":"info","ts":1656891747.9925187,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-pods"}
{"level":"info","ts":1656891747.9926226,"logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"serving.kserve.io/v1alpha1, Kind=TrainedModel"}
{"level":"info","ts":1656891747.9926972,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1alpha1, Kind=TrainedModel","path":"/validate-serving-kserve-io-v1alpha1-trainedmodel"}
{"level":"info","ts":1656891747.9930542,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1alpha1-trainedmodel"}
{"level":"info","ts":1656891747.9932442,"logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"serving.kserve.io/v1alpha1, Kind=InferenceGraph"}
{"level":"info","ts":1656891747.9933252,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1alpha1, Kind=InferenceGraph","path":"/validate-serving-kserve-io-v1alpha1-inferencegraph"}
{"level":"info","ts":1656891747.993419,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1alpha1-inferencegraph"}
{"level":"info","ts":1656891747.9935417,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"serving.kserve.io/v1beta1, Kind=InferenceService","path":"/mutate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9936569,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9937623,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1beta1, Kind=InferenceService","path":"/validate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9938784,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9939938,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1656891747.9942937,"msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1656891747.9948297,"logger":"controller.inferenceservice","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.995057,"logger":"controller.inferenceservice","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.995248,"logger":"controller.inferenceservice","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1656891747.9955359,"logger":"controller.trainedmodel","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.9955592,"logger":"controller.trainedmodel","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"info","ts":1656891747.9958353,"logger":"controller.inferencegraph","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.9958618,"logger":"controller.inferencegraph","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.99587,"logger":"controller.inferencegraph","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph"}
{"level":"info","ts":1656891747.9959214,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1656891747.9961152,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1656891747.9963746,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1656891747.99785,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":1656891748.2967606,"logger":"controller.inferenceservice","msg":"Starting workers","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","worker count":1}
{"level":"info","ts":1656891748.296824,"logger":"controller.trainedmodel","msg":"Starting workers","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel","worker count":1}
{"level":"info","ts":1656891748.398182,"logger":"v1beta1Controllers.InferenceService","msg":"Inference service deployment mode ","deployment mode ":"RawDeployment"}
{"level":"info","ts":1656891748.3982296,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kserve.io/v1beta1","isvc":"woorim"}
{"level":"info","ts":1656891748.6028721,"logger":"PredictorReconciler","msg":"Resolved container","container":"&Container{Name:kserve-container,Image:docker.io/kyuwoochoi/mlserver:0.1.1,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:MLSERVER_MODEL_NAME,Value:woorim,ValueFrom:nil,},EnvVar{Name:MLSERVER_MODEL_URI,Value:/mnt/models,ValueFrom:nil,},EnvVar{Name:MLSERVER_MODEL_IMPLEMENTATION,Value:mlserver_mlflow.MLflowRuntime,ValueFrom:nil,},EnvVar{Name:MLSERVER_HTTP_PORT,Value:8080,ValueFrom:nil,},EnvVar{Name:MLSERVER_GRPC_PORT,Value:9000,ValueFrom:nil,},EnvVar{Name:MODELS_DIR,Value:/mnt/models,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{2147483648 0} {<nil>} 2Gi BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{209715200 0} {<nil>}  BinarySI},},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,}","podSpec":{"containers":[{"name":"kserve-container","image":"docker.io/kyuwoochoi/mlserver:0.1.1","env":[{"name":"MLSERVER_MODEL_NAME","value":"woorim"},{"name":"MLSERVER_MODEL_URI","value":"/mnt/models"},{"name":"MLSERVER_MODEL_IMPLEMENTATION","value":"mlserver_mlflow.MLflowRuntime"},{"name":"MLSERVER_HTTP_PORT","value":"8080"},{"name":"MLSERVER_GRPC_PORT","value":"9000"},{"name":"MODELS_DIR","value":"/mnt/models"}],"resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"100m","memory":"200Mi"}}}]}}
{"level":"info","ts":1656891748.6033475,"logger":"DeploymentReconciler","msg":"deployment reconcile","checkResult":2,"err":null}
{"level":"info","ts":1656891748.704376,"logger":"ServiceReconciler","msg":"service reconcile","checkResult":2,"err":null}
{"level":"info","ts":1656891748.8063047,"logger":"HPAReconciler","msg":"service reconcile","checkResult":2,"err":null}
E0703 23:42:28.808931       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:30.309089       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:32.365735       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:37.251654       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:47.855457       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:43:00.716039       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:43:41.203619       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
{"level":"error","ts":1656891868.2952135,"logger":"controller.inferencegraph","msg":"Could not wait for Cache to sync","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:696"}
{"level":"info","ts":1656891868.2955046,"logger":"controller.trainedmodel","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"info","ts":1656891868.295512,"logger":"controller.inferenceservice","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1656891868.2958367,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
{"level":"info","ts":1656891868.2960558,"logger":"controller.trainedmodel","msg":"All workers finished","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"error","ts":1656891898.2959445,"logger":"entrypoint","msg":"unable to run the manager","error":"[failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]","errorCauses":[{"error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced"},{"error":"failed waiting for all runnables to end within grace period of 30s: context deadline exceeded"}]}
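The log above mixes JSON lines from the controller with klog-style lines from client-go, which makes the actual failures hard to spot. The following is a minimal triage sketch, not part of KServe: the function names `summarize_errors` and `find_rbac_denials` are hypothetical, and the sample lines are abbreviated copies of entries from this log. It counts distinct error messages and extracts the RBAC denial. Read this way, the repeated "pods is forbidden" errors appear to come first and the cache sync timeout follows, which suggests (but does not prove) that the missing list permission on pods keeps the InferenceGraph cache from syncing.

```python
import json
import re
from collections import Counter

# klog-style line, e.g. "E0703 23:42:28.808931       1 reflector.go:138] message"
KLOG_RE = re.compile(r"^([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ ([^\]]+)\] (.*)$")

# RBAC denial text as it appears in the log above
FORBIDDEN_RE = re.compile(
    r'User "(?P<user>[^"]+)" cannot (?P<verb>\w+) resource "(?P<resource>[^"]+)" '
    r'in API group "(?P<group>[^"]*)"'
)


def summarize_errors(lines):
    """Count distinct error messages across JSON and klog lines."""
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("{"):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed JSON lines
            if entry.get("level") == "error":
                # Prefer the "error" field; fall back to "msg" if it is absent
                errors[entry.get("error") or entry.get("msg", "")] += 1
        else:
            match = KLOG_RE.match(line)
            if match and match.group(1) in ("E", "F"):
                errors[match.group(5)] += 1
    return errors


def find_rbac_denials(errors):
    """Extract (user, verb, resource, api_group) tuples from 'forbidden' messages."""
    denials = set()
    for message in errors:
        m = FORBIDDEN_RE.search(message)
        if m:
            denials.add((m["user"], m["verb"], m["resource"], m["group"]))
    return denials


# Abbreviated sample taken from the log above (stack trace fields omitted)
sample = [
    '{"level":"info","ts":1656891747.9939938,"logger":"entrypoint","msg":"Starting the Cmd."}',
    'E0703 23:42:28.808931       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope',
    'E0703 23:42:30.309089       1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope',
    '{"level":"error","ts":1656891868.2952135,"logger":"controller.inferencegraph","msg":"Could not wait for Cache to sync","error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced"}',
]

errors = summarize_errors(sample)
for message, count in errors.most_common():
    print(count, message[:80])
print(find_rbac_denials(errors))
```

To run it against a full log, pass an open file instead of `sample`, for example `summarize_errors(open("manager.log"))`, where `manager.log` is a hypothetical file holding the controller output.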

Environment:

  • Istio Version: 1.12.8
  • Knative Version: 1.5.0
  • KFServing Version: v0.9.0-rc0
  • Kubeflow version: -
  • Kfdef:[k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm] -
  • Minikube version: -
  • Kubernetes version: (use kubectl version): v1.22.10
  • OS (e.g. from /etc/os-release): -

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

1 reaction
yuzisun commented, Jul 15, 2022

@cmaddalozzo Are you running 0.9 RC release?

@yuzisun Yes, I’ve installed 0.9 RC using the YAMLs provided on the release page (I am not using helm).

I am using a fresh kind cluster with K8s v1.23.4.

Thanks! Will try to replicate this.

0 reactions
yuzisun commented, Jul 16, 2022

Able to replicate now. It seems the issue does not happen until you create an inference graph.

{"level":"error","ts":1657987226.863448,"logger":"controller.inferencegraph","msg":"Could not wait for Cache to sync","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\tsigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\tsigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:696"}
{"level":"info","ts":1657987226.8638563,"logger":"controller.inferenceservice","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1657987226.8639836,"logger":"controller.trainedmodel","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"info","ts":1657987226.8641098,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
{"level":"info","ts":1657987226.8642988,"logger":"controller.inferenceservice","msg":"All workers finished","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1657987226.8643909,"logger":"controller.trainedmodel","msg":"All workers finished","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"error","ts":1657987226.865376,"logger":"entrypoint","msg":"unable to run the manager","error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced"}