kserve-controller-manager keeps failing
See original GitHub issue
/kind bug
What steps did you take and what happened: kserve-controller-manager keeps failing, and modifying an InferenceService freezes.
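The controller logs below come from the manager pod. A sketch of how to pull them, assuming the default kserve namespace, the control-plane=kserve-controller-manager label, and the manager container name used by the upstream KServe manifests:

# Find the controller pod and dump its (previous) logs.
kubectl -n kserve get pods -l control-plane=kserve-controller-manager
kubectl -n kserve logs <controller-pod-name> -c manager --previous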
{"level":"info","ts":1656891745.6919084,"logger":"entrypoint","msg":"Setting up client for manager"}
{"level":"info","ts":1656891745.6920536,"logger":"entrypoint","msg":"Setting up manager"}
I0703 23:42:26.743080 1 request.go:665] Waited for 1.03096179s due to client-side throttling, not priority and fairness, request: GET:https://10.100.0.1:443/apis/networking.internal.knative.dev/v1alpha1?timeout=32s
{"level":"info","ts":1656891746.7958298,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":"127.0.0.1:8080"}
{"level":"info","ts":1656891746.7959857,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1656891746.7959967,"logger":"entrypoint","msg":"Setting up KServe v1alpha1 scheme"}
{"level":"info","ts":1656891746.7962148,"logger":"entrypoint","msg":"Setting up KServe v1beta1 scheme"}
{"level":"info","ts":1656891747.9052598,"logger":"entrypoint","msg":"Setting up core scheme"}
{"level":"info","ts":1656891747.9054022,"logger":"setup","msg":"Setting up v1beta1 controller"}
{"level":"info","ts":1656891747.9921937,"logger":"setup","msg":"Setting up v1beta1 TrainedModel controller"}
{"level":"info","ts":1656891747.9922853,"logger":"setup","msg":"Setting up InferenceGraph controller"}
{"level":"info","ts":1656891747.9923978,"logger":"entrypoint","msg":"setting up webhook server"}
{"level":"info","ts":1656891747.9924114,"logger":"entrypoint","msg":"registering webhooks to the webhook server"}
{"level":"info","ts":1656891747.9925187,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-pods"}
{"level":"info","ts":1656891747.9926226,"logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"serving.kserve.io/v1alpha1, Kind=TrainedModel"}
{"level":"info","ts":1656891747.9926972,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1alpha1, Kind=TrainedModel","path":"/validate-serving-kserve-io-v1alpha1-trainedmodel"}
{"level":"info","ts":1656891747.9930542,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1alpha1-trainedmodel"}
{"level":"info","ts":1656891747.9932442,"logger":"controller-runtime.builder","msg":"skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called","GVK":"serving.kserve.io/v1alpha1, Kind=InferenceGraph"}
{"level":"info","ts":1656891747.9933252,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1alpha1, Kind=InferenceGraph","path":"/validate-serving-kserve-io-v1alpha1-inferencegraph"}
{"level":"info","ts":1656891747.993419,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1alpha1-inferencegraph"}
{"level":"info","ts":1656891747.9935417,"logger":"controller-runtime.builder","msg":"Registering a mutating webhook","GVK":"serving.kserve.io/v1beta1, Kind=InferenceService","path":"/mutate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9936569,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/mutate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9937623,"logger":"controller-runtime.builder","msg":"Registering a validating webhook","GVK":"serving.kserve.io/v1beta1, Kind=InferenceService","path":"/validate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9938784,"logger":"controller-runtime.webhook","msg":"registering webhook","path":"/validate-serving-kserve-io-v1beta1-inferenceservice"}
{"level":"info","ts":1656891747.9939938,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1656891747.9942937,"msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1656891747.9948297,"logger":"controller.inferenceservice","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.995057,"logger":"controller.inferenceservice","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.995248,"logger":"controller.inferenceservice","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1656891747.9955359,"logger":"controller.trainedmodel","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.9955592,"logger":"controller.trainedmodel","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"info","ts":1656891747.9958353,"logger":"controller.inferencegraph","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.9958618,"logger":"controller.inferencegraph","msg":"Starting EventSource","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","source":"kind source: /, Kind="}
{"level":"info","ts":1656891747.99587,"logger":"controller.inferencegraph","msg":"Starting Controller","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph"}
{"level":"info","ts":1656891747.9959214,"logger":"controller-runtime.webhook.webhooks","msg":"starting webhook server"}
{"level":"info","ts":1656891747.9961152,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1656891747.9963746,"logger":"controller-runtime.webhook","msg":"serving webhook server","host":"","port":9443}
{"level":"info","ts":1656891747.99785,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"info","ts":1656891748.2967606,"logger":"controller.inferenceservice","msg":"Starting workers","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService","worker count":1}
{"level":"info","ts":1656891748.296824,"logger":"controller.trainedmodel","msg":"Starting workers","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel","worker count":1}
{"level":"info","ts":1656891748.398182,"logger":"v1beta1Controllers.InferenceService","msg":"Inference service deployment mode ","deployment mode ":"RawDeployment"}
{"level":"info","ts":1656891748.3982296,"logger":"v1beta1Controllers.InferenceService","msg":"Reconciling inference service","apiVersion":"serving.kserve.io/v1beta1","isvc":"woorim"}
{"level":"info","ts":1656891748.6028721,"logger":"PredictorReconciler","msg":"Resolved container","container":"&Container{Name:kserve-container,Image:docker.io/kyuwoochoi/mlserver:0.1.1,Command:[],Args:[],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:MLSERVER_MODEL_NAME,Value:woorim,ValueFrom:nil,},EnvVar{Name:MLSERVER_MODEL_URI,Value:/mnt/models,ValueFrom:nil,},EnvVar{Name:MLSERVER_MODEL_IMPLEMENTATION,Value:mlserver_mlflow.MLflowRuntime,ValueFrom:nil,},EnvVar{Name:MLSERVER_HTTP_PORT,Value:8080,ValueFrom:nil,},EnvVar{Name:MLSERVER_GRPC_PORT,Value:9000,ValueFrom:nil,},EnvVar{Name:MODELS_DIR,Value:/mnt/models,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{cpu: {{1 0} {<nil>} 1 DecimalSI},memory: {{2147483648 0} {<nil>} 2Gi BinarySI},},Requests:ResourceList{cpu: {{100 -3} {<nil>} 100m DecimalSI},memory: {{209715200 0} {<nil>} BinarySI},},},VolumeMounts:[]VolumeMount{},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:,ImagePullPolicy:,SecurityContext:nil,Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,}","podSpec":{"containers":[{"name":"kserve-container","image":"docker.io/kyuwoochoi/mlserver:0.1.1","env":[{"name":"MLSERVER_MODEL_NAME","value":"woorim"},{"name":"MLSERVER_MODEL_URI","value":"/mnt/models"},{"name":"MLSERVER_MODEL_IMPLEMENTATION","value":"mlserver_mlflow.MLflowRuntime"},{"name":"MLSERVER_HTTP_PORT","value":"8080"},{"name":"MLSERVER_GRPC_PORT","value":"9000"},{"name":"MODELS_DIR","value":"/mnt/models"}],"resources":{"limits":{"cpu":"1","memory":"2Gi"},"requests":{"cpu":"100m","memory":"200Mi"}}}]}}
{"level":"info","ts":1656891748.6033475,"logger":"DeploymentReconciler","msg":"deployment reconcile","checkResult":2,"err":null}
{"level":"info","ts":1656891748.704376,"logger":"ServiceReconciler","msg":"service reconcile","checkResult":2,"err":null}
{"level":"info","ts":1656891748.8063047,"logger":"HPAReconciler","msg":"service reconcile","checkResult":2,"err":null}
E0703 23:42:28.808931 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:30.309089 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:32.365735 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:37.251654 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:42:47.855457 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:43:00.716039 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
E0703 23:43:41.203619 1 reflector.go:138] pkg/mod/k8s.io/client-go@v0.22.3/tools/cache/reflector.go:167: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:kserve:kserve-controller-manager" cannot list resource "pods" in API group "" at the cluster scope
{"level":"error","ts":1656891868.2952135,"logger":"controller.inferencegraph","msg":"Could not wait for Cache to sync","reconciler group":"serving.kserve.io","reconciler kind":"InferenceGraph","error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startRunnable.func1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.10.2/pkg/manager/internal.go:696"}
{"level":"info","ts":1656891868.2955046,"logger":"controller.trainedmodel","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"info","ts":1656891868.295512,"logger":"controller.inferenceservice","msg":"Shutdown signal received, waiting for all workers to finish","reconciler group":"serving.kserve.io","reconciler kind":"InferenceService"}
{"level":"info","ts":1656891868.2958367,"logger":"controller-runtime.webhook","msg":"shutting down webhook server"}
{"level":"info","ts":1656891868.2960558,"logger":"controller.trainedmodel","msg":"All workers finished","reconciler group":"serving.kserve.io","reconciler kind":"TrainedModel"}
{"level":"error","ts":1656891898.2959445,"logger":"entrypoint","msg":"unable to run the manager","error":"[failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced, failed waiting for all runnables to end within grace period of 30s: context deadline exceeded]","errorCauses":[{"error":"failed to wait for inferencegraph caches to sync: timed out waiting for cache to be synced"},{"error":"failed waiting for all runnables to end within grace period of 30s: context deadline exceeded"}]}
Environment:
- Istio Version: 1.12.8
- Knative Version: 1.5.0
- KFServing Version: v0.9.0-rc0
- Kubeflow version: -
- Kfdef: [k8s_istio/istio_dex/gcp_basic_auth/gcp_iap/aws/aws_cognito/ibm] -
- Minikube version: -
- Kubernetes version (use kubectl version): v1.22.10
- OS (e.g. from /etc/os-release): -
Issue Analytics
- State:
- Created: a year ago
- Comments: 13 (10 by maintainers)
Top Results From Across the Web

kserve-controller-manager error due to "failed to wait for ..."
kserve-controller-manager keeps restarting. ... to run the manager","error":"failed to wait for inferencegraph caches to sync: timed out ...

Why does a Kubernetes controller manager fail to start after ...
This scenario occurs occasionally when the master node is restarted. To workaround this issue, complete the following steps: Delete the ...

Knative Serving overview
Knative Serving defines a set of objects as Kubernetes Custom Resource Definitions (CRDs). These resources are used to define and control how your ...

Kube-Controller-Manager can't reach API: "Unauthorized"
[1] In these cases, the kube-controller-manager is unable to reach the API, getting an "Unauthorized" error [2]: cronjob_controller.go:124] Failed to ...

Issues and Workarounds
Symptom: Graphs on the Kubernetes Dashboard fail to load when displaying ... This prevents the "kubectl drain" from succeeding during the Kubernetes upgrade ...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks! Will try to replicate this.
Able to replicate now; it seems like the issue does not happen until you create an InferenceGraph.
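For anyone trying to reproduce this: per the comment above, the crash loop only starts once an InferenceGraph exists, because that controller is what opens the Pod watch that RBAC denies. A minimal sketch under those assumptions follows; the graph name and serviceName placeholder are illustrative, and the serviceName must reference an existing InferenceService. The ClusterRole patch is a hedged workaround rather than the confirmed upstream fix, and assumes the kserve-manager-role name from the KServe manifests.

# Hypothetical minimal graph that triggers the InferenceGraph controller.
kubectl apply -n default -f - <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: demo-graph
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - serviceName: <existing-inferenceservice>
EOF

# Workaround sketch: grant the missing pods permission to the (assumed) ClusterRole.
kubectl patch clusterrole kserve-manager-role --type=json -p='[
  {"op": "add", "path": "/rules/-", "value": {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}}
]'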