katib-controller crashes with invalid memory address or nil pointer dereference
/kind bug
What steps did you take and what happened: [A clear and concise description of what the bug is.]
- Install Kubeflow 1.2 on Kubernetes 1.16.15 with the official kfctl.
- Start a notebook and run the following script: mnist-pipeline.txt
- The katib-controller enters the CrashLoopBackOff state and stays there, and the following logs are found:
{"level":"info","ts":1615690131.710228,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","cert-local-filesystem":false,"webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true,"trial-resources":[{"Group":"batch","Version":"v1","Kind":"Job"},{"Group":"kubeflow.org","Version":"v1","Kind":"TFJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"PyTorchJob"},{"Group":"kubeflow.org","Version":"v1","Kind":"MPIJob"},{"Group":"tekton.dev","Version":"v1beta1","Kind":"PipelineRun"}]}
{"level":"info","ts":1615690131.8203886,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1615690131.821229,"logger":"entrypoint","msg":"Setting up controller"}
{"level":"info","ts":1615690131.8212914,"logger":"experiment-controller","msg":"Using the default suggestion implementation"}
{"level":"info","ts":1615690131.8215294,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.821822,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8220415,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8222158,"logger":"experiment-controller","msg":"Experiment controller created"}
{"level":"info","ts":1615690131.8223069,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8223562,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.822521,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8226724,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8228512,"logger":"suggestion-controller","msg":"Suggestion controller created"}
{"level":"info","ts":1615690131.8230324,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1615690131.8231058,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: batch/v1, Kind=Job"}
{"level":"info","ts":1615690131.8232667,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"batch","CRD Version":"v1","CRD Kind":"Job"}
{"level":"info","ts":1615690131.8233113,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=TFJob"}
{"level":"info","ts":1615690131.823487,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"TFJob"}
{"level":"info","ts":1615690131.8235776,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"}
{"level":"info","ts":1615690131.8237944,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"PyTorchJob"}
{"level":"info","ts":1615690131.823831,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=MPIJob"}
{"level":"info","ts":1615690131.8239534,"logger":"trial-controller","msg":"Job watch added successfully","CRD Group":"kubeflow.org","CRD Version":"v1","CRD Kind":"MPIJob"}
{"level":"info","ts":1615690131.8239853,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: tekton.dev/v1beta1, Kind=PipelineRun"}
{"level":"error","ts":1615690131.8240302,"logger":"kubebuilder.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":{"Group":"tekton.dev","Kind":"PipelineRun"},"error":"no matches for kind \"PipelineRun\" in version \"tekton.dev/v1beta1\"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:89\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Watch\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/trial.add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:106\ngithub.com/kubeflow/katib/pkg/controller.v1beta1/trial.Add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/trial/trial_controller.go:65\ngithub.com/kubeflow/katib/pkg/controller%2ev1beta1.AddToManager\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/controller.go:28\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:112\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204"}
{"level":"info","ts":1615690131.8242824,"logger":"trial-controller","msg":"Job watch error. CRD might be missing. Please install CRD and restart katib-controller","CRD Group":"tekton.dev","CRD Version":"v1beta1","CRD Kind":"PipelineRun"}
{"level":"info","ts":1615690131.8243027,"logger":"trial-controller","msg":"Trial controller created"}
{"level":"info","ts":1615690131.8243096,"logger":"entrypoint","msg":"Setting up webhooks"}
{"level":"info","ts":1615690131.8245256,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1615690131.9251847,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trial-controller"}
{"level":"info","ts":1615690131.9252026,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"suggestion-controller"}
{"level":"info","ts":1615690131.9251676,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"experiment-controller"}
{"level":"info","ts":1615690131.9251678,"logger":"kubebuilder.webhook","msg":"installing webhook configuration in cluster"}
{"level":"info","ts":1615690132.0258567,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"suggestion-controller","worker count":1}
{"level":"info","ts":1615690132.025887,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trial-controller","worker count":1}
{"level":"info","ts":1615690132.0259328,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"experiment-controller","worker count":1}
E0314 02:48:52.027598 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:969
/usr/local/go/src/runtime/panic.go:212
/usr/local/go/src/runtime/signal_unix.go:720
/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:294
/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:283
/go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1374
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x11cf162]
goroutine 378 [running]:
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
    /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x10c
panic(0x1507140, 0x2229490)
    /usr/local/go/src/runtime/panic.go:969 +0x1b9
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileTrials(0xc000403320, 0xc0003a0840, 0x2276968, 0x0, 0x0, 0xc0001be310, 0x0)
    /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:294 +0x142
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc000403320, 0xc0003a0840, 0x2276968, 0x0)
    /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:283 +0x38b
github.com/kubeflow/katib/pkg/controller.v1beta1/experiment.(*ReconcileExperiment).Reconcile(0xc000403320, 0xc000611f10, 0x8, 0xc000782c60, 0x2a, 0x203000, 0x203000, 0xc000062800, 0x7f173e2e2e00)
    /go/src/github.com/kubeflow/katib/pkg/controller.v1beta1/experiment/experiment_controller.go:239 +0x768
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000544fa0, 0x18b6200)
    /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215 +0x1de
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
    /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000d1a340)
    /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5f
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000d1a340, 0x3b9aca00, 0x0, 0x100000000000001, 0xc000578a80)
    /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0x105
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000d1a340, 0x3b9aca00, 0xc000578a80)
    /go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
    /go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x331
invalid memory address or nil pointer dereference
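For readers unfamiliar with this class of failure: the trace points at ReconcileTrials (experiment_controller.go:294) dereferencing a nil pointer inside the reconcile loop. The sketch below is only an illustration of that general pattern in Go, not Katib's actual code. The types Experiment and TrialTemplate, the functions trialKindUnsafe, trialKindSafe and reconcile, and the experiment name "mnist" are all hypothetical. It contrasts an unchecked dereference with a guarded one that returns an error, which a controller could log and retry instead of crashing the process.

```go
package main

import (
	"errors"
	"fmt"
)

// TrialTemplate and Experiment are hypothetical stand-ins for illustration;
// they are NOT Katib's real API types.
type TrialTemplate struct {
	Kind string
}

type Experiment struct {
	Name          string
	TrialTemplate *TrialTemplate // nil when the field was never populated
}

var errNoTemplate = errors.New("experiment has no trial template")

// trialKindUnsafe dereferences TrialTemplate without checking it, so it
// panics with a nil pointer dereference when the field is nil.
func trialKindUnsafe(e *Experiment) string {
	return e.TrialTemplate.Kind
}

// trialKindSafe guards both pointers and reports the problem as an error.
func trialKindSafe(e *Experiment) (string, error) {
	if e == nil || e.TrialTemplate == nil {
		return "", errNoTemplate
	}
	return e.TrialTemplate.Kind, nil
}

// reconcile calls the unsafe variant and converts a panic into an error,
// so a single bad object does not take down the whole process.
func reconcile(e *Experiment) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	_ = trialKindUnsafe(e)
	return nil
}

func main() {
	broken := &Experiment{Name: "mnist"} // TrialTemplate left nil on purpose

	if _, err := trialKindSafe(broken); err != nil {
		fmt.Println("safe path:", err)
	}
	fmt.Println("unsafe path:", reconcile(broken))
}
```

In the logs above the panic is observed by HandleCrash and then raised again ("[recovered]" followed by a second panic), which is why the pod restarts and ends up in CrashLoopBackOff. Which field is actually nil at line 294 cannot be determined from the trace alone.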
What did you expect to happen:
The experiment is successful and kubeflow is running normally.
Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]
Environment:
- Kubeflow version (kfctl version): v1.2.0-0-gbc038f9
- Minikube version (minikube version):
- Kubernetes version (use kubectl version): 1.16
- OS (e.g. from /etc/os-release): CentOS 7
Issue Analytics
- Created 3 years ago
- Comments:16 (6 by maintainers)
Top GitHub Comments
@andreyvelich Thanks for your help!
I believe MXJob is just one of the distributed training operators that Kubeflow provides. Since Katib supports any Kubernetes resource as a Trial template, you can easily use MXJob instead of a Kubernetes Job.
We don't have an example of running MXJob, but such a contribution would be great.
You can learn more about MXJob here: https://github.com/kubeflow/mxnet-operator.