Katib Controller failing
See original GitHub issue
/kind bug
What steps did you take and what happened:
I am getting the error log below from katib-controller. Deleting the experiments and trials took a very long time, and the controller pod keeps failing even though I deleted and restarted it. I ran a Katib Experiment with the manifest below, submitted as a YAML file from the Katib UI (generate).
```yaml
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: apple
  labels:
    controller-tools.k8s.io: "1.0"
  name: transformer-experiment
spec:
  objective:
    type: maximize
    goal: 0.8
    objectiveMetricName: Train-accuracy
    additionalMetricNames:
      - Train-loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.005"
        max: "0.020"
    - name: --layer_count
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --d_model_count
      parameterType: categorical
      feasibleSpace:
        list:
          - "64"
          - "128"
          - "256"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              volumes:
                - name: train-data
                  emptyDir: {}
              containers:
                - name: data-download
                  image: amazon/aws-cli
                  command:
                    - "aws s3 sync s3://<Our Bucket Name>/kubeflowdata.tar.gz /train-data"
                  volumeMounts:
                    - name: train-data
                      mountPath: /train-data
                - name: {{.Trial}}
                  image: <My Image>
                  command:
                    - "cd /train-data"
                    - "ls"
                    - "python"
                    - "/opt/ml/src/main.py"
                    - "--train_batch=64"
                    - "--test_batch=64"
                    - "--num_workers=4"
                  volumeMounts:
                    - name: train-data
                      mountPath: /train-data
                  {{- with .HyperParameters}}
                  {{- range .}}
                    - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
              restartPolicy: Never
```
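Two details in this template look suspect as transcribed, though I cannot confirm either is what crashes the controller. First, the `command` entries are whole shell lines; Kubernetes passes each list item to exec verbatim, so "aws s3 sync ..." is looked up as a single executable name, and "cd /train-data" cannot work without a shell. Second, the `{{- with .HyperParameters}}` block comes after `volumeMounts`, so the rendered "--name=value" items would be appended to the volume mount list rather than to the training command. A hedged sketch of the pod spec with both repaired (the image names and S3 path are placeholders carried over from the original; moving the download into `initContainers` is my assumption, so it finishes before training starts):

```yaml
spec:
  template:
    spec:
      volumes:
        - name: train-data
          emptyDir: {}
      initContainers:                  # assumption: download should finish before training
        - name: data-download
          image: amazon/aws-cli
          command: ["/bin/sh", "-c"]   # run the line through a shell so the flags parse
          args:
            - "aws s3 sync s3://<Our Bucket Name>/kubeflowdata.tar.gz /train-data"
          volumeMounts:
            - name: train-data
              mountPath: /train-data
      containers:
        - name: {{.Trial}}
          image: <My Image>
          workingDir: /train-data      # replaces the non-working "cd /train-data" command entry
          command: ["python", "/opt/ml/src/main.py"]
          args:
            - "--train_batch=64"
            - "--test_batch=64"
            - "--num_workers=4"
            {{- with .HyperParameters}}
            {{- range .}}
            - "{{.Name}}={{.Value}}"
            {{- end}}
            {{- end}}
          volumeMounts:
            - name: train-data
              mountPath: /train-data
      restartPolicy: Never
```

With the manifest as originally submitted, the controller panicked as soon as it tried to create trials; its full log follows.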
```
{"level":"info","ts":1599829358.6809406,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","cert-local-filesystem":false,"webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true}
{"level":"info","ts":1599829358.8558226,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1599829358.8564165,"logger":"entrypoint","msg":"Setting up controller"}
{"level":"info","ts":1599829358.856454,"logger":"experiment-controller","msg":"Using the default suggestion implementation"}
{"level":"info","ts":1599829358.8565354,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8566551,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8567514,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8568358,"logger":"experiment-controller","msg":"Experiment controller created"}
{"level":"info","ts":1599829358.8569014,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8569264,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8570082,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8571074,"logger":"suggestion-controller","msg":"Suggestion controller created"}
{"level":"info","ts":1599829358.8571756,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.857207,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8572953,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: batch/v1, Kind=Job"}
{"level":"info","ts":1599829358.8573985,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"Job"}
{"level":"info","ts":1599829358.8574154,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=TFJob"}
{"level":"info","ts":1599829358.8575017,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"TFJob"}
{"level":"info","ts":1599829358.8575182,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"}
{"level":"info","ts":1599829358.8575974,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"PyTorchJob"}
{"level":"info","ts":1599829358.857611,"logger":"trial-controller","msg":"Trial controller created"}
{"level":"info","ts":1599829358.8576157,"logger":"entrypoint","msg":"Setting up webhooks"}
{"level":"info","ts":1599829358.8577054,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1599829358.9580615,"logger":"kubebuilder.webhook","msg":"installing webhook configuration in cluster"}
{"level":"info","ts":1599829358.958065,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"experiment-controller"}
{"level":"info","ts":1599829358.9580858,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"suggestion-controller"}
{"level":"info","ts":1599829358.9581332,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trial-controller"}
{"level":"info","ts":1599829359.0582125,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"experiment-controller","worker count":1}
{"level":"info","ts":1599829359.058568,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"suggestion-controller","worker count":1}
{"level":"info","ts":1599829359.0585876,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trial-controller","worker count":1}
{"level":"info","ts":1599829359.0585961,"logger":"experiment-controller","msg":"Statistics","Experiment":"kubeflow/random-example","requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":1599829359.058652,"logger":"experiment-controller","msg":"CreateTrials","Experiment":"kubeflow/random-example","addCount":3}
{"level":"info","ts":1599829359.0587487,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"kubeflow/random-example","Instance name":"random-example","suggestionRequestsCount":3}
E0911 13:02:39.059078 1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:969
/usr/local/go/src/runtime/panic.go:212
/usr/local/go/src/runtime/signal_unix.go:695
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:98
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:83
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:78
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_util.go:45
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:350
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:329
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:274
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:232
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1373
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x115dda2]
goroutine 344 [running]:
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x14ac460, 0x24d8060)
/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).getTrialTemplate(0xc000512400, 0xc0007b9680, 0x138b815, 0xa, 0xc00012e7a0)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:98 +0x42
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).getRunSpec(0xc000512400, 0xc0007b9680, 0xc000984df0, 0xe, 0xc0005cbfa0, 0x17, 0xc000984db8, 0x8, 0xc000088840, 0x3, ...)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:83 +0x67
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).GetRunSpecWithHyperParameters(0xc000512400, 0xc0007b9680, 0xc000984df0, 0xe, 0xc0005cbfa0, 0x17, 0xc000984db8, 0x8, 0xc000088840, 0x3, ...)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:78 +0x144
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).createTrialInstance(0xc000a5bbc0, 0xc0007b9680, 0xc000dc18b0, 0xc000dae100, 0x3)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_util.go:45 +0x2e1
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).createTrials(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0, 0x0, 0x3, 0xc000914300, 0x11)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:350 +0x22b
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileTrials(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0, 0x0, 0xc00034ebd0, 0x0)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:329 +0x5d8
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:274 +0x36d
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).Reconcile(0xc000a5bbc0, 0xc00050e0a8, 0x8, 0xc00050e0f0, 0xe, 0xc0001b5d40, 0x0, 0x0, 0x0)
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:232 +0x4fa
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004f9400, 0x0)
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215 +0x1d6
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000db7b70)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5f
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000db7b70, 0x3b9aca00, 0x0, 0x100000000000001, 0xc00053fce0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xf8
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000db7b70, 0x3b9aca00, 0xc00053fce0)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x2fd
```
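The trace bottoms out in `(*DefaultGenerator).getTrialTemplate` at manifest/generator.go:98. I have not checked that exact line, but a nil-pointer panic there is consistent with the generator dereferencing an optional part of `spec.trialTemplate` that was never populated. A minimal, self-contained sketch of that failure mode, using hypothetical type and field names rather than Katib's actual ones:

```go
// Hypothetical sketch, not Katib's actual source: how an unset optional
// sub-struct in an Experiment spec produces the SIGSEGV seen above.
package main

import (
	"errors"
	"fmt"
)

type TemplateSpec struct {
	ConfigMapName string // where an external template would be loaded from
}

type GoTemplate struct {
	RawTemplate  string
	TemplateSpec *TemplateSpec // optional: nil when rawTemplate is inlined
}

type TrialTemplate struct {
	GoTemplate *GoTemplate // optional: nil if the UI/webhook never filled it in
}

// getTrialTemplate mimics the failing call: without the nil checks below, a
// spec whose goTemplate was never set panics with "invalid memory address or
// nil pointer dereference" instead of returning an error.
func getTrialTemplate(t *TrialTemplate) (string, error) {
	if t == nil || t.GoTemplate == nil {
		return "", errors.New("trialTemplate.goTemplate is not set")
	}
	return t.GoTemplate.RawTemplate, nil
}

func main() {
	tpl, err := getTrialTemplate(&TrialTemplate{}) // GoTemplate left nil
	fmt.Println(tpl, err)                          // error is reported, no panic
}
```

If something like this is the cause, it would also explain why restarting the pod does not help: the Experiment object is still in the cluster, so the controller hits the same code path on the next reconcile. That matches the advice in the comments below to delete the stuck resources themselves.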
What did you expect to happen:
I expected the volumes I specified to be mounted in each trial.
Anything else you would like to add:
Environment:
- Kubeflow version (kfctl version): 1.0
- Minikube version (minikube version):
- Kubernetes version (use kubectl version): 1.17
- OS (e.g. from /etc/os-release):
Top GitHub Comments
Yes, it is better to use a new cluster. Or you can try to manually delete all Experiment and Trial resources by removing `metadata.finalizers` and, once each experiment is completely deleted, delete the controller pod.

@neolunar7 I am closing this issue; feel free to re-open if you have any other questions.
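For reference, a sketch of that manual cleanup, assuming the experiment name and namespace from the manifest above (transformer-experiment in apple) and a stock install with the controller in the kubeflow namespace; the label selector in the last command is an assumption, so adjust it to match your deployment:

```sh
# Clear the finalizers on the stuck Experiment so deletion can complete.
kubectl patch experiment transformer-experiment -n apple \
  --type=merge -p '{"metadata":{"finalizers":[]}}'

# Do the same for every Trial the experiment left behind.
for t in $(kubectl get trials -n apple -o name); do
  kubectl patch "$t" -n apple --type=merge -p '{"metadata":{"finalizers":[]}}'
done

# Once the objects are gone, restart the controller by deleting its pod.
kubectl delete pod -n kubeflow -l app=katib-controller   # selector is an assumption
```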