Katib Controller failing

/kind bug

What steps did you take and what happened: I am getting the error log below from katib-controller. Deleting the experiments and trials took a very long time, and the controller pod keeps failing even after I deleted and restarted it. I ran a Katib Experiment with the YAML below, which I submitted from the Katib UI (generate). The katib-controller log follows the YAML.

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: apple
  labels:
    controller-tools.k8s.io: "1.0"
  name: transformer-experiment
spec:
  objective:
    type: maximize
    goal: 0.8
    objectiveMetricName: Train-accuracy
    additionalMetricNames:
      - Train-loss
  algorithm:
    algorithmName: random
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  metricsCollectorSpec:
    collector:
      kind: StdOut
  parameters:
    - name: --lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.03"
    - name: --dropout_rate
      parameterType: double
      feasibleSpace:
        min: "0.005"
        max: "0.020"
    - name: --layer_count
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
    - name: --d_model_count
      parameterType: categorical
      feasibleSpace:
        list:
        - "64"
        - "128"
        - "256"
  trialTemplate:
    goTemplate:
        rawTemplate: |-
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: {{.Trial}}
            namespace: {{.NameSpace}}
          spec:
            template:
              spec:
                volumes:
                - name: train-data
                  emptyDir: {}
                containers:
                - name: data-download
                  image: amazon/aws-cli
                  command:
                  - "aws s3 sync s3://<Our Bucket Name>/kubeflowdata.tar.gz /train-data"
                  volumeMounts:
                  - name: train-data
                    mountPath: /train-data
                - name: {{.Trial}}
                  image: <My Image>
                  command:
                  - "cd /train-data"
                  - "ls"
                  - "python"
                  - "/opt/ml/src/main.py"
                  - "--train_batch=64"
                  - "--test_batch=64"
                  - "--num_workers=4"
                  volumeMounts:
                  - name: train-data
                    mountPath: /train-data
                  {{- with .HyperParameters}}
                  {{- range .}}
                  - "{{.Name}}={{.Value}}"
                  {{- end}}
                  {{- end}}
                restartPolicy: Never
{"level":"info","ts":1599829358.6809406,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","cert-local-filesystem":false,"webhook-port":8443,"metrics-addr":":8080","inject-security-context":false,"enable-grpc-probe-in-suggestion":true}
{"level":"info","ts":1599829358.8558226,"logger":"entrypoint","msg":"Registering Components."}
{"level":"info","ts":1599829358.8564165,"logger":"entrypoint","msg":"Setting up controller"}
{"level":"info","ts":1599829358.856454,"logger":"experiment-controller","msg":"Using the default suggestion implementation"}
{"level":"info","ts":1599829358.8565354,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8566551,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8567514,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8568358,"logger":"experiment-controller","msg":"Experiment controller created"}
{"level":"info","ts":1599829358.8569014,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8569264,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8570082,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"suggestion-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8571074,"logger":"suggestion-controller","msg":"Suggestion controller created"}
{"level":"info","ts":1599829358.8571756,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.857207,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1599829358.8572953,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: batch/v1, Kind=Job"}
{"level":"info","ts":1599829358.8573985,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"Job"}
{"level":"info","ts":1599829358.8574154,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=TFJob"}
{"level":"info","ts":1599829358.8575017,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"TFJob"}
{"level":"info","ts":1599829358.8575182,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"trial-controller","source":"kind source: kubeflow.org/v1, Kind=PyTorchJob"}
{"level":"info","ts":1599829358.8575974,"logger":"trial-controller","msg":"Job watch added successfully","CRD Kind":"PyTorchJob"}
{"level":"info","ts":1599829358.857611,"logger":"trial-controller","msg":"Trial  controller created"}
{"level":"info","ts":1599829358.8576157,"logger":"entrypoint","msg":"Setting up webhooks"}
{"level":"info","ts":1599829358.8577054,"logger":"entrypoint","msg":"Starting the Cmd."}
{"level":"info","ts":1599829358.9580615,"logger":"kubebuilder.webhook","msg":"installing webhook configuration in cluster"}
{"level":"info","ts":1599829358.958065,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"experiment-controller"}
{"level":"info","ts":1599829358.9580858,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"suggestion-controller"}
{"level":"info","ts":1599829358.9581332,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"trial-controller"}
{"level":"info","ts":1599829359.0582125,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"experiment-controller","worker count":1}
{"level":"info","ts":1599829359.058568,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"suggestion-controller","worker count":1}
{"level":"info","ts":1599829359.0585876,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"trial-controller","worker count":1}
{"level":"info","ts":1599829359.0585961,"logger":"experiment-controller","msg":"Statistics","Experiment":"kubeflow/random-example","requiredActiveCount":3,"parallelCount":3,"activeCount":0,"completedCount":0}
{"level":"info","ts":1599829359.058652,"logger":"experiment-controller","msg":"CreateTrials","Experiment":"kubeflow/random-example","addCount":3}
{"level":"info","ts":1599829359.0587487,"logger":"experiment-controller","msg":"GetOrCreateSuggestion","Experiment":"kubeflow/random-example","Instance name":"random-example","suggestionRequestsCount":3}
E0911 13:02:39.059078       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/usr/local/go/src/runtime/panic.go:969
/usr/local/go/src/runtime/panic.go:212
/usr/local/go/src/runtime/signal_unix.go:695
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:98
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:83
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:78
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_util.go:45
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:350
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:329
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:274
/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:232
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215
/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/usr/local/go/src/runtime/asm_amd64.s:1373
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x115dda2]
goroutine 344 [running]:
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x14ac460, 0x24d8060)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).getTrialTemplate(0xc000512400, 0xc0007b9680, 0x138b815, 0xa, 0xc00012e7a0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:98 +0x42
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).getRunSpec(0xc000512400, 0xc0007b9680, 0xc000984df0, 0xe, 0xc0005cbfa0, 0x17, 0xc000984db8, 0x8, 0xc000088840, 0x3, ...)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:83 +0x67
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest.(*DefaultGenerator).GetRunSpecWithHyperParameters(0xc000512400, 0xc0007b9680, 0xc000984df0, 0xe, 0xc0005cbfa0, 0x17, 0xc000984db8, 0x8, 0xc000088840, 0x3, ...)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/manifest/generator.go:78 +0x144
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).createTrialInstance(0xc000a5bbc0, 0xc0007b9680, 0xc000dc18b0, 0xc000dae100, 0x3)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_util.go:45 +0x2e1
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).createTrials(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0, 0x0, 0x3, 0xc000914300, 0x11)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:350 +0x22b
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileTrials(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0, 0x0, 0xc00034ebd0, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:329 +0x5d8
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).ReconcileExperiment(0xc000a5bbc0, 0xc0007b9680, 0x251f2d8, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:274 +0x36d
github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.(*ReconcileExperiment).Reconcile(0xc000a5bbc0, 0xc00050e0a8, 0x8, 0xc00050e0f0, 0xe, 0xc0001b5d40, 0x0, 0x0, 0x0)
	/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:232 +0x4fa
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0004f9400, 0x0)
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:215 +0x1d6
github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1()
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158 +0x36
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc000db7b70)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x5f
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc000db7b70, 0x3b9aca00, 0x0, 0x100000000000001, 0xc00053fce0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xf8
github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc000db7b70, 0x3b9aca00, 0xc00053fce0)
	/go/src/github.com/kubeflow/katib/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
	/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:157 +0x2fd
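
A note on reading the log above: the last experiment-controller entries before the panic name the Experiment `kubeflow/random-example`, not `apple/transformer-experiment`, so the crash may be triggered while reconciling a different, leftover Experiment. The following is a minimal Python sketch for pulling that information out of a saved controller log. The function name `last_experiment_before_panic` is hypothetical, and the sketch assumes the log has the same shape as the one shown here (JSON lines followed by a Go panic).

```python
import json


def last_experiment_before_panic(log_text):
    """Return the last "Experiment" value logged before the first panic line.

    Returns None if the log contains no panic line, or if no Experiment
    was logged before the panic.
    """
    last = None
    for raw in log_text.splitlines():
        line = raw.strip()
        # The controller log marks the crash with one of these two forms.
        if "Observed a panic" in line or line.startswith("panic:"):
            return last
        if not line.startswith("{"):
            continue  # not a structured (JSON) log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed entries
        if "Experiment" in entry:
            last = entry["Experiment"]
    return None


# Two lines copied from the log in this issue.
SAMPLE = '''{"level":"info","ts":1599829359.058652,"logger":"experiment-controller","msg":"CreateTrials","Experiment":"kubeflow/random-example","addCount":3}
E0911 13:02:39.059078       1 runtime.go:69] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
'''

print(last_experiment_before_panic(SAMPLE))
```

Knowing which Experiment is being reconciled at crash time indicates which resource to inspect or remove first; it does not by itself explain why `getTrialTemplate` dereferences a nil pointer.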

What did you expect to happen: I expected the volumes I specified to be mounted in each trial.

Environment:

  • Kubeflow version (kfctl version): 1.0
  • Minikube version (minikube version):
  • Kubernetes version: (use kubectl version): 1.17
  • OS (e.g. from /etc/os-release):

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

1 reaction
andreyvelich commented, Sep 11, 2020

Yes, it is better to use a new cluster. Alternatively, you can try to manually delete all Experiment and Trial resources by removing their metadata.finalizers and, once the experiment is completely deleted, delete the controller pod.
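
The finalizer removal described above can be sketched as follows. This is a hedged illustration, not a command taken from the thread: it builds a `kubectl patch` invocation that applies a JSON merge patch setting `metadata.finalizers` to null. The helper name `clear_finalizers_cmd` is hypothetical, and the resource name and namespace are taken from the log in this issue (`kubeflow/random-example`); substitute your own Experiment and Trial names.

```python
import json
import shlex


def clear_finalizers_cmd(kind, name, namespace):
    """Build a kubectl command that removes all finalizers from one resource."""
    # A JSON merge patch with a null value deletes the field.
    patch = json.dumps({"metadata": {"finalizers": None}})
    return [
        "kubectl", "patch", kind, name,
        "-n", namespace,
        "--type=merge",
        "-p", patch,
    ]


# Assumption: the stuck Experiment is the one named in the controller log.
cmd = clear_finalizers_cmd("experiments.kubeflow.org", "random-example", "kubeflow")

# Print a shell-quoted form for review before running it by hand.
print(shlex.join(cmd))
```

The same helper can be called with `trials.kubeflow.org` for each remaining Trial. Printing the command rather than executing it keeps the step reviewable, since removing finalizers skips the controller's own cleanup.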

0 reactions
andreyvelich commented, Nov 13, 2020

@neolunar7 I am closing this issue; feel free to re-open it if you have any other questions.
