Grid Search stuck when parallelTrialCount < maxTrialCount


/kind bug

What steps did you take and what happened:

I created an experiment to perform a grid search, set maxTrialCount equal to the size of the grid, and set parallelTrialCount < maxTrialCount.

The first set of parallel trials completes successfully, but at some point after that the experiment gets stuck. The suggestion shows requested > assigned and has the following warning:

Warning  ReconcileError  4s (x12 over 31s)  suggestion-controller  The response contains unexpected trials

What did you expect to happen: Experiment to execute trials for each point in the parameter space and then complete.

Anything else you would like to add:

I am running this image for the katib-controller: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:latest (I also tested with gcr.io/kubeflow-images-public/katib/v1alpha3/katib-controller:v0.8.0)

The issue seems likely to be related to https://github.com/kubeflow/katib/issues/1494

If I set parallelTrialCount >= maxTrialCount (enabling all runs to execute immediately) the experiment succeeds.
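
The workaround described above can be sketched as a small helper that raises parallelTrialCount to maxTrialCount before the experiment is submitted. This is only an illustration: the function name apply_parallel_workaround is hypothetical, the spec is modelled as a plain dict, and running every trial at once assumes the cluster has capacity for it.

```python
import copy


def apply_parallel_workaround(spec: dict) -> dict:
    """Return a copy of an Experiment spec with parallelTrialCount
    raised to maxTrialCount (hypothetical helper, not part of Katib)."""
    patched = copy.deepcopy(spec)
    max_trials = patched.get("maxTrialCount")
    # Only touch the spec when it is in the configuration that got stuck.
    if max_trials is not None and patched.get("parallelTrialCount", 0) < max_trials:
        patched["parallelTrialCount"] = max_trials
    return patched


# Values taken from the experiment.yaml in this report.
spec = {"parallelTrialCount": 5, "maxTrialCount": 15, "maxFailedTrialCount": 3}
patched = apply_parallel_workaround(spec)
print(patched["parallelTrialCount"])
```

This sidesteps the problem rather than fixing it, since the suggestion service is then asked for all assignments in a single request.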

Environment:

  • Kubeflow version (kfctl version): v1.0.1
  • Minikube version (minikube version): N/A
  • Kubernetes version: (use kubectl version): v1.18.17-gke.100
  • OS (e.g. from /etc/os-release):

experiment.yaml:

apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  namespace: sid
  labels:
    controller-tools.k8s.io: "1.0"
  name: minimal-grid
spec:
  objective:
    type: maximize
    goal: 999999
    objectiveMetricName: value
  resumePolicy: Never
  algorithm:
    algorithmName: grid
  parallelTrialCount: 5
  maxTrialCount: 15
  maxFailedTrialCount: 3
  parameters:
    - name: --param_1
      parameterType: int
      feasibleSpace:
        min: "0"
        max: "15"
  trialTemplate:
    goTemplate:
      rawTemplate: |-
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: {{.Trial}}
          namespace: {{.NameSpace}}
        spec:
          template:
            spec:
              containers:
              - name: {{.Trial}}
                image: <IMAGE>
                command:
                - "python3"
                - "return_param_value.py"
                {{- with .HyperParameters}}
                {{- range .}}
                - "{{.Name}}={{.Value}}"
                {{- end}}
                {{- end}}
              restartPolicy: Never
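
As a sanity check on the spec above, the size of the grid can be compared against maxTrialCount. The sketch below assumes that the int feasibleSpace treats both min and max as inclusive and uses a step of 1; whether the grid algorithm in this Katib version counts the bounds that way is an assumption, and int_grid_size is a hypothetical helper.

```python
def int_grid_size(min_value: str, max_value: str, step: int = 1) -> int:
    """Number of points in an integer range, assuming inclusive bounds."""
    low, high = int(min_value), int(max_value)
    if high < low:
        raise ValueError("max must not be smaller than min")
    return (high - low) // step + 1


# Values copied from the experiment.yaml above.
feasible_space = {"min": "0", "max": "15"}
max_trial_count = 15

grid_points = int_grid_size(feasible_space["min"], feasible_space["max"])
print(f"grid points: {grid_points}, maxTrialCount: {max_trial_count}")
```

Under the inclusive assumption the grid has one more point than maxTrialCount; if the upper bound is exclusive, the two match.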

Result of kubectl describe suggestion <suggestion-name>:

Name:         minimal-grid
Namespace:    sid
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha3
Kind:         Suggestion
Metadata:
  Creation Timestamp:  2021-05-18T05:58:39Z
  Generation:          9
  Managed Fields:
    API Version:  kubeflow.org/v1alpha3
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:labels:
          .:
          f:controller-tools.k8s.io:
        f:ownerReferences:
      f:spec:
        .:
        f:algorithmName:
        f:requests:
      f:status:
        .:
        f:conditions:
        f:startTime:
        f:suggestionCount:
        f:suggestions:
    Manager:    katib-controller
    Operation:  Update
    Time:       2021-05-18T05:58:56Z
  Owner References:
    API Version:           kubeflow.org/v1alpha3
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Experiment
    Name:                  minimal-grid
    UID:                   90e538b2-5858-45d2-af0e-38c0837d6a8d
  Resource Version:        69731263
  Self Link:               /apis/kubeflow.org/v1alpha3/namespaces/sid/suggestions/minimal-grid
  UID:                     3875d508-b3ee-4426-88a7-1b307c381249
Spec:
  Algorithm Name:  grid
  Requests:        13
Status:
  Conditions:
    Last Transition Time:  2021-05-18T05:58:39Z
    Last Update Time:      2021-05-18T05:58:39Z
    Message:               Suggestion is created
    Reason:                SuggestionCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2021-05-18T05:58:54Z
    Last Update Time:      2021-05-18T05:58:54Z
    Message:               Deployment is ready
    Reason:                DeploymentReady
    Status:                True
    Type:                  DeploymentReady
    Last Transition Time:  2021-05-18T05:58:55Z
    Last Update Time:      2021-05-18T05:58:55Z
    Message:               Suggestion is running
    Reason:                SuggestionRunning
    Status:                True
    Type:                  Running
  Start Time:              2021-05-18T05:58:39Z
  Suggestion Count:        8
  Suggestions:
    Name:  minimal-grid-dq75pd5j
    Parameter Assignments:
      Name:   --param_1
      Value:  0
    Name:     minimal-grid-jsl78k9b
    Parameter Assignments:
      Name:   --param_1
      Value:  1
    Name:     minimal-grid-nbhlss7f
    Parameter Assignments:
      Name:   --param_1
      Value:  2
    Name:     minimal-grid-p87q65dk
    Parameter Assignments:
      Name:   --param_1
      Value:  3
    Name:     minimal-grid-dt2s5jd5
    Parameter Assignments:
      Name:   --param_1
      Value:  4
    Name:     minimal-grid-jxtrdnk9
    Parameter Assignments:
      Name:   --param_1
      Value:  6
    Name:     minimal-grid-l2t2gpc8
    Parameter Assignments:
      Name:   --param_1
      Value:  9
    Name:     minimal-grid-4dpdlj2v
    Parameter Assignments:
      Name:   --param_1
      Value:  13
Events:
  Type     Reason          Age               From                   Message
  ----     ------          ----              ----                   -------
  Warning  ReconcileError  0s (x8 over 15s)  suggestion-controller  The response contains unexpected trials
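
To make the gap between requested and assigned trials explicit, the numbers from the describe output above can be tallied by hand. This is a rough sketch: the values are copied from the output, and the candidate grid of 0 through 15 inclusive is an assumption about how the feasible space is enumerated.

```python
# Copied from the `kubectl describe suggestion` output above.
requests = 13
assigned_values = [0, 1, 2, 3, 4, 6, 9, 13]

suggestion_count = len(assigned_values)
pending = requests - suggestion_count  # requested but never assigned

# Assumption: the grid covers every integer from min to max inclusive.
candidate_grid = range(0, 16)
assigned = set(assigned_values)
unassigned = [value for value in candidate_grid if value not in assigned]

print(f"requested={requests} assigned={suggestion_count} pending={pending}")
print(f"grid values without a trial: {unassigned}")
```

The first five assignments are consecutive, while the later ones skip values, which fits the report that the first parallel batch completes and the experiment gets stuck afterwards.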

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 12 (8 by maintainers)

Top GitHub Comments

johnugeorge commented, May 19, 2021 (1 reaction)

I agree. This is a bug which leaks out certain trials.

andreyvelich commented, Jul 29, 2021 (0 reactions)

/area release
